[CLASSLA] We are speaking our mind

Nikola Ljubešić nljubesi at gmail.com
Thu Dec 23 08:31:33 CET 2021


	 	
Hi, for the last time in 2021! 

We are so happy to wrap up this recent surge in CLASSLA reports (there was, and still is, quite a lot to catch up on) with news of the first open speech-to-text system for Croatian, which we have developed in recent weeks. You might have heard that CLASSLA was working lightly, but persistently, on entering the world of speech throughout 2021, and this is the first tangible result of these bootstrapping efforts. Some real bootstrapping was needed, as not a minute of training data was available before we started our journey! And we have only just started, so expect many exciting developments in 2022.

You can check out the system at https://huggingface.co/classla/wav2vec2-xls-r-parlaspeech-hr. Feel free to try out the examples, but also to upload or record your own speech. The system is currently based on (only) 72 hours of transcribed speech from the Croatian parliament, and already achieves a rather low word error rate of 13% and character error rate of 5% (yes, on in-domain test data). The system is an initial proof of concept, but can already be very useful on good-quality speech.
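
For readers less familiar with the two metrics quoted above, here is a minimal sketch of how word error rate (WER) and character error rate (CER) are commonly computed: the edit distance between the reference transcript and the system output, divided by the reference length. This is not the evaluation script behind the reported numbers. The function names and the example sentences are made up for illustration, and the sketch assumes no text normalization and counts spaces as characters for CER.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (words or characters)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        curr = [i]
        for j, h in enumerate(hyp, 1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate: word-level edits divided by reference word count."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character error rate: character-level edits divided by reference length.

    Assumption: spaces count as characters and no normalization is applied.
    """
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)


if __name__ == "__main__":
    # Invented example sentences, not taken from the actual test data.
    reference = "hvala lijepa gospodine predsjedniče"
    hypothesis = "hvala lijepo gospodine predsjedniče"
    print(f"WER: {wer(reference, hypothesis):.2%}")
    print(f"CER: {cer(reference, hypothesis):.2%}")
```

A single wrong letter makes the whole word count as an error for WER but only one character for CER, which is why the CER is typically much lower than the WER.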

We are continuing our efforts to improve and extend the current dataset, already branded as ParlaSpeech-HR, and plan to publish its initial version in early 2022. We furthermore plan to pilot and improve our system on more demanding tasks, such as the transcription of vernaculars. If you have interesting use cases, especially with existing recordings and transcripts for model adaptation, drop us an e-mail!

The presented results are a joint effort of (in order of appearance) Nikola Ljubešić, Ivo-Pavao Jazbec, Vuk Batanović, Lenka Bajčetić, Danijel Korzinek and Peter Rupnik. These results would not have been possible without the wider collaboration around the ParlaMint project, and for that we thank Darja Fišer, Tomaž Erjavec, Maciej Ogrodniczuk and Petya Osenova. Together we really are stronger!

We wish you holidays full of peace and joy and a collaborative 2022!

The CLASSLA team

 
 
 
CLASSLA: The Knowledge Centre for South Slavic Languages <https://www.clarin.si/info/k-centre/>
 
CLARIN.SI <http://clarin.si/>
Jožef Stefan Institute

Jamova cesta 39, Ljubljana
Slovenia
