[CLASSLA] Mići Princ meets the Whisper ASR model

Mon Apr 15 08:50:53 CEST 2024

CLASSLA Mailing List


Dear all,

As you may have noticed, we have recently extended our efforts to 
provide language resources and technologies from the standard South 
Slavic languages to South Slavic dialects as well. You may have heard 
about the COPA datasets in the Cerkno, Torlak and Chakavian dialects, 
which are the stars of the DIALECT-COPA shared task at the VarDial 2024 
workshop 
<https://sites.google.com/view/vardial-2024/shared-tasks/dialect-copa> 
in Mexico City. Now we are pleased to announce the first speech 
technology resources for the Chakavian micro-dialects of Croatian: the 
Mići Princ dataset <http://hdl.handle.net/11356/1765> and an automatic 
speech recognition model for Chakavian 
<https://huggingface.co/classla/whisper-large-v3-mici-princ>, both 
openly available.

The Mići Princ dataset <http://hdl.handle.net/11356/1765> is a "text and 
speech" dialectal translation of Antoine de Saint-Exupéry's "Le Petit 
Prince" (The Little Prince) into various Chakavian micro-dialects, 
released by Udruga Calculus and the Peek&Poke museum both as a printed 
book 
<https://www.peekpoke.hr/mici-princ-an-edition-of-the-little-prince-in-the-chakavian-dialect-book-presentation/> 
and as an audio book 
<https://www.peekpoke.hr/mici-princ-the-little-prince-in-the-chakavian-dialect-audio-book-presentation-and-exhibition/>. 
Almost every character in the book was translated into, and narrated 
in, a different micro-dialect. We would like to thank once again the 
large team of translators and audio book narrators behind this work, 
especially the main translator, Tea Perinčić.

Following the creation of the Mići Princ dataset, our colleagues Peter 
Rupnik and Nikola Ljubešić aligned the text and speech to develop the 
first openly available dataset for Chakavian automatic speech 
recognition (ASR). The dataset is published in the CLARIN.SI repository 
<http://hdl.handle.net/11356/1765> as well as on Hugging Face 
<https://huggingface.co/datasets/classla/Mici_Princ>, where you can 
also listen to it.

Moreover, we are pleased to introduce a first outcome derived from this 
dataset: Whisper-large-v3-mici-princ 
<https://huggingface.co/classla/whisper-large-v3-mici-princ>, an 
automatic speech recognition model for Chakavian. By fine-tuning 
OpenAI's Whisper model on the Mići Princ dataset, we reduced the 
character error rate by 66%. You are welcome to try it out on Hugging 
Face <https://huggingface.co/classla/whisper-large-v3-mici-princ>!

Best regards,
The CLASSLA team
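
P.S. For readers less familiar with the metric: character error rate 
(CER) is usually computed as the character-level edit distance between 
a reference transcript and the model output, divided by the length of 
the reference. The snippet below is only a minimal sketch of that 
calculation. It assumes the 66% figure is a relative reduction 
(baseline CER versus fine-tuned CER), which the announcement does not 
spell out. The function names and the short English strings are 
illustrative stand-ins, not our evaluation code or data from the 
dataset.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance between two strings, counted in characters."""
    prev = list(range(len(hyp) + 1))  # distances from the empty reference prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]  # deleting the first i reference characters
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def cer(ref: str, hyp: str) -> float:
    """Character error rate: edit distance normalised by reference length."""
    if not ref:
        raise ValueError("reference transcript must not be empty")
    return edit_distance(ref, hyp) / len(ref)


def relative_reduction(baseline: float, improved: float) -> float:
    """Relative error reduction, e.g. 0.5 means the error rate was halved."""
    if baseline <= 0:
        raise ValueError("baseline error rate must be positive")
    return (baseline - improved) / baseline


# Toy strings (hypothetical, NOT taken from the Mići Princ dataset).
reference = "the little prince"
baseline_output = "the litle prins"     # stand-in for the original model
finetuned_output = "the little prinse"  # stand-in for the fine-tuned model

baseline_cer = cer(reference, baseline_output)
finetuned_cer = cer(reference, finetuned_output)
reduction = relative_reduction(baseline_cer, finetuned_cer)

print(f"baseline CER:       {baseline_cer:.3f}")
print(f"fine-tuned CER:     {finetuned_cer:.3f}")
print(f"relative reduction: {reduction:.1%}")
```

In practice the same calculation would be run over all test-set 
recordings, typically by summing the edit distances and the reference 
lengths before dividing.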

CLASSLA: The Knowledge Centre for South Slavic Languages 
<https://www.clarin.si/info/k-centre/>

CLARIN.SI <http://clarin.si/>

Jožef Stefan Institute

Jamova cesta 39, Ljubljana
Slovenia
