[CLASSLA] Mići Princ meets the Whisper ASR model
Taja Kuzman
taja.kuzman at ijs.si
Mon Apr 15 08:50:53 CEST 2024
CLASSLA Mailing List
Dear all,
As you might have noticed, we have recently extended our efforts to
provide language resources and technologies from standard South Slavic
languages to South Slavic dialects as well (you might have heard about
the COPA datasets in the Cerkno, Torlak and Chakavian dialects, which are
the stars of the DIALECT-COPA shared task at the VarDial 2024 workshop
<https://sites.google.com/view/vardial-2024/shared-tasks/dialect-copa> in
Mexico City). Now, we are pleased to announce the first resources for
speech technologies for Chakavian micro-dialects of Croatian: the Mići
Princ dataset <http://hdl.handle.net/11356/1765> and an automatic speech
recognition model for Chakavian
<https://huggingface.co/classla/whisper-large-v3-mici-princ>, both
openly available.
The Mići Princ dataset <http://hdl.handle.net/11356/1765> is a "text and
speech" dialectal translation of Antoine de Saint-Exupéry's "Le Petit
Prince" (The Little Prince) into various Chakavian micro-dialects,
released by the Udruga Calculus and the Peek&Poke museum, both in the form
of a printed book
<https://www.peekpoke.hr/mici-princ-an-edition-of-the-little-prince-in-the-chakavian-dialect-book-presentation/> and
an audio book
<https://www.peekpoke.hr/mici-princ-the-little-prince-in-the-chakavian-dialect-audio-book-presentation-and-exhibition/>.
Almost every character in the book was translated into and narrated in a
different micro-dialect (for which we would once again like to thank the
large team of translators and audio book narrators behind this work,
especially the main translator, Tea Perinčić).
Following the creation of the Mići Princ dataset, our colleagues Peter
Rupnik and Nikola Ljubešić aligned the text and speech to develop the
first openly available dataset for Chakavian automatic speech
recognition (ASR). The dataset is published on the CLARIN.SI repository
<http://hdl.handle.net/11356/1765>, as well as on Hugging Face, where
you can listen to it <https://huggingface.co/datasets/classla/Mici_Princ>.
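
Aligned text and speech pairs like these make it possible to score an ASR system's output against reference transcripts. A common metric for this is the character error rate (CER): the character-level edit distance between reference and hypothesis, divided by the reference length. The following is a minimal sketch of that computation and of a relative reduction between two CER values. It is not the evaluation code used for this dataset; the function names, the example strings and the numbers are hypothetical and chosen only for illustration.

```python
def levenshtein(ref: str, hyp: str) -> int:
    """Minimum number of character insertions, deletions and
    substitutions needed to turn `ref` into `hyp`."""
    prev = list(range(len(hyp) + 1))  # distances for the empty prefix of ref
    for i, r in enumerate(ref, start=1):
        curr = [i]  # turning ref[:i] into an empty string costs i deletions
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(ref: str, hyp: str) -> float:
    """Character error rate of `hyp` against a non-empty reference."""
    if not ref:
        raise ValueError("reference transcript must not be empty")
    return levenshtein(ref, hyp) / len(ref)


def relative_reduction(baseline: float, improved: float) -> float:
    """Relative reduction of an error rate, as a fraction of the baseline."""
    return (baseline - improved) / baseline


if __name__ == "__main__":
    # Hypothetical reference/hypothesis pair (assumption, not dataset content).
    print(cer("mići princ", "mici princ"))
    # Hypothetical CER values before and after fine-tuning (assumption).
    print(relative_reduction(0.30, 0.10))
```

Under this definition, a relative reduction of about two thirds means that the improved system makes roughly one third as many character errors as the baseline on the same test data.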
Moreover, we are pleased to introduce an innovative outcome derived from
this dataset: Whisper-large-v3-mici-princ
<https://huggingface.co/classla/whisper-large-v3-mici-princ>, an
automatic speech recognition model for Chakavian. By fine-tuning
OpenAI's Whisper model on the Mići Princ dataset, we reduced the
character error rate by 66%. You are welcome to try it out on
Hugging Face
<https://huggingface.co/classla/whisper-large-v3-mici-princ>!

Best regards,
The CLASSLA team

CLASSLA: The Knowledge Centre for South Slavic Languages
<https://www.clarin.si/info/k-centre/>
CLARIN.SI <http://clarin.si/>
Jožef Stefan Institute
Jamova cesta 39, Ljubljana
Slovenia