From taja.kuzman at ijs.si Mon Apr 15 08:50:53 2024
From: taja.kuzman at ijs.si (Taja Kuzman)
Date: Mon, 15 Apr 2024 08:50:53 +0200
Subject: [CLASSLA] Mići Princ meets the Whisper ASR model
Message-ID:

CLASSLA Mailing List

Dear all,

As you might have noticed, we recently extended our efforts to provide language resources and technologies from standard South Slavic languages to South Slavic dialects as well (you might have heard about the COPA datasets in the Cerkno, Torlak and Chakavian dialects, which are the stars of the DIALECT-COPA unshared task at the VarDial 2024 workshop in Mexico City).

Now, we are pleased to announce the first resources for speech technologies for Chakavian micro-dialects of Croatian: the Mići Princ dataset and an automatic speech recognition model for Chakavian, both openly available.

The Mići Princ dataset is a "text and speech" dialectal translation of Antoine de Saint-Exupéry's "Le Petit Prince" (The Little Prince) into various Chakavian micro-dialects, released by the Udruga Calculus and the Peek&Poke museum, both in the form of a printed book and an audio book. Almost every character in the book was translated into and narrated in a different micro-dialect (for which we would like to thank again the large team of translators and audio book narrators behind this, especially the main translator, Tea Perinčić).

Following the creation of the Mići Princ dataset, our colleagues Peter Rupnik and Nikola Ljubešić aligned the text and speech to develop the first openly available dataset for Chakavian automatic speech recognition (ASR). The dataset is published on the CLARIN.SI repository, as well as on Hugging Face, where you can listen to it.

Moreover, we are pleased to introduce an innovative outcome derived from this dataset: Whisper-large-v3-mici-princ, an automatic speech recognition model for Chakavian.
By fine-tuning OpenAI's Whisper model on the Mići Princ dataset, we achieved a character error rate reduction of 66%. You are welcome to try it out on Hugging Face!

Best regards,
The CLASSLA team

CLASSLA: The Knowledge Centre for South Slavic Languages
CLARIN.SI
Jožef Stefan Institute
Jamova cesta 39, Ljubljana
Slovenia