[CLASSLA] Our recent activities on speech

Wed Oct 19 16:20:30 CEST 2022

 		CLASSLA Mailing List

Dear all, 

We would like to share our recent results on speech processing, which
we announced as one of our focus areas for 2022. 

We released two speech datasets. The first is the Croatian
ParlaSpeech-HR dataset [1]: 1,816 hours of recordings with accompanying
transcriptions and speaker metadata, based on the ParlaMint corpus [2]
of Croatian parliamentary proceedings. The second is the Serbian
JuzneVesti-SR dataset [3], "only" 50 hours in size. It consists of audio
recordings and transcripts from the Južne Vesti website and its show
15 minuta [4], again with speaker metadata. Alongside each dataset we
also released automatic speech recognition (ASR) models on HuggingFace:
four Croatian ASR models [5] for the ParlaSpeech-HR dataset, with an
excellent (but in-domain) word error rate of only 4%, and, for now, one
Serbian ASR model [6] for the JuzneVesti-SR dataset. You are more than
welcome to use any of the models or data (all are available under
CC-BY-SA). Interestingly, our speech-related efforts were very quickly
picked up by industry as well: a recent blog post [7] features our
speech and text technologies. 
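
For readers less familiar with the metric, word error rate (WER) is the
word-level edit distance between the reference transcript and the ASR
output, divided by the number of reference words. The sketch below is
only an illustration of that definition, not the evaluation code used
for the released models; the function name and the example sentences
are made up for this purpose.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Illustrative sketch only: no text normalisation (casing, punctuation)
    is applied, which a real evaluation would normally do first.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing in output
            insertion = curr[j - 1] + 1  # extra word in output
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # Made-up example: one wrong word in a 25-word reference.
    reference = " ".join(f"w{k}" for k in range(25))
    hypothesis = reference.replace("w7 ", "x7 ")
    print(f"WER: {word_error_rate(reference, hypothesis):.2%}")
```

Under this definition, a 4% WER corresponds to roughly one word error
per 25 reference words.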

We also published two papers: one on the overall approach to building
the ParlaSpeech-HR dataset [8], the other on a benchmark for speaker
profiling built on the ParlaSpeech-HR dataset [9]. 

Having recently secured funding for further research on spoken data, in
the coming years we will be working on many very interesting
speech-related tasks, including: 

 	* word-level clustering of types of pronunciation and extraction of
prototypical pronunciations
 	* linguistic processing of transcripts of spoken data, potentially
informed by the speech signal itself
 	* disfluency identification and classification
 	* dialogue act classification
 	* identifying ways to build large and cheap spoken corpora of South
Slavic languages

Please do get in touch if you are interested in, or already working on,
speech. We also invite similar e-mails outlining planned activities
from other groups. We need coordination between the different efforts,
something we discussed at great length in our recently published book
chapter [10]. 

Best regards, 

Nikola and Taja 

 		CLASSLA: The Knowledge Centre for South Slavic Languages [11]

CLARIN.SI [12] 

Jožef Stefan Institute 

Jamova cesta 39, Ljubljana
Slovenia 

Links:
------
[1] http://hdl.handle.net/11356/1494
[2] http://hdl.handle.net/11356/1432
[3] http://hdl.handle.net/11356/1679
[4] https://www.juznevesti.com/Tagovi/Intervju-15-minuta.sr.html
[5] https://huggingface.co/models?search=parlaspeech
[6] https://huggingface.co/classla/wav2vec2-xls-r-juznevesti-sr
[7] https://www.neos.hr/neos-blog-can-ai-understand-croatian-parliment-asr-model/
[8] http://www.lrec-conf.org/proceedings/lrec2022/workshops/ParlaCLARINIII/pdf/2022.parlaclariniii-1.16.pdf
[9] https://nl.ijs.si/jtdh22/pdf/JTDH2022_Ljubesic-et-al_The-ParlaSpeech-HR-benchmark-for-speaker-profiling-in-Croatian.pdf
[10] https://www.degruyter.com/document/doi/10.1515/9783110767377-017/html
[11] https://www.clarin.si/info/k-centre/
[12] http://clarin.si/