[CLASSLA] CLASSLA Annual Recap: 2024 in Review

Wed Dec 18 13:55:17 CET 2024

CLASSLA Mailing List


Dear all,

As the year comes to a close, we would like to share a brief summary of 
the main activities and progress made at the CLASSLA Knowledge Centre 
for South Slavic Languages during 2024.

CLASSLA web corpora for South Slavic languages

This year, we set up a crawling infrastructure for the (bi)annual 
collection of web corpora for South Slavic languages – the CLASSLA-web 
corpora collection <https://aclanthology.org/2024.lrec-main.291/>. The 
first version of the corpora, CLASSLA-web 1.0, comprising 11 billion
words in 7 languages, was added to the CLARIN.SI concordancers
<https://www.clarin.si/ske/#open> in 2023 and released on the CLARIN.SI
repository this year
<https://www.clarin.si/repository/xmlui/discover?query=%22CLASSLA-web%22&submit=Search&filtertype_1=title&filter_relational_operator_1=contains&filter_1=%22CLASSLA-web%22&filtertype_2=title&filter_relational_operator_2=contains&filter_2=&query=&rpp=10&sort_by=dc.date.issued_dt&order=desc>. 
The web corpora are linguistically annotated with an improved
CLASSLA-Stanza <https://zenodo.org/records/13936406> tool for linguistic
annotation of South Slavic languages (available as a service here
<https://clarin.si/oznacevalnik/eng>) and with the multilingual genre
classifier X-GENRE
<https://huggingface.co/classla/xlm-roberta-base-multilingual-text-genre-classifier>. 
Owing to their large size and recency, the CLASSLA-web corpora have
already proven very useful for the development of large language
models for South Slavic languages, and were included in the training
datasets for the GaMS <https://huggingface.co/cjvt/GaMS-1B> (Generative
Model for Slovene) model and the YugoGPT
<https://huggingface.co/gordicaleksa/YugoGPT> model for Bosnian,
Croatian, and Serbian. The next version of the CLASSLA-web corpora has
already been collected, and the release is planned for 2025.

CLASSLA-Express workshop series

In collaboration with Ivana Filipović Petrović, Jelena Parizoska and 
Petya Osenova, we organized seven CLASSLA-Express workshops
<https://www.clarin.si/info/k-centre/workshops/classla-express/> in five
South Slavic countries, attended by over 120 participants. The workshops 
focused on introducing concordancers, CLASSLA-web corpora, and CLARIN.SI 
services to linguists, lexicographers, language teachers, digital 
humanities scholars, and students. Feedback was extremely positive, and 
we are planning additional workshops for 2025, with sessions to be held
in Bulgaria, Croatia, and Slovenia, and with plans to expand beyond the
South Slavic region to locations such as Austria. The workshops will
also feature new topics, including the application of large language 
models in corpus linguistics and lexicography 
<https://www.clarin.si/info/k-centre/workshops/#September_2024_Round_Table_on_the_Usage_of_Large_Language_Models_in_Corpus-Linguistic_Research>. 
Stay tuned for more details about the upcoming workshops!

Benchmarking LLMs for South Slavic languages and dialects

The rapid advances in large language models have also reached South
Slavic languages, and evaluating the models' capabilities has become
crucial for understanding their strengths and limitations for our
languages, and for guiding future development in both academic and
applied settings. To this end, we benchmarked large language models for
South Slavic languages and dialects, including the Torlak, Chakavian,
and Cerkno dialects <http://hdl.handle.net/11356/1766>, on the task of
commonsense reasoning. The results
<https://aclanthology.org/2024.vardial-1.18/> showed that GPT models
handle South Slavic languages impressively well, demonstrating not only
strong performance but also an ability to adapt to dialects.
Remarkably, these models achieved high accuracy on the target dialects
when provided with only a handful of examples. We are excited to
continue our benchmarking activities as part of the LLM4DH and LLMs4EU
<https://alt-edic.eu/projects/llms4eu/> projects, which will extend over
the next few years.

Speech technologies

We continued dipping our toes into the world of speech technology. Our 
efforts included the development of an automatic speech recognition
(ASR) system tailored to the Chakavian dialect
<https://huggingface.co/classla/whisper-large-v3-mici-princ>, based on
the Mići Princ dataset
<https://huggingface.co/datasets/classla/Mici_Princ>.
We also worked on the Mezzanine
<https://mezzanine.um.si/en/mezzanine-english/>, ParlaSpeech
<https://arxiv.org/abs/2409.15397> and Mak na konac
<https://zenodo.org/records/13936420> projects, which focus on
developing spoken corpora and benchmarking speech technologies for
Slovenian, Croatian and Serbian. In addition to developing various
speech technologies, such as the classifier for filled pauses in speech
(eem) <https://huggingface.co/classla/wav2vecbert2-filledPause> that
works splendidly for a number of South Slavic languages, we started
building the CLASSLA infrastructure for speech research by also
publishing the ParlaSpeech corpora on the concordancers
<https://www.clarin.si/ske/#dashboard?corpname=parlaspeech_hr>. We are
currently working on further enriching these corpora with disfluency 
information, primary stress position, and boundaries of prosodic units.

Sharing knowledge on language resources for South Slavic languages

As a knowledge centre, one of our core activities is sharing valuable 
information and supporting users in their work with language resources 
and technologies. Over the past year, we have responded to numerous 
helpdesk inquiries regarding access to resources and their use. In 
addition to providing direct support, we also maintain informative 
materials to help users navigate available resources – the CLASSLA FAQs 
for Slovenian <https://www.clarin.si/info/k-centre/faq4slovene/>, 
Croatian <https://www.clarin.si/info/k-centre/faq4croatian/>, Serbian 
<https://www.clarin.si/info/k-centre/faq4serbian/>, Bulgarian 
<https://www.clarin.si/info/k-centre/faq4bulgarian/>, and Macedonian 
<https://www.clarin.si/info/k-centre/faq4macedonian/>. Furthermore, we 
released a new overview of Slovenian language technologies 
<https://github.com/clarinsi/Slovenian-Language-Technologies-Overview/tree/main>, 
summarizing the state-of-the-art language technologies for Slovenian.

Monitoring the usage of language resources

We also actively supported our parent organization, CLARIN.SI 
<https://www.clarin.si/info/about/>, by monitoring the usage of freely 
accessible language resources and concordancers provided by the 
CLARIN.SI infrastructure. This allowed us to gain valuable insights into 
which datasets, technologies, and corpora are used the most. We were 
pleased to discover significant usage from outside Slovenia, with users 
frequently querying corpora in over 18 different languages. We invite 
you to watch a one-minute video
<https://www.clarin.si/info/end-of-year-review-clarin-si-in-2024/> presenting
key statistics, including the number of visits, most popular resources, 
and a closer look at concordancer usage.

We are also very happy with the uptake of our Hugging Face page
<https://huggingface.co/classla>, from which our ParlaSpeech corpora
have been downloaded more than 6,000 times in the last few months. Our
models are also heavily used, with the recently published multilingual
IPTC news topic classifier
<https://huggingface.co/classla/multilingual-IPTC-news-topic-classifier> having
been downloaded almost 13,000 times in the past four months.

We would like to take this opportunity to thank all our collaborators 
for another incredibly productive year and to express our gratitude to 
you for staying engaged with our activities. We look forward to another 
year of exciting developments and continued collaboration. Wishing you 
all a successful and fulfilling year ahead, both professionally and 
personally.

Best wishes,

Nikola Ljubešić, Taja Kuzman and other CLASSLAers


CLASSLA: The Knowledge Centre for South Slavic Languages 
<https://www.clarin.si/info/k-centre/>

CLARIN.SI <http://clarin.si/>

Jožef Stefan Institute

Jamova cesta 39, Ljubljana
Slovenia
