[CLASSLA] CLASSLA Annual Recap: 2025 in Review

Mon Dec 29 14:43:42 CET 2025

CLASSLA Mailing List

Dear all,

As we wrap up another eventful year, we would like to share an overview 
of the key developments and activities at the CLASSLA Knowledge Centre 
for South Slavic Languages during 2025.

*CLASSLA-web corpora for South Slavic languages*

We are excited to announce that we have released the second version of 
the CLASSLA-web corpora, comprising texts that were collected from the 
web in 2024. You can already query the new corpora on the CLARIN.SI 
concordancer (Bosnian 
<https://www.clarin.si/ske/#dashboard?corpname=classlaweb2_bs>, 
Bulgarian 
<https://www.clarin.si/ske/#dashboard?corpname=classlaweb2_bg>, Croatian 
<https://www.clarin.si/ske/#dashboard?corpname=classlaweb2_hr>, 
Macedonian 
<https://www.clarin.si/ske/#dashboard?corpname=classlaweb2_mk>, 
Montenegrin 
<https://www.clarin.si/ske/#dashboard?corpname=classlaweb2_cnr>, Serbian 
<https://www.clarin.si/ske/#dashboard?corpname=classlaweb2_sr>, and 
Slovenian <https://www.clarin.si/ske/#dashboard?corpname=classlaweb2_sl> 
corpora) or find more information about both the 1.0 and 2.0 versions 
of the CLASSLA-web corpora on a new website: 
https://clarinsi.github.io/classla-web/

Although collected from the same national domains as version 1.0 from 
2021 and 2022, the new release is substantially larger and contains 
mostly new material: around 50% more texts and words, totalling 38 
million texts and 17 billion words across seven South Slavic languages. 
The corpora are linguistically annotated with an improved CLASSLA-Stanza 
<https://zenodo.org/records/13936406> tool (available as a service here 
<https://clarin.si/oznacevalnik/eng>) and a multilingual genre 
classifier X-GENRE 
<https://huggingface.co/classla/xlm-roberta-base-multilingual-text-genre-classifier>. 
In addition, version 2.0 now also includes topic labels based on our 
multilingual news topic classifier 
<https://huggingface.co/classla/multilingual-IPTC-news-topic-classifier>. Soon, 
the corpora will also be available on the CLARIN.SI repository in JSONL 
and linguistically-annotated VERT formats.

*CLASSLA-Express workshop series*

Our CLASSLA-Express workshop 
<https://www.clarin.si/info/k-centre/workshops/classla-express/> 
programme expanded both in content and geography. This year, seven 
workshops were held across four countries – Austria, Bulgaria, Croatia, 
and Slovenia – led primarily by Ivana Filipović Petrović and Jelena 
Parizoska, with contributions from Petya Osenova and local organizers. 
In addition to demonstrating the use of CLARIN.SI concordancers and the 
CLASSLA-web corpora, the workshops introduced new topics with a strong 
focus on applying modern AI methods in linguistic research. We are 
delighted by the continued interest and encourage you to explore the 
_detailed workshop reports 
<https://www.clarin.si/info/k-centre/workshops/>_ available on our 
website. Stay tuned: CLASSLA-Express 3.0, with a new focus on spoken 
corpora, is already on the horizon.

*Benchmarking large language models for South Slavic languages and dialects*

Evaluation of large language models (LLMs) continued to be one of our 
key activities. This year, we participated in the development of multiple 
South Slavic benchmarks for LLM evaluation, including the Global-PIQA 
<https://arxiv.org/abs/2510.24081> test set, a multilingual commonsense 
reasoning benchmark developed by 335 co-authors and covering 116 
languages and dialects, among them the standard South Slavic languages 
as well as the Torlak, Chakavian, and Slovenian Cerkno dialects.

In parallel, we launched an interactive platform presenting evaluation 
results for South Slavic languages and dialects across six tasks 
<https://llm-benchmarks-classla.streamlit.app/>: two commonsense 
reasoning benchmark families (COPA and PIQA), sentiment classification, 
news topic classification, and automatic genre identification. The 
platform enables researchers and developers to compare large language 
model performance, identify strengths and weaknesses, and follow 
developments over time. To support further experimentation and 
application, we provide an _accompanying paper with an overview of 
current model performance <https://arxiv.org/abs/2511.07989>_ as well as 
open-source code 
<https://github.com/TajaKuzman/Benchmarking-Text-Classification-on-South-Slavic> 
for running evaluations and adapting LLMs to new tasks. We are excited 
to continue our benchmarking activities as part of the LLM4DH 
<https://www.cjvt.si/llm4dh/en/> and LLMs4EU 
<https://alt-edic.eu/projects/llms4eu/> projects, which will extend over 
the next few years.

*Speech corpora and technologies*

Our efforts in speech resources advanced significantly this year, with a 
major focus on expanding and enriching parliamentary speech corpora. A 
key achievement was the release of ParlaSpeech 3.0 
<https://clarinsi.github.io/parlaspeech/>, a multilingual collection 
covering Croatian, Serbian, Czech, and Polish parliamentary proceedings. 
In the new release, ParlaSpeech has been extended with five annotation 
layers: linguistic annotation, sentiment labels, filled-pause detection, 
precise word-level alignments, and primary stress information. These 
enrichment layers have been added automatically with cutting-edge models 
for processing speech and text, most of which can be found on the 
CLASSLA Hugging Face page <https://huggingface.co/classla>. The 
enrichments enable advanced studies of prosody, disfluency patterns, and 
multimodal aspects of parliamentary speech. In addition to the CLARIN.SI 
repository <http://hdl.handle.net/11356/1833>, the corpora are now 
accessible through the CLARIN.SI concordancers (Croatian 
<https://www.clarin.si/ske/#concordance?corpname=parlaspeech3_hr>, 
Serbian 
<https://www.clarin.si/ske/#concordance?corpname=parlaspeech3_rs>, Czech 
<https://www.clarin.si/ske/#concordance?corpname=parlaspeech3_cz> and 
Polish 
<https://www.clarin.si/ske/#concordance?corpname=parlaspeech3_pl>), 
accompanied by a tutorial on how to query them 
<https://clarinsi.github.io/parlaspeech/concordancer/concordancer-guide.html>.

*Supporting SSH researchers in working with large language models*

As part of the newly established LLMs4SSH 
<https://llms4ssh.clarin-pl.eu/> CLARIN Knowledge Centre, we contributed 
expertise to help researchers in the social sciences and humanities 
navigate the rapidly evolving landscape of large language models. Our 
contributions included an overview of Slovenian activities, 
technologies, and datasets related to LLM development 
<https://www.clarin.si/info/k-centres/llms4ssh-clarin-k-centre-for-large-language-models-in-ssh/>; 
a proposal for a new taxonomy for LLM evaluation datasets 
<https://arxiv.org/abs/2510.24450>; and a concept for a European 
database offering a clear map of available resources by language and 
evaluation task.

*Models and datasets on Hugging Face*

This year, numerous new models and datasets were released on the CLASSLA 
Hugging Face page <https://huggingface.co/classla>, including the first 
openly available multilingual IPTC news topic classifier 
<https://huggingface.co/classla/multilingual-IPTC-news-topic-classifier>, 
which has already surpassed 600,000 downloads. We are thrilled to see 
such strong uptake and will continue expanding the collection of openly 
accessible tools and corpora for South Slavic languages and beyond.

*Looking ahead*

As we reflect on this year’s achievements, we extend our sincere thanks 
to all team members and collaborators who have contributed to our 
activities, and to the users who have taken up our resources. Your 
engagement and feedback drive our continued commitment to supporting 
linguistic research and technology development for South Slavic languages.

We look forward to another productive year filled with exciting advances 
and new collaborations. Wishing you a successful and inspiring year ahead!

Best wishes,

Nikola, Taja, and many other CLASSLAers

CLASSLA: The Knowledge Centre for South Slavic Languages 
<https://www.clarin.si/info/k-centre/>

CLARIN.SI <http://clarin.si/>

Jožef Stefan Institute

Jamova cesta 39, Ljubljana
Slovenia
