[CLASSLA] CLASSLA Annual Recap: 2025 in Review
Taja Kuzman Pungeršek
taja.kuzman at ijs.si
Mon Dec 29 14:43:42 CET 2025
CLASSLA Mailing List
Dear all,
As we wrap up another eventful year, we would like to share an overview
of the key developments and activities at the CLASSLA Knowledge Centre
for South Slavic Languages during 2025.
*CLASSLA-web corpora for South Slavic languages*
We are excited to announce that we have released the second version of
the CLASSLA-web corpora, comprising texts that were collected from the
web in 2024. You can already query the new corpora on the CLARIN.SI
concordancer (Bosnian
<https://www.clarin.si/ske/#dashboard?corpname=classlaweb2_bs>,
Bulgarian
<https://www.clarin.si/ske/#dashboard?corpname=classlaweb2_bg>, Croatian
<https://www.clarin.si/ske/#dashboard?corpname=classlaweb2_hr>,
Macedonian
<https://www.clarin.si/ske/#dashboard?corpname=classlaweb2_mk>,
Montenegrin
<https://www.clarin.si/ske/#dashboard?corpname=classlaweb2_cnr>, Serbian
<https://www.clarin.si/ske/#dashboard?corpname=classlaweb2_sr>, and
Slovenian <https://www.clarin.si/ske/#dashboard?corpname=classlaweb2_sl>
corpora) or find more information about both versions 1.0 and 2.0 of the
CLASSLA-web corpora on the new website:
https://clarinsi.github.io/classla-web/
Although collected from the same national domains as version 1.0 from
2021 and 2022, the new release is substantially larger and contains
mostly new material: around 50% more texts and words, totalling 38
million texts and 17 billion words across seven South Slavic languages.
The corpora are linguistically annotated with an improved CLASSLA-Stanza
<https://zenodo.org/records/13936406> tool (available as a service here
<https://clarin.si/oznacevalnik/eng>) and a multilingual genre
classifier X-GENRE
<https://huggingface.co/classla/xlm-roberta-base-multilingual-text-genre-classifier>.
In addition, version 2.0 now also includes topic labels based on our
multilingual news topic classifier
<https://huggingface.co/classla/multilingual-IPTC-news-topic-classifier>. Soon,
the corpora will also be available in the CLARIN.SI repository in JSONL
and linguistically annotated VERT formats.
*CLASSLA-Express workshop series*
Our CLASSLA-Express workshop
<https://www.clarin.si/info/k-centre/workshops/classla-express/>
programme expanded both in content and geography. This year, seven
workshops were held across four countries – Austria, Bulgaria, Croatia,
and Slovenia – led primarily by Ivana Filipović Petrović and Jelena
Parizoska, with contributions from Petya Osenova and local organizers.
In addition to demonstrating the use of CLARIN.SI concordancers and the
CLASSLA-web corpora, the workshops introduced new topics with a strong
focus on applying modern AI methods in linguistic research. We are
delighted by the continued interest and encourage you to explore the
_detailed workshop reports
<https://www.clarin.si/info/k-centre/workshops/>_ available on our
website. You are warmly invited to stay tuned: CLASSLA-Express 3.0, with
a new focus on spoken corpora, is already on the horizon.
*Benchmarking large language models for South Slavic languages and dialects*
Evaluation of large language models (LLMs) continued to be one of our
key activities. This year, we participated in the development of multiple
South Slavic benchmarks for LLM evaluation, including the Global-PIQA
<https://arxiv.org/abs/2510.24081> test set, a multilingual commonsense
reasoning benchmark developed by 335 co-authors and covering 116
languages and dialects, including standard South Slavic languages as
well as the Torlak, Chakavian, and Slovenian Cerkno dialects.
In parallel, we launched an interactive platform presenting evaluation
results for South Slavic languages and dialects across six tasks
<https://llm-benchmarks-classla.streamlit.app/>: two commonsense
reasoning benchmark families (COPA and PIQA), sentiment classification,
news topic classification, and automatic genre identification. The
platform enables researchers and developers to compare large language
model performance, identify strengths and weaknesses, and follow
developments over time. To support further experimentation and
application, we provide an _accompanying paper with an overview of
current model performance <https://arxiv.org/abs/2511.07989>_ as well as
open-source code
<https://github.com/TajaKuzman/Benchmarking-Text-Classification-on-South-Slavic>
for running evaluations and adapting LLMs to new tasks. We are excited
to continue our benchmarking activities as part of the LLM4DH
<https://www.cjvt.si/llm4dh/en/> and LLMs4EU
<https://alt-edic.eu/projects/llms4eu/> projects, which will extend over
the next few years.
*Speech corpora and technologies*
Our efforts in speech resources advanced significantly this year, with a
major focus on expanding and enriching parliamentary speech corpora. A
key achievement was the release of ParlaSpeech 3.0
<https://clarinsi.github.io/parlaspeech/>, a multilingual collection
covering Croatian, Serbian, Czech, and Polish parliamentary proceedings.
In the new release, ParlaSpeech has been extended with five annotation
layers: linguistic annotation, sentiment labels, filled-pause detection,
precise word-level alignments, and primary stress information. These
enrichment layers have been added automatically with cutting-edge models
for processing speech and text, most of which can be found on the
CLASSLA Hugging Face page <https://huggingface.co/classla>. The
enrichments enable advanced studies of prosody, disfluency patterns, and
multimodal aspects of parliamentary speech. In addition to the CLARIN.SI
repository <http://hdl.handle.net/11356/1833>, the corpora are now
accessible through the CLARIN.SI concordancers (Croatian
<https://www.clarin.si/ske/#concordance?corpname=parlaspeech3_hr>,
Serbian
<https://www.clarin.si/ske/#concordance?corpname=parlaspeech3_rs>, Czech
<https://www.clarin.si/ske/#concordance?corpname=parlaspeech3_cz> and
Polish
<https://www.clarin.si/ske/#concordance?corpname=parlaspeech3_pl>),
accompanied by a tutorial on how to query them
<https://clarinsi.github.io/parlaspeech/concordancer/concordancer-guide.html>.
*Supporting SSH researchers in working with large language models*
As part of the newly established LLMs4SSH
<https://llms4ssh.clarin-pl.eu/> CLARIN Knowledge Centre, we contributed
expertise to help researchers in the social sciences and humanities
navigate the rapidly evolving landscape of large language models. Our
contributions included an overview of Slovenian activities,
technologies, and datasets related to LLM development
<https://www.clarin.si/info/k-centres/llms4ssh-clarin-k-centre-for-large-language-models-in-ssh/>;
a proposal for a new taxonomy for LLM evaluation datasets
<https://arxiv.org/abs/2510.24450>; and a concept for a European
database offering a clear map of available resources by language and
evaluation task.
*Models and datasets on Hugging Face*
This year, numerous new models and datasets were released on the CLASSLA
Hugging Face page <https://huggingface.co/classla>, including the first
openly available multilingual IPTC news topic classifier
<https://huggingface.co/classla/multilingual-IPTC-news-topic-classifier>,
which has already surpassed 600,000 downloads. We are thrilled to see
such strong uptake and will continue expanding the collection of openly
accessible tools and corpora for South Slavic languages and beyond.
*Looking ahead*
As we reflect on this year’s achievements, we extend our sincere thanks
to all team members and collaborators who have contributed to our
activities, and to the users who have taken up our resources. Your
engagement and feedback drive our continued commitment to supporting
linguistic research and technology development for South Slavic languages.
We look forward to another productive year filled with exciting advances
and new collaborations. Wishing you a successful and inspiring year ahead!
Best wishes,
Nikola, Taja, and many other CLASSLAers
CLASSLA: The Knowledge Centre for South Slavic Languages
<https://www.clarin.si/info/k-centre/>
CLARIN.SI <http://clarin.si/>
Jožef Stefan Institute
Jamova cesta 39, Ljubljana
Slovenia