From taja.kuzman at ijs.si Mon Dec 29 14:43:42 2025
From: taja.kuzman at ijs.si (Taja Kuzman Pungeršek)
Date: Mon, 29 Dec 2025 14:43:42 +0100
Subject: [CLASSLA] CLASSLA Annual Recap: 2025 in Review
Message-ID:

CLASSLA Mailing List

Dear all,

As we wrap up another eventful year, we would like to share an overview of the key developments and activities at the CLASSLA Knowledge Centre for South Slavic Languages during 2025.

*CLASSLA-web corpora for South Slavic languages*

We are excited to announce the release of the second version of the CLASSLA-web corpora, comprising texts collected from the web in 2024. You can already query the new corpora on the CLARIN.SI concordancer (Bosnian, Bulgarian, Croatian, Macedonian, Montenegrin, Serbian, and Slovenian corpora) or find more information about both the 1.0 and 2.0 versions of the CLASSLA-web corpora on the new website: https://clarinsi.github.io/classla-web/

Although collected from the same national domains as version 1.0 from 2021 and 2022, the new release is substantially larger and contains mostly new material: around 50% more texts and words, totalling 38 million texts and 17 billion words across seven South Slavic languages. The corpora are linguistically annotated with an improved CLASSLA-Stanza tool (also available as a service) and the multilingual genre classifier X-GENRE. In addition, version 2.0 now includes topic labels based on our multilingual news topic classifier. Soon, the corpora will also be available in the CLARIN.SI repository in JSONL and linguistically annotated VERT formats.

*CLASSLA-Express workshop series*

Our CLASSLA-Express workshop programme expanded both in content and geography. This year, seven workshops were held across four countries (Austria, Bulgaria, Croatia, and Slovenia), led primarily by Ivana Filipović Petrović and Jelena Parizoska, with contributions from Petya Osenova and local organizers.
In addition to demonstrating the use of the CLARIN.SI concordancers and the CLASSLA-web corpora, the workshops introduced new topics with a strong focus on applying modern AI methods in linguistic research. We are delighted by the continued interest and encourage you to explore the detailed workshop reports available on our website. Stay tuned: CLASSLA-Express 3.0, with a new focus on spoken corpora, is already on the horizon.

*Benchmarking large language models for South Slavic languages and dialects*

Evaluation of large language models (LLMs) continued to be one of our key activities. This year, we participated in the development of multiple South Slavic benchmarks for LLM evaluation, including the Global-PIQA test set, a multilingual commonsense reasoning benchmark developed by 335 co-authors and covering 116 languages and dialects, including the standard South Slavic languages as well as the Torlak, Chakavian, and Slovenian Cerkno dialects.

In parallel, we launched an interactive platform presenting evaluation results for South Slavic languages and dialects across six tasks: two commonsense reasoning benchmark families (COPA and PIQA), sentiment classification, news topic classification, and automatic genre identification. The platform enables researchers and developers to compare large language model performance, identify strengths and weaknesses, and follow developments over time. To support further experimentation and application, we provide an accompanying paper with an overview of current model performance, as well as open-source code for running evaluations and adapting LLMs to new tasks. We are excited to continue our benchmarking activities as part of the LLM4DH and LLMs4EU projects, which will extend over the next few years.

*Speech corpora and technologies*

Our efforts in speech resources advanced significantly this year, with a major focus on expanding and enriching parliamentary speech corpora.
A key achievement was the release of ParlaSpeech 3.0, a multilingual collection covering Croatian, Serbian, Czech, and Polish parliamentary proceedings. In the new release, ParlaSpeech has been extended with five annotation layers: linguistic annotation, sentiment labels, filled-pause detection, precise word-level alignments, and primary stress information. These enrichment layers were added automatically with cutting-edge models for processing speech and text, most of which can be found on the CLASSLA Hugging Face page. The enrichments enable advanced studies of prosody, disfluency patterns, and multimodal aspects of parliamentary speech. In addition to the CLARIN.SI repository, the corpora are now accessible through the CLARIN.SI concordancers (Croatian, Serbian, Czech, and Polish), accompanied by a tutorial on how to query them.

*Supporting SSH researchers in working with large language models*

As part of the newly established LLMs4SSH CLARIN Knowledge Centre, we contributed expertise to help researchers in the social sciences and humanities navigate the rapidly evolving landscape of large language models. Our contributions included an overview of Slovenian activities, technologies, and datasets related to LLM development; a proposal for a new taxonomy for LLM evaluation datasets; and a concept for a European database offering a clear map of available resources by language and evaluation task.

*Models and datasets on Hugging Face*

This year, numerous new models and datasets were released on the CLASSLA Hugging Face page, including the first openly available multilingual IPTC news topic classifier, which has already surpassed 600,000 downloads. We are thrilled to see such strong uptake and will continue expanding the collection of openly accessible tools and corpora for South Slavic languages and beyond.

*Looking ahead*

As we reflect on this year's achievements, we extend our sincere thanks to all team members and collaborators who have contributed to our activities, and to the users who have taken up our resources. Your engagement and feedback drive our continued commitment to supporting linguistic research and technology development for South Slavic languages. We look forward to another productive year filled with exciting advances and new collaborations.

Wishing you a successful and inspiring year ahead!

Best wishes,
Nikola, Taja, and many other CLASSLAers

CLASSLA: The Knowledge Centre for South Slavic Languages
CLARIN.SI
Jožef Stefan Institute
Jamova cesta 39, Ljubljana
Slovenia