From taja.kuzman at ijs.si Wed Dec 18 13:55:17 2024
From: taja.kuzman at ijs.si (Taja Kuzman)
Date: Wed, 18 Dec 2024 13:55:17 +0100
Subject: [CLASSLA] CLASSLA Annual Recap: 2024 in Review
Message-ID: <5003eb3c-be9c-451f-8f2b-40449e7244e7@ijs.si>

Dear all,

As the year comes to a close, we would like to share a brief summary of the main activities and progress made at the CLASSLA Knowledge Centre for South Slavic Languages during 2024.

CLASSLA web corpora for South Slavic languages

This year, we set up a crawling infrastructure for the (bi)annual collection of web corpora for South Slavic languages – the CLASSLA-web corpora collection. The first version of the corpora, CLASSLA-web 1.0, comprising 11 billion words in 7 languages, was added to the CLARIN.SI concordancers in 2023 and released on the CLARIN.SI repository this year. The web corpora are linguistically annotated with an improved version of CLASSLA-Stanza, a tool for linguistic annotation of South Slavic languages (also available as a service), and with the multilingual genre classifier X-GENRE. Owing to their large size and recency, the CLASSLA-web corpora have already proven very useful for the development of large language models for South Slavic languages, and were included in the training datasets for the GaMS (Generative Model for Slovene) model and the YugoGPT model for Bosnian, Croatian, and Serbian. The next version of the CLASSLA web corpora has already been collected, and the release is planned for 2025.

CLASSLA-Express workshop series

In collaboration with Ivana Filipović Petrović, Jelena Parizoska and Petya Osenova, we organized seven CLASSLA-Express workshops in five South Slavic countries, attended by over 120 participants. The workshops focused on introducing concordancers, CLASSLA-web corpora, and CLARIN.SI services to linguists, lexicographers, language teachers, digital humanities scholars, and students.
Feedback was extremely positive, and we are planning additional workshops for 2025, with sessions to be held in Bulgaria, Croatia, and Slovenia, as well as beyond the South Slavic region in locations such as Austria. The workshops will also feature new topics, including the application of large language models in corpus linguistics and lexicography. Stay tuned for more details about the upcoming workshops!

Benchmarking LLMs for South Slavic languages and dialects

The rapid advancements in large language models have also reached South Slavic languages. Evaluating their capabilities has become crucial for understanding the strengths and limitations of these models for our languages, and for guiding future development in both academic and applied settings. To this end, we benchmarked large language models for South Slavic languages and dialects, including the Torlak, Chakavian, and Cerkno dialects, on the task of commonsense reasoning. The results showed that GPT models handle South Slavic languages impressively well, with strong performance and an ability to adapt to dialects. Remarkably, these models achieved high levels of accuracy in the target dialects when provided with only a handful of examples. We are excited to continue our benchmarking activities as part of the LLM4DH and LLMs4EU projects, which will extend over the next few years.

Speech technologies

We continued dipping our toes into the world of speech technology. Our efforts included the development of an automatic speech recognition (ASR) system tailored to the Chakavian dialect, based on the Mići Princ dataset. We also worked on the Mezzanine, ParlaSpeech and Mak na konac projects, which focus on developing spoken corpora and benchmarking speech technologies for Slovenian, Croatian and Serbian.
In addition to developing various speech technologies, such as the classifier for filled pauses in speech (eem) that works splendidly for a series of South Slavic languages, we started building the CLASSLA infrastructure for speech research by also publishing the ParlaSpeech corpora on concordancers. We are currently working on further enriching these corpora with disfluency information, primary stress position, and boundaries of prosodic units.

Sharing knowledge on language resources for South Slavic languages

As a knowledge centre, one of our core activities is sharing valuable information and supporting users in their work with language resources and technologies. Over the past year, we have responded to numerous helpdesk inquiries regarding access to resources and their use. In addition to providing direct support, we also maintain informative materials to help users navigate available resources – the CLASSLA FAQs for Slovenian, Croatian, Serbian, Bulgarian, and Macedonian. Furthermore, we released a new overview of Slovenian language technologies, summarizing the state-of-the-art language technologies for Slovenian.

Monitoring the usage of language resources

We also actively supported our parent organization, CLARIN.SI, by monitoring the usage of freely accessible language resources and concordancers provided by the CLARIN.SI infrastructure. This allowed us to gain valuable insights into which datasets, technologies, and corpora are used the most. We were pleased to discover significant usage from outside Slovenia, with users frequently querying corpora in over 18 different languages. We invite you to watch a brief 1-minute video presenting key statistics, including the number of visits, most popular resources, and a closer look at concordancer usage. We are also very happy with the uptake of our Hugging Face page, from which our ParlaSpeech corpora have been downloaded more than 6,000 times in the last few months.
Our models are also heavily used, with the recently published multilingual IPTC news topic classifier being downloaded almost 13,000 times in the past four months.

We would like to take this opportunity to thank all our collaborators for another incredibly productive year and to express our gratitude to you for staying engaged with our activities. We look forward to another year of exciting developments and continued collaboration.

Wishing you all a successful and fulfilling year ahead, both professionally and personally.

Best wishes,
Nikola Ljubešić, Taja Kuzman and other CLASSLAers

CLASSLA: The Knowledge Centre for South Slavic Languages
CLARIN.SI
Jožef Stefan Institute
Jamova cesta 39, Ljubljana
Slovenia