[CLASSLA] CLASSLA Annual Recap: 2024 in Review
Taja Kuzman
taja.kuzman at ijs.si
Wed Dec 18 13:55:17 CET 2024
CLASSLA Mailing List
Dear all,
As the year comes to a close, we would like to share a brief summary of
the main activities and progress made at the CLASSLA Knowledge Centre
for South Slavic Languages during 2024.
CLASSLA web corpora for South Slavic languages
This year, we set up a crawling infrastructure for the (bi)annual
collection of web corpora for South Slavic languages – the CLASSLA-web
corpora collection <https://aclanthology.org/2024.lrec-main.291/>. The
first version of the corpora, CLASSLA-web 1.0, comprising 11 billion words
in 7 languages, was added to the CLARIN.SI concordancers
<https://www.clarin.si/ske/#open> in 2023 and released on the CLARIN.SI
repository this year
<https://www.clarin.si/repository/xmlui/discover?query=%22CLASSLA-web%22&submit=Search&filtertype_1=title&filter_relational_operator_1=contains&filter_1=%22CLASSLA-web%22&filtertype_2=title&filter_relational_operator_2=contains&filter_2=&query=&rpp=10&sort_by=dc.date.issued_dt&order=desc>.
The web corpora are linguistically annotated with an improved
CLASSLA-Stanza <https://zenodo.org/records/13936406> tool for linguistic
annotation of South Slavic languages (available as a service here
<https://clarin.si/oznacevalnik/eng>) and a multilingual genre
classifier X-GENRE
<https://huggingface.co/classla/xlm-roberta-base-multilingual-text-genre-classifier>.
Owing to their large size and recency, the CLASSLA-web corpora have
already proven very useful for the development of large language
models for South Slavic languages, and were included in the training
datasets for the GaMS <https://huggingface.co/cjvt/GaMS-1B> (Generative
Model for Slovene) model and the YugoGPT
<https://huggingface.co/gordicaleksa/YugoGPT> model for Bosnian,
Croatian, and Serbian. The next version of the CLASSLA web corpora has
already been collected, and the release is planned for 2025.
CLASSLA-Express workshop series
In collaboration with Ivana Filipović Petrović, Jelena Parizoska and
Petya Osenova, we organized seven CLASSLA-Express workshops
<https://www.clarin.si/info/k-centre/workshops/classla-express/> in five
South Slavic countries, attended by over 120 participants. The workshops
focused on introducing concordancers, CLASSLA-web corpora, and CLARIN.SI
services to linguists, lexicographers, language teachers, digital
humanities scholars, and students. Feedback was extremely positive, and
we are planning additional workshops for 2025, with sessions to be held
in Bulgaria, Croatia, and Slovenia, and with plans to expand beyond the
South Slavic region to locations such as Austria. The workshops will
also feature new topics, including the application of large language
models in corpus linguistics and lexicography
<https://www.clarin.si/info/k-centre/workshops/#September_2024_Round_Table_on_the_Usage_of_Large_Language_Models_in_Corpus-Linguistic_Research>.
Stay tuned for more details about the upcoming workshops!
Benchmarking LLMs for South Slavic languages and dialects
The rapid advancements in large language models have also reached South
Slavic languages, and evaluating their capabilities has become crucial
for understanding the models' strengths and limitations for our
languages, and for guiding future development in both academic and
applied settings. To this end, we benchmarked large language models for
South Slavic languages and dialects, including the Torlak, Chakavian,
and Cerkno dialects <http://hdl.handle.net/11356/1766>, on the task of
commonsense reasoning. The results
<https://aclanthology.org/2024.vardial-1.18/> showed that GPT models
handle South Slavic languages impressively well, combining strong
overall performance with an ability to adapt to dialects. Remarkably,
these models achieved high levels of accuracy in target dialects when
provided with only a handful of examples. We are excited to continue our
benchmarking activities as part of the LLM4DH and LLMs4EU
<https://alt-edic.eu/projects/llms4eu/> projects, which will extend over
the next few years.
Speech technologies
We continued dipping our toes into the world of speech technology. Our
efforts included the development of an automatic speech recognition
(ASR) system tailored to the Chakavian dialect
<https://huggingface.co/classla/whisper-large-v3-mici-princ> based on the
Mići Princ dataset <https://huggingface.co/datasets/classla/Mici_Princ>.
We also worked on the Mezzanine
<https://mezzanine.um.si/en/mezzanine-english/>, ParlaSpeech
<https://arxiv.org/abs/2409.15397> and Mak na konac
<https://zenodo.org/records/13936420> projects, which focus on developing
spoken corpora and benchmarking speech technologies for Slovenian,
Croatian and Serbian. In addition to developing various speech
technologies, such as the classifier for filled pauses in speech (eem)
<https://huggingface.co/classla/wav2vecbert2-filledPause> that works
splendidly for a range of South Slavic languages, we started building
the CLASSLA infrastructure for speech research by also publishing the
ParlaSpeech corpora on the concordancers
<https://www.clarin.si/ske/#dashboard?corpname=parlaspeech_hr>. We are
currently working on further enriching these corpora with disfluency
information, primary stress position, and boundaries of prosodic units.
Sharing knowledge on language resources for South Slavic languages
As a knowledge centre, one of our core activities is sharing valuable
information and supporting users in their work with language resources
and technologies. Over the past year, we have responded to numerous
helpdesk inquiries regarding access to resources and their use. In
addition to providing direct support, we also maintain informative
materials to help users navigate available resources – the CLASSLA FAQs
for Slovenian <https://www.clarin.si/info/k-centre/faq4slovene/>,
Croatian <https://www.clarin.si/info/k-centre/faq4croatian/>, Serbian
<https://www.clarin.si/info/k-centre/faq4serbian/>, Bulgarian
<https://www.clarin.si/info/k-centre/faq4bulgarian/>, and Macedonian
<https://www.clarin.si/info/k-centre/faq4macedonian/>. Furthermore, we
released a new overview of Slovenian language technologies
<https://github.com/clarinsi/Slovenian-Language-Technologies-Overview/tree/main>,
summarizing the state-of-the-art language technologies for Slovenian.
Monitoring the usage of language resources
We also actively supported our parent organization, CLARIN.SI
<https://www.clarin.si/info/about/>, by monitoring the usage of freely
accessible language resources and concordancers provided by the
CLARIN.SI infrastructure. This allowed us to gain valuable insights into
which datasets, technologies, and corpora are used the most. We were
pleased to discover significant usage from outside Slovenia, with users
frequently querying corpora in over 18 different languages. We invite
you to watch a one-minute video
<https://www.clarin.si/info/end-of-year-review-clarin-si-in-2024/> presenting
key statistics, including the number of visits, most popular resources,
and a closer look at concordancer usage.
We are also very happy with the uptake of our Hugging Face page
<https://huggingface.co/classla>, from which our ParlaSpeech corpora have
been downloaded more than 6,000 times in the last few months. Our models
are also heavily used, with the recently published multilingual IPTC
news topic classifier
<https://huggingface.co/classla/multilingual-IPTC-news-topic-classifier>
downloaded almost 13,000 times in the past four months.
We would like to take this opportunity to thank all our collaborators
for another incredibly productive year and to express our gratitude to
you for staying engaged with our activities. We look forward to another
year of exciting developments and continued collaboration. Wishing you
all a successful and fulfilling year ahead, both professionally and
personally.
Best wishes,
Nikola Ljubešić, Taja Kuzman and other CLASSLAers
CLASSLA: The Knowledge Centre for South Slavic Languages
<https://www.clarin.si/info/k-centre/>
CLARIN.SI <http://clarin.si/>
Jožef Stefan Institute
Jamova cesta 39, Ljubljana
Slovenia