From taja.kuzman at ijs.si Mon Mar 2 11:48:09 2026
From: taja.kuzman at ijs.si (Taja Kuzman Pungeršek)
Date: Mon, 2 Mar 2026 11:48:09 +0100
Subject: [CLASSLA] Release of the new South Slavic CLASSLA-web 2.0 corpora
Message-ID: CLASSLA Mailing List

Dear all,

We are happy to announce the release of the second version of the South Slavic CLASSLA-web corpora!

The new release is a substantial update over CLASSLA-web 1.0. The corpus collection contains approximately 38 million texts and 17 billion words, collected from the web in 2024, and covers the full South Slavic language group: Bosnian, Bulgarian, Croatian, Macedonian, Montenegrin, Serbian, and Slovenian. Compared to CLASSLA-web 1.0, the new web corpora are significantly expanded and largely consist of new texts. The corpora are linguistically annotated, automatically classified by genre, and enriched with topic labels.

The web corpus collection is intended for a wide range of uses, including corpus linguistics, lexicography, and other linguistic research, as well as natural language processing tasks such as training and evaluating language models and creating genre- or topic-specific datasets.

A detailed description of the resource can be found in the accompanying paper (https://doi.org/10.48550/arXiv.2601.11170).
Further information on both CLASSLA-web 1.0 and 2.0, including details on corpus construction, additional resources, a video describing the workflow, and citation guidelines, is available on the CLASSLA-web website: https://clarinsi.github.io/classla-web/

We invite you to browse the corpora via the CLARIN.SI concordancers (https://www.clarin.si/ske/#open) or download them under a CC0 license from the CLARIN.SI repository: http://hdl.handle.net/11356/2079

Best wishes,

CLASSLA-web authors: Taja Kuzman Pungeršek, Peter Rupnik, Vít Suchomel and Nikola Ljubešić, supported by CLARIN.SI and CLASSLA

CLASSLA: The Knowledge Centre for South Slavic Languages
CLARIN.SI
Jožef Stefan Institute
Jamova cesta 39, Ljubljana
Slovenia