[CLASSLA] Release of the new South Slavic CLASSLA-web 2.0 corpora

Taja Kuzman Pungeršek taja.kuzman at ijs.si
Mon Mar 2 11:48:09 CET 2026


		

Dear all,

We are happy to announce that we have released the second version of
the South Slavic CLASSLA-web corpora!

The new release represents a substantial update over CLASSLA-web 1.0.
The corpus collection contains approximately 38 million texts and
17 billion words, collected from the web in 2024, and covers the full
South Slavic language group: Bosnian, Bulgarian, Croatian, Macedonian,
Montenegrin, Serbian, and Slovenian. Compared to CLASSLA-web 1.0, the
new web corpora are significantly expanded and largely consist of new
texts. The corpora are linguistically annotated, automatically
classified by genre, and enriched with topic labels.

The web corpus collection is intended for a wide range of uses,
including corpus linguistics, lexicography, and other linguistic
research, as well as natural language processing tasks such as
training and evaluating language models and creating genre- or
topic-specific datasets.

A detailed description of the resource can be found in the
accompanying paper (https://doi.org/10.48550/arXiv.2601.11170).

Further information on both CLASSLA-web 1.0 and 2.0, including details
on corpus construction, additional resources, a video describing the
workflow, and citation guidelines, is available on the CLASSLA-web
website: https://clarinsi.github.io/classla-web/

We invite you to browse the corpora via the CLARIN.SI concordancers
(https://www.clarin.si/ske/#open) or download them under a CC0 license
from the CLARIN.SI repository: http://hdl.handle.net/11356/2079

Best wishes,

CLASSLA-web authors: Taja Kuzman Pungeršek, Peter Rupnik, Vít Suchomel
and Nikola Ljubešić, supported by CLARIN.SI and CLASSLA

CLASSLA: The Knowledge Centre for South Slavic Languages 
<https://www.clarin.si/info/k-centre/>

CLARIN.SI <http://clarin.si/>

Jožef Stefan Institute

Jamova cesta 39, Ljubljana
Slovenia


