[CLASSLA] Release of the new South Slavic CLASSLA-web 2.0 corpora
Taja Kuzman Pungeršek
taja.kuzman at ijs.si
Mon Mar 2 11:48:09 CET 2026
CLASSLA Mailing List
Dear all,

We are happy to announce that we have released the second version of
the South Slavic CLASSLA-web corpora!

The new release represents a substantial update over CLASSLA-web 1.0.
The corpus collection contains approximately 38 million texts and 17
billion words, collected from the web in 2024, and covers the full South
Slavic language group: Bosnian, Bulgarian, Croatian, Macedonian,
Montenegrin, Serbian, and Slovenian. Compared to CLASSLA-web 1.0, the new
web corpora are significantly expanded and largely consist of new texts.
The corpora are linguistically annotated, automatically classified by
genre, and enriched with topic labels.

The web corpus collection is intended for a wide range of uses,
including corpus linguistics, lexicography, and other linguistic
research, as well as natural language processing tasks such as training
and evaluating language models and creating genre- or topic-specific
datasets.

A detailed description of the resource can be found in the accompanying
paper (https://doi.org/10.48550/arXiv.2601.11170).

Further information on both CLASSLA-web 1.0 and 2.0, including details
on corpus construction, additional resources, a video describing the
workflow, and citation guidelines, is available on the CLASSLA-web
website: https://clarinsi.github.io/classla-web/

We invite you to browse the corpora via the CLARIN.SI concordancers
(https://www.clarin.si/ske/#open) or download them under a CC0 license
from the CLARIN.SI repository: http://hdl.handle.net/11356/2079

Best wishes,
CLASSLA-web authors: Taja Kuzman Pungeršek, Peter Rupnik, Vít Suchomel
and Nikola Ljubešić, supported by CLARIN.SI and CLASSLA

CLASSLA: The Knowledge Centre for South Slavic Languages
<https://www.clarin.si/info/k-centre/>
CLARIN.SI <http://clarin.si/>
Jožef Stefan Institute
Jamova cesta 39, Ljubljana
Slovenia