[CLASSLA] CLASSLA web corpora of Croatian, Serbian and Slovenian

Taja Kuzman taja.kuzman at ijs.si
Fri Jun 23 09:05:31 CEST 2023


		
CLASSLA Mailing List

Dear all,

We are delighted to announce the release of the pilot versions (v0.1) of 
the CLASSLA web corpora for Croatian 
<https://www.clarin.si/ske/#dashboard?corpname=classlaweb_hr> (2.3 
billion words), Serbian 
<https://www.clarin.si/ske/#dashboard?corpname=classlaweb_sr> (2.4 
billion words) and Slovenian 
<https://www.clarin.si/ske/#dashboard?corpname=classlaweb_sl> (1.9 
billion words). The main features of the newly released corpora, aside 
from their massive size and recency (crawled in 2022), are their automatic 
enrichment with genre information 
<https://huggingface.co/classla/xlm-roberta-base-multilingual-text-genre-classifier> and 
their linguistic processing with the improved CLASSLA-Stanza annotation 
pipeline <https://pypi.org/project/classla/> (the applied version will be 
released soon). The corpora are available for search via the CLARIN.SI 
concordancers, Crystal NoSketchEngine <https://www.clarin.si/ske/#open>, 
Bonito NoSketchEngine <https://www.clarin.si/noske/> and KonText 
<https://www.clarin.si/kontext/corpora/corplist>. The pilot versions of 
these corpora are intended to gather valuable user feedback, while the 
official release (v1.0) of the three existing corpora, along with web 
corpora for Bosnian, Montenegrin, Macedonian, and Bulgarian, is 
scheduled for later this year.


We warmly welcome you to explore our corpora. Please reach out to us at 
helpdesk.classla at clarin.si <mailto:helpdesk.classla at clarin.si> with any 
ideas for improvements; we will do our best to implement them in the 
upcoming official release. We also encourage you to share with 
us how you plan to use these corpora in your research, as well as any 
other use cases you may have in mind.


To give you some ideas on how the corpora can be used in your research, 
you are invited to read our blog post on the use of CLASSLA web corpora 
via the open CLARIN.SI concordancers 
<https://www.clarin.si/info/k-centre/classla-web-bigger-and-better-web-corpora-for-croatian-serbian-and-slovenian-on-clarin-si-concordancers/>. 
The step-by-step tutorial covers a wide range of functionalities of the 
concordancers, including finding collocations in different genres, 
analyzing word statistics, and exploring the use of non-standard words. 
This resource is particularly suited for linguists, language teachers 
and digital humanists.

Best regards,

Nikola, Taja, and many other CLASSLAers

CLASSLA: The Knowledge Centre for South Slavic Languages 
<https://www.clarin.si/info/k-centre/>

CLARIN.SI <http://clarin.si/>

Jožef Stefan Institute

Jamova cesta 39, Ljubljana
Slovenia
