[CLASSLA] CLASSLA web corpora of Croatian, Serbian and Slovenian
Taja Kuzman
taja.kuzman at ijs.si
Fri Jun 23 09:05:31 CEST 2023
CLASSLA Mailing List
Dear all,
We are delighted to announce the release of the pilot versions (v0.1) of
the CLASSLA web corpora for Croatian
<https://www.clarin.si/ske/#dashboard?corpname=classlaweb_hr> (2.3
billion words), Serbian
<https://www.clarin.si/ske/#dashboard?corpname=classlaweb_sr> (2.4
billion words) and Slovenian
<https://www.clarin.si/ske/#dashboard?corpname=classlaweb_sl> (1.9
billion words). The main features of the newly released corpora, aside
from their massive size and recency (crawled in 2022), are their automatic
enrichment with genre information
<https://huggingface.co/classla/xlm-roberta-base-multilingual-text-genre-classifier> and
their linguistic processing with the improved CLASSLA-Stanza annotation
pipeline <https://pypi.org/project/classla/> (the applied version is to be
released soon). The corpora are available for search via the CLARIN.SI
concordancers: Crystal NoSketchEngine <https://www.clarin.si/ske/#open>,
Bonito NoSketchEngine <https://www.clarin.si/noske/> and KonText
<https://www.clarin.si/kontext/corpora/corplist>. The pilot versions of
these corpora are intended to gather valuable user feedback, while the
official release (v1.0) of the three existing corpora, along with web
corpora for Bosnian, Montenegrin, Macedonian, and Bulgarian, is
scheduled for later this year.
We warmly welcome you to explore our corpora. Please reach out to us at
helpdesk.classla at clarin.si <mailto:helpdesk.classla at clarin.si> with any
ideas for improvements. We will do our best to implement them in the
upcoming official release. We also encourage you to share with
us how you plan to use these corpora in your research, as well as any
other use cases you may have in mind.
For ideas on how the corpora can be used in your research,
we invite you to read our blog post on the use of CLASSLA web corpora
via the open CLARIN.SI concordancers
<https://www.clarin.si/info/k-centre/classla-web-bigger-and-better-web-corpora-for-croatian-serbian-and-slovenian-on-clarin-si-concordancers/>.
The step-by-step tutorial covers a wide range of functionalities of the
concordancers, including finding collocations in different genres,
analyzing word statistics, and exploring the use of non-standard words.
This resource is particularly suited for linguists, language teachers
and digital humanists.
Best regards,
Nikola, Taja, and many other CLASSLAers
CLASSLA: The Knowledge Centre for South Slavic Languages
<https://www.clarin.si/info/k-centre/>
CLARIN.SI <http://clarin.si/>
Jožef Stefan Institute
Jamova cesta 39, Ljubljana
Slovenia