[CLASSLA] Christmas came early this year!
Taja Kuzman
taja.kuzman at ijs.si
Wed Dec 6 08:05:17 CET 2023
CLASSLA Mailing List
Dear all,
The CLASSLA Knowledge centre for South Slavic languages
<https://www.clarin.si/info/k-centre/> is delighted to announce the
release of comparable web corpora for all official South Slavic
languages, namely Slovenian
<https://www.clarin.si/ske/#concordance?corpname=classlaweb_sl>,
Croatian
<https://www.clarin.si/ske/#concordance?corpname=classlaweb_hr>, Bosnian
<https://www.clarin.si/ske/#concordance?corpname=classlaweb_bs>,
Montenegrin
<https://www.clarin.si/ske/#concordance?corpname=classlaweb_cnr>,
Serbian <https://www.clarin.si/ske/#concordance?corpname=classlaweb_sr>,
Macedonian
<https://www.clarin.si/ske/#concordance?corpname=classlaweb_mk>, and
Bulgarian
<https://www.clarin.si/ske/#concordance?corpname=classlaweb_bg>.
Together, they total almost 11 billion words! The corpora are freely
available on the CLARIN.SI NoSketch Engine
<https://www.clarin.si/ske/#open> concordancer (see our recent tutorial
on how to easily query the CLASSLA web corpora and perform statistical
analyses via the concordancer
<https://www.clarin.si/info/k-centre/classla-web-bigger-and-better-web-corpora-for-croatian-serbian-and-slovenian-on-clarin-si-concordancers/>).
This collection of corpora is innovative for the following reasons:
* This is, to the best of our knowledge, the first collection of
  comparable web corpora covering a whole language group.
* The collection includes the first general, linguistically annotated
  corpora for two of the seven languages, namely Montenegrin and
  Macedonian.
* The comparability of the corpora is ensured by performing data
  collection and filtering in the same time period with the same
  technologies. Furthermore, the corpora underwent uniform
  linguistic processing via the CLASSLA-Stanza
  <https://pypi.org/project/classla/> toolkit, which you can now
  also try out through the CLASSLA annotator web interface
  <https://clarin.si/oznacevalnik/eng>.
* Each document in each corpus is annotated with the
  X-GENRE multilingual genre classifier
  <https://huggingface.co/classla/xlm-roberta-base-multilingual-text-genre-classifier>.
  The normalized distribution of genre labels inside the CLASSLA web
  corpora is presented in the following figure.
For more details, we warmly invite you to read our new blog post which
introduces the CLASSLA-web corpora
<https://www.clarin.si/info/k-centre/comparable-classla-web-corpora-of-south-slavic-languages/>.
The blog post provides further information on the corpus sizes and interesting
insights on the correlations between genre distributions and GDP per
capita across the seven South Slavic countries.
We will be very glad to obtain feedback on our corpora and annotation
technology. As usual, please write to us at helpdesk.classla at clarin.si
<mailto:helpdesk.classla at clarin.si>!
These corpora would not have been released without the great collaboration
within the CLASSLA Knowledge centre for South Slavic languages, which
includes the Slovenian consortium CLARIN.SI
<https://www.clarin.si/info/about/>, the Institute of Croatian Language
<http://ihjj.hr>, and the Bulgarian consortium CLADA-BG
<https://clada-bg.eu/en/>. Equally crucial was the
longstanding collaboration with the ReLDI centre
<https://reldi.spur.uzh.ch> on a series of South Slavic languages, and
with Biljana Stojanovska and Katerina Zdravkova on Macedonian. On this
occasion, we want to thank everyone for the collaboration, and invite
others to join our common efforts!
We wish you happy holidays and a prosperous New Year,
Nikola, Taja, and many other CLASSLAers
CLASSLA: The Knowledge Centre for South Slavic Languages
<https://www.clarin.si/info/k-centre/>
CLARIN.SI <http://clarin.si/>
Jožef Stefan Institute
Jamova cesta 39, Ljubljana
Slovenia
-------------- next part --------------
A non-text attachment was scrubbed...
Name: EgnHzq0OLDrCZSz9.png
Type: image/png
Size: 174960 bytes
Desc: not available
URL: <https://mailman.ijs.si/pipermail/classla/attachments/20231206/bc669567/attachment-0001.png>