[CLASSLA] Christmas came early this year!

Taja Kuzman taja.kuzman at ijs.si
Wed Dec 6 08:05:17 CET 2023


		

Dear all,

The CLASSLA Knowledge centre for South Slavic languages 
<https://www.clarin.si/info/k-centre/> is delighted to announce the 
release of comparable web corpora for all official South Slavic 
languages, namely Slovenian 
<https://www.clarin.si/ske/#concordance?corpname=classlaweb_sl>, 
Croatian 
<https://www.clarin.si/ske/#concordance?corpname=classlaweb_hr>, 
Bosnian 
<https://www.clarin.si/ske/#concordance?corpname=classlaweb_bs>, 
Montenegrin 
<https://www.clarin.si/ske/#concordance?corpname=classlaweb_cnr>, 
Serbian 
<https://www.clarin.si/ske/#concordance?corpname=classlaweb_sr>, 
Macedonian 
<https://www.clarin.si/ske/#concordance?corpname=classlaweb_mk> and 
Bulgarian 
<https://www.clarin.si/ske/#concordance?corpname=classlaweb_bg>. 
Together, the corpora amount to almost 11 billion words! They are freely 
available through the CLARIN.SI NoSketch Engine 
<https://www.clarin.si/ske/#open> concordancer (see our recent tutorial 
on how to query the CLASSLA web corpora and perform statistical 
analyses via the concordancer 
<https://www.clarin.si/info/k-centre/classla-web-bigger-and-better-web-corpora-for-croatian-serbian-and-slovenian-on-clarin-si-concordancers/>).
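For anyone who would like to generate the seven concordancer links in a 
script rather than copy them by hand, here is a minimal sketch. It only 
reproduces the URL pattern visible in the links above 
(`classlaweb_` plus a language code); the names `CORPUS_CODES` and 
`corpus_url` are made up for this illustration, and the pattern is 
inferred from the links, not taken from any documented API.

```python
# Base of the concordance links used in this announcement.
BASE_URL = "https://www.clarin.si/ske/#concordance"

# Language codes as they appear in the corpus names linked above.
CORPUS_CODES = {
    "Slovenian": "sl",
    "Croatian": "hr",
    "Bosnian": "bs",
    "Montenegrin": "cnr",
    "Serbian": "sr",
    "Macedonian": "mk",
    "Bulgarian": "bg",
}


def corpus_url(language: str) -> str:
    """Return the concordancer link for one CLASSLA-web corpus.

    Raises KeyError if the language is not one of the seven listed above.
    """
    code = CORPUS_CODES[language]
    return f"{BASE_URL}?corpname=classlaweb_{code}"


if __name__ == "__main__":
    # Print one link per language, in the order used in the announcement.
    for language in CORPUS_CODES:
        print(f"{language}: {corpus_url(language)}")
```

Keeping the codes in a single dictionary means a link for a further 
corpus could be added in one place, should the same naming scheme be 
used for it.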


This collection of corpora is innovative for the following reasons:

  * This is, to the best of our knowledge, the first collection of
    comparable web corpora covering a whole language group.

  * The collection includes the first general, linguistically annotated
    corpora for two of the seven languages, namely Montenegrin and
    Macedonian.

  * The comparability of the corpora is ensured by performing data
    collection and filtering in the same time period and with the same
    technologies. Furthermore, the corpora underwent uniform
    linguistic processing with the CLASSLA-Stanza
    <https://pypi.org/project/classla/> toolkit, which you can now also
    try out through the CLASSLA annotator web interface
    <https://clarin.si/oznacevalnik/eng>.

  * Each document in each corpus is annotated for genre with the
    X-GENRE multilingual genre classifier
    <https://huggingface.co/classla/xlm-roberta-base-multilingual-text-genre-classifier>.
    The normalized distribution of genre labels in the CLASSLA web
    corpora is presented in the following figure.



For more details, we warmly invite you to read our new blog post, which 
introduces the CLASSLA-web corpora 
<https://www.clarin.si/info/k-centre/comparable-classla-web-corpora-of-south-slavic-languages/>. 
The blog post gives the sizes of the individual corpora and offers 
interesting insights into the correlations between genre distributions 
and GDP per capita across the seven South Slavic countries.


We would be very glad to receive feedback on our corpora and annotation 
technology. As usual, please write to us at helpdesk.classla at clarin.si 
<mailto:helpdesk.classla at clarin.si>!


These corpora would not have been released without the great 
collaboration within the CLASSLA Knowledge centre for South Slavic 
languages, which includes the Slovenian consortium CLARIN.SI 
<https://www.clarin.si/info/about/>, the Institute of Croatian Language 
<http://ihjj.hr>, and the Bulgarian consortium CLADA-BG 
<https://clada-bg.eu/en/>. Equally crucial were the longstanding 
collaboration with the ReLDI centre <https://reldi.spur.uzh.ch> on a 
series of South Slavic languages, and the collaboration with Biljana 
Stojanovska and Katerina Zdravkova on Macedonian. On this occasion, we 
want to thank everyone for their collaboration, and invite others to 
join our common efforts!

We wish you happy holidays and a prosperous New Year,

Nikola, Taja, and many other CLASSLAers

CLASSLA: The Knowledge Centre for South Slavic Languages 
<https://www.clarin.si/info/k-centre/>

CLARIN.SI <http://clarin.si/>

Jožef Stefan Institute

Jamova cesta 39, Ljubljana
Slovenia

-------------- next part --------------
A non-text attachment was scrubbed...
Name: EgnHzq0OLDrCZSz9.png
Type: image/png
Size: 174960 bytes
Desc: not available
URL: <https://mailman.ijs.si/pipermail/classla/attachments/20231206/bc669567/attachment-0001.png>

