[CLASSLA] New state-of-the-art version of CLASSLA-Stanza pipeline for linguistic processing of South Slavic languages

Wed Sep 13 15:42:25 CEST 2023

CLASSLA Mailing List


Dear all,

The CLASSLA Knowledge Centre for South Slavic Languages 
<https://www.clarin.si/info/k-centre/> is delighted to announce the 
release of an improved CLASSLA-Stanza pipeline 
<https://pypi.org/project/classla/>, which enables state-of-the-art 
linguistic processing of Slovenian, Croatian, Serbian, Macedonian and 
Bulgarian.

In addition to covering standard varieties of five South Slavic 
languages, the pipeline also provides special modules for linguistic 
annotation of non-standard text and web corpora for Slovenian, Croatian 
and Serbian. The CLASSLA-Stanza annotation tool supports a total of six 
tasks: tokenization, morphosyntactic annotation, lemmatization, 
dependency parsing, semantic role labeling, and named-entity 
recognition. The main improvements that distinguish CLASSLA-Stanza 
from the original Stanza pipeline are:

  * support for external inflectional lexicons, which significantly
    improves performance on morphologically rich languages;

  * extended training datasets (beyond Universal Dependencies data) for
    all included models;

  * use of CLARIN.SI-embed <https://shorturl.at/ehwS2> word embeddings,
    trained on significantly larger and more diverse datasets than
    the embeddings used by Stanza;

  * specific modules for standard, non-standard and web text.

As a result, we are happy to report that CLASSLA-Stanza 
significantly outperforms Stanza, with error reduction between 34% and 
98% on the official Slovenian benchmark (see the table below, which 
reports performance as micro-F1 scores). You can find more details on 
the pipeline improvements and training settings in the technical report 
“CLASSLA-Stanza: The Next Step for Linguistic Processing of South Slavic 
Languages <https://arxiv.org/abs/2308.04255>” (Terčon & Ljubešić, 2023).
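
For readers unfamiliar with the term, here is a minimal sketch of how a relative error reduction figure can be derived from two F1 scores. The announcement does not spell out the formula, so this assumes the common definition (error = 100 minus the F1 score in percent); the function name and the example scores are hypothetical and are not taken from the benchmark.

```python
def relative_error_reduction(baseline_f1: float, improved_f1: float) -> float:
    """Return the relative error reduction in percent.

    Both scores are expected in percent (0-100). The error of a system
    is assumed to be 100 minus its F1 score (an assumption, see above).
    """
    baseline_error = 100.0 - baseline_f1
    improved_error = 100.0 - improved_f1
    if baseline_error <= 0:
        raise ValueError("baseline has no error left to reduce")
    # Share of the baseline error that the improved system removes.
    return 100.0 * (baseline_error - improved_error) / baseline_error


# Hypothetical scores, for illustration only: going from 90.0 to 95.0
# halves the remaining error.
print(relative_error_reduction(90.0, 95.0))
```

Under this definition, a small absolute gain on an already strong baseline can still correspond to a large relative error reduction.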

You can use CLASSLA-Stanza as a Python library 
<https://pypi.org/project/classla/> (documentation is available here 
<https://github.com/clarinsi/classla>) or via an online service 
<https://orodja.cjvt.si/oznacevalnik/eng/> (currently available for 
Slovenian; other languages and modules are coming soon). The models are 
also freely available separately at the CLARIN.SI repository 
<https://shorturl.at/iquyX>.
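
As a rough illustration of the library route, the sketch below follows the usage pattern from the project README as we understand it (`classla.download`, `classla.Pipeline`, `doc.to_conll()`); please check the linked documentation for the authoritative interface. The helper names `pipeline_kwargs` and `annotate` are hypothetical, and the language and variety restrictions simply mirror what is stated in this announcement.

```python
# Languages named in the announcement (ISO 639-1 codes).
SUPPORTED_LANGS = {"sl", "hr", "sr", "mk", "bg"}
# Non-standard and web modules are announced for these languages only.
SPECIAL_VARIETY_LANGS = {"sl", "hr", "sr"}
VARIETIES = ("standard", "nonstandard", "web")


def pipeline_kwargs(lang: str, variety: str = "standard") -> dict:
    """Validate a language/variety pair and build keyword arguments."""
    if lang not in SUPPORTED_LANGS:
        raise ValueError(f"unsupported language: {lang!r}")
    if variety not in VARIETIES:
        raise ValueError(f"unknown variety: {variety!r}")
    if variety != "standard" and lang not in SPECIAL_VARIETY_LANGS:
        raise ValueError(f"no {variety} module announced for {lang!r}")
    return {"lang": lang, "type": variety}


def annotate(text: str, lang: str = "sl", variety: str = "standard") -> str:
    """Annotate text and return the result in CoNLL-U format."""
    # Third-party dependency, imported lazily: pip install classla
    import classla

    kwargs = pipeline_kwargs(lang, variety)
    classla.download(**kwargs)        # fetch the models (needs network access)
    nlp = classla.Pipeline(**kwargs)  # build the processing pipeline
    return nlp(text).to_conll()
```

Calling `annotate("France Prešeren je rojen v Vrbi.")` would download the standard Slovenian models on first use and return the annotated sentence as a CoNLL-U string; passing `variety="nonstandard"` or `variety="web"` selects the specialised modules.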

These results would not have been possible without the immense effort 
put into developing high-quality training datasets together with our 
collaborators all around Europe. We would like to take this opportunity 
to thank all of them most warmly!

Best regards, Nikola, Taja, and many other CLASSLAers

CLASSLA: The Knowledge Centre for South Slavic Languages 
<https://www.clarin.si/info/k-centre/>

CLARIN.SI <http://clarin.si/>

Jožef Stefan Institute

Jamova cesta 39, Ljubljana
Slovenia
