[CLASSLA] New state-of-the-art version of CLASSLA-Stanza pipeline for linguistic processing of South Slavic languages
Taja Kuzman
taja.kuzman at ijs.si
Wed Sep 13 15:42:25 CEST 2023
CLASSLA Mailing List
Dear all,
The CLASSLA Knowledge Centre for South Slavic Languages
<https://www.clarin.si/info/k-centre/> is delighted to announce the
release of an improved CLASSLA-Stanza pipeline
<https://pypi.org/project/classla/>, which enables state-of-the-art
linguistic processing of Slovenian, Croatian, Serbian, Macedonian and
Bulgarian.
In addition to covering standard varieties of five South Slavic
languages, the pipeline also provides special modules for linguistic
annotation of non-standard text and web corpora for Slovenian, Croatian
and Serbian. The CLASSLA-Stanza annotation tool supports a total of six
tasks: tokenization, morphosyntactic annotation, lemmatization,
dependency parsing, semantic role labeling, and named-entity
recognition. Some of the main improvements that separate CLASSLA-Stanza
from the Stanza pipeline are:
* support of external inflectional lexicons which significantly
  increases performance on morphologically rich languages;
* extended training datasets (beyond Universal Dependencies data) for
  all included models;
* use of CLARIN.SI-embed <https://shorturl.at/ehwS2> word embeddings,
  trained on significantly larger and more diverse datasets than
  embeddings used by Stanza;
* specific modules for standard, non-standard and web text.
As a result, we are happy to report that CLASSLA-Stanza
significantly outperforms Stanza, with error reduction between 34% and
98% on the official Slovenian benchmark (see the table below, which
reports performance as micro F1 scores). You can find more details on
the pipeline improvements and training settings in the technical report
“CLASSLA-Stanza: The Next Step for Linguistic Processing of South Slavic
Languages <https://arxiv.org/abs/2308.04255>” (Terčon & Ljubešić, 2023).
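To make the "error reduction" figures concrete, here is a minimal sketch in Python, assuming the conventional definition of relative error reduction (error taken as 100 minus the F1 score in percent). The function name and the scores in the comment are hypothetical illustrations, not values from the benchmark, and the announcement does not state which exact formula was used.

```python
def relative_error_reduction(baseline_f1: float, improved_f1: float) -> float:
    """Return the relative error reduction in percent.

    Both scores are F1 values given in percent (0-100). The error of a
    system is assumed to be 100 - F1; this is the usual convention, but it
    is an assumption here, not something confirmed by the announcement.
    """
    baseline_error = 100.0 - baseline_f1
    improved_error = 100.0 - improved_f1
    if baseline_error <= 0:
        raise ValueError("baseline has no error left to reduce")
    return (baseline_error - improved_error) / baseline_error * 100.0


# Hypothetical scores: going from 90.0 to 95.0 F1 halves the error.
reduction = relative_error_reduction(90.0, 95.0)
```

Under this definition, a small absolute gain on an already strong baseline can still correspond to a large relative error reduction.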
You can use CLASSLA-Stanza as a Python library
<https://pypi.org/project/classla/> (documentation is available here
<https://github.com/clarinsi/classla>) or via an online service
<https://orodja.cjvt.si/oznacevalnik/eng/> (currently available for
Slovenian; other languages and modules are coming soon). Separate models
are also freely available at the CLARIN.SI repository
<https://shorturl.at/iquyX>.
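As a rough illustration of the Python library route, the sketch below wraps the `classla` package (installed with `pip install classla`) in two small helpers. The helper names `pipeline_options` and `annotate`, the `SUPPORTED_TYPES` table, and the two-letter language codes are our own assumptions for illustration; the language and text-type combinations follow the description above, and the project documentation remains the authoritative reference for the available options.

```python
# Text types per language, following the announcement: non-standard and web
# modules exist for Slovenian, Croatian and Serbian only. The two-letter
# codes are an assumption of this sketch; check the documentation.
SUPPORTED_TYPES = {
    "sl": ("standard", "nonstandard", "web"),
    "hr": ("standard", "nonstandard", "web"),
    "sr": ("standard", "nonstandard", "web"),
    "mk": ("standard",),
    "bg": ("standard",),
}


def pipeline_options(lang: str, text_type: str = "standard") -> dict:
    """Validate a language/text-type pair and return pipeline settings."""
    if lang not in SUPPORTED_TYPES:
        raise ValueError(f"unsupported language code: {lang!r}")
    if text_type not in SUPPORTED_TYPES[lang]:
        raise ValueError(f"no {text_type!r} module for language {lang!r}")
    return {"lang": lang, "type": text_type}


def annotate(text: str, lang: str = "sl", text_type: str = "standard") -> str:
    """Annotate text and return the result in CoNLL-U format.

    Hypothetical convenience wrapper: it downloads the models on every
    call, which needs network access and is wasteful in real use.
    """
    import classla  # imported lazily so the helpers above work without it

    opts = pipeline_options(lang, text_type)
    classla.download(opts["lang"], type=opts["type"])
    nlp = classla.Pipeline(opts["lang"], type=opts["type"])
    return nlp(text).to_conll()
```

Calling `annotate("France Prešeren je rojen v Vrbi.", "sl")` would then return the annotated sentence as a CoNLL-U string, assuming the models can be downloaded. In practice you would build the pipeline once and reuse it across documents.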
These results would not have been possible without the immense effort
put into developing high-quality training datasets together with our
collaborators all around Europe. We would like to take this opportunity
to thank all of them most warmly!
Best regards, Nikola, Taja, and many other CLASSLAers
CLASSLA: The Knowledge Centre for South Slavic Languages
<https://www.clarin.si/info/k-centre/>
CLARIN.SI <http://clarin.si/>
Jožef Stefan Institute
Jamova cesta 39, Ljubljana
Slovenia
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <https://mailman.ijs.si/pipermail/classla/attachments/20230913/e419fb21/attachment-0001.htm>