[CLASSLA] Looking forward to 2023!
Taja Kuzman
Taja.Kuzman at ijs.si
Thu Dec 22 11:19:37 CET 2022
CLASSLA Mailing List
Dear all,
We wanted to wish all of you happy holidays and a successful 2023. To
wrap up a very busy but also very successful 2022, we are sharing
with you what we will be releasing in the first half of next year.
We are working on releasing a new version of our CLASSLA-Stanza tool
[1], with the following improvements:
* Minor improvements to usability and the programming interface
* New Slovenian models for standard and Internet non-standard
language, but also for spoken language (transcripts), most of these
improvements resulting from the VERY successful RSDO project [2]
* New standard and non-standard models for Croatian and Serbian, as we
are constantly working on improving our data [3] (it is a never-ending
game)
* Drastically improved standard model for Macedonian (we resolved
numerous errors by extending the training data; the previous model was
trained only on a single novel by Orwell)
* We will also release the tool through a web interface and a web
service, similar to the RSDO interface for linguistic processing of
Slovenian [4] (which also uses CLASSLA-Stanza)
Within the ParlaMint project [5] we are working hard on releasing
parliamentary corpora for the Slovenian, Croatian, Bosnian, Serbian and
Bulgarian parliaments, which is one of the big coordination successes of
the CLASSLA K-centre. For comparison, the first iteration of the
ParlaMint corpus [6] covered only one term of the Croatian parliament,
while the new corpus will cover six terms. Bosnian and Serbian
were not part of the first iteration of the ParlaMint corpus.
Finally, we will also publish our new generation of web corpora, called
CLASSLA-web. We have already prepared the raw data for Slovenian [7]
(1.8 billion words), Croatian [8] (2.3 billion words), Macedonian [9]
(524 million words), and Bulgarian [10] (3.5 billion words), but we will
release the corpora, both for download and through concordancers, only
once all the languages are fully processed (we are currently processing
Bosnian, Montenegrin and Serbian) and the data are annotated with the
latest version of CLASSLA-Stanza.
Happy holidays everyone!
Nikola, Taja, and many other CLASSLAers
CLASSLA: The Knowledge Centre for South Slavic Languages [11]
CLARIN.SI [12]
Jožef Stefan Institute
Jamova cesta 39, Ljubljana
Slovenia
Links:
------
[1] https://pypi.org/project/classla/
[2] https://rsdo.slovenscina.eu
[3] https://github.com/reldi-data
[4] https://orodja.cjvt.si/oznacevalnik/eng/
[5] https://www.clarin.eu/parlamint
[6] https://link.springer.com/article/10.1007/s10579-021-09574-0
[7] http://hdl.handle.net/11356/1517
[8] http://hdl.handle.net/11356/1516
[9] http://hdl.handle.net/11356/1512
[10] http://hdl.handle.net/11356/1515
[11] https://www.clarin.si/info/k-centre/
[12] http://clarin.si/