[CLASSLA] A web corpus intermezzo

Nikola Ljubešić nljubesi at gmail.com
Tue Apr 25 21:05:15 CEST 2023


Dear all,

We have been rather quiet for some time now due to many developments
that required our full capacity. But you can expect reports on interesting
resources, tools, and experiments in the coming months!

We were, however, not the only ones kept busy in the past period. Philipp
Wasserscheidt has recently published the PDRS web corpus of the Serbian
language, 715 million tokens in size. You can find more details on the
corpus in the CLARIN.SI repository entry
(http://hdl.handle.net/11356/1752), where the corpus is available for
download. The corpus is also available via the CLARIN.SI concordancers
(NoSkE link:
https://www.clarin.si/noske/run.cgi/corp_info?corpname=pdrs10&struct_attr_stats=1
).

Philipp is also making sure that future users know how to use the corpus.
This is slightly last-minute, but maybe still not too late for some of you:
a workshop on using the PDRS web corpus will be held from this Thursday
to Saturday in Belgrade. More information is available at
https://javnidiskurs.rs/poziv-na-radionicu-pdrs-1-0/.

Since we are on the topic of web corpora, we have two pieces of news to
share right away as well:

1. I have taken one of the leading roles in the ACL Special Interest Group
for Web as a Corpus (SIGWAC). If you are interested in this area of
research, you should join the SIG by signing up to the mailing list at
http://devel.sslmit.unibo.it/mailman/listinfo/sigwac.

2. We are in the process of releasing the MaCoCu datasets, which are web
crawls of various national top-level domains, including those of Slovenia,
Croatia, Bosnia and Herzegovina, Montenegro, Serbia, Macedonia and
Bulgaria. Here we share only the link to the Macedonian dataset:
http://hdl.handle.net/11356/1801. Linguistic processing of the datasets has
just started and will result in the CLASSLA web corpora, to be updated on
a biyearly basis.

Warm greetings from Zagreb (if any of you will be in Dubrovnik next week
for EACL, let me know; we might organise a meet-up, similar to the one at
JTDH in Ljubljana),

Nikola