[CLASSLA] New massive monolingual and parallel South Slavic web corpora
Taja Kuzman
Taja.Kuzman at ijs.si
Fri May 6 14:37:39 CEST 2022
CLASSLA Mailing List
Hi all,
Happy Friday! We are pleased to announce that new high-quality monolingual
and parallel web corpora for South Slavic languages have been released.
The corpora were created within the scope of the MaCoCu [1] project, which
focuses on collecting monolingual and parallel data from the Internet
for European under-resourced languages, South Slavic languages included.
The datasets were built by crawling the national top-level domains,
extending the crawl dynamically to other domains as well. Considerable
effort was devoted to cleaning the extracted text to provide
high-quality web corpora, including boilerplate removal, identification
of near-duplicate paragraphs, discarding short texts and texts that are
not in the target language, and manual checks of some of the corpora.
More information on the construction of the corpora and links to the
freely available tools that were used for crawling and cleaning can be
found in the resource descriptions published in the CLARIN.SI
repository (see links below).
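
To give a rough idea of two of the cleaning steps mentioned above
(identifying near-duplicate paragraphs and discarding short texts), here
is a minimal Python sketch. It is not the MaCoCu tooling: the function
names, the MIN_WORDS threshold and the normalise-then-hash approach are
assumptions made for illustration only, and the sketch drops repeated
paragraphs, whereas the released corpora may only mark them.

```python
import hashlib
import re

# Hypothetical threshold; the value used for the actual corpora is not stated here.
MIN_WORDS = 5


def normalize(paragraph: str) -> str:
    """Lowercase, map digits to 0, drop punctuation and collapse whitespace."""
    text = paragraph.lower()
    text = re.sub(r"\d", "0", text)
    text = re.sub(r"[^\w\s]", "", text)
    return " ".join(text.split())


def fingerprint(paragraph: str) -> str:
    """Hash of the normalised paragraph; equal hashes count as near-duplicates."""
    return hashlib.sha1(normalize(paragraph).encode("utf-8")).hexdigest()


def clean_texts(texts, min_words=MIN_WORDS):
    """Drop paragraphs already seen in the collection, then drop short texts."""
    seen = set()
    cleaned = []
    for text in texts:
        kept = []
        for paragraph in text.split("\n"):
            paragraph = paragraph.strip()
            if not paragraph:
                continue
            fp = fingerprint(paragraph)
            if fp in seen:
                continue  # near-duplicate of an earlier paragraph
            seen.add(fp)
            kept.append(paragraph)
        # Discard the whole text if too few words remain after deduplication.
        if sum(len(p.split()) for p in kept) >= min_words:
            cleaned.append("\n".join(kept))
    return cleaned
```

Normalising before hashing only catches paragraphs that differ in case,
digits, punctuation or spacing; a production pipeline would more likely
rely on the dedicated tools linked from the repository entries.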
The following new South Slavic corpora are freely available from the
CLARIN.SI repository:
* Croatian web corpus MaCoCu-hr 1.0 [2] with 2.3 billion words in 7
million texts;
* Slovene web corpus MaCoCu-sl 1.0 [3] with 1.8 billion words in 5.8
million texts;
* Macedonian web corpus MaCoCu-mk 1.0 [4] with 0.5 billion words in
1.96 million texts;
* Bulgarian web corpus MaCoCu-bg 1.0 [5] with 3.5 billion words in
10.5 million texts;
* Croatian-English parallel corpus MaCoCu-hr-en 1.0 [6] with 135
million words in 3 million segments (sentence pairs);
* Slovene-English parallel corpus MaCoCu-sl-en 1.0 [7] with 137
million words in 3 million segments;
* Macedonian-English parallel corpus MaCoCu-mk-en 1.0 [8] with 24
million words in 0.48 million segments;
* Bulgarian-English parallel corpus MaCoCu-bg-en 1.0 [9] with 159
million words in 3.9 million segments.
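
For a quick comparison of the corpus sizes listed above, the following
Python sketch computes totals and average words per text or segment. The
figures are copied from this announcement and are rounded, so the derived
values are approximate; the dictionary and function names are
illustrative only, and how words are counted in the parallel corpora is
not specified here.

```python
# (words, units) as stated in the announcement; units are texts for the
# monolingual corpora and segments for the parallel corpora.
MONOLINGUAL = {
    "MaCoCu-hr": (2.3e9, 7.0e6),
    "MaCoCu-sl": (1.8e9, 5.8e6),
    "MaCoCu-mk": (0.5e9, 1.96e6),
    "MaCoCu-bg": (3.5e9, 10.5e6),
}
PARALLEL = {
    "MaCoCu-hr-en": (135e6, 3.0e6),
    "MaCoCu-sl-en": (137e6, 3.0e6),
    "MaCoCu-mk-en": (24e6, 0.48e6),
    "MaCoCu-bg-en": (159e6, 3.9e6),
}


def total_words(corpora):
    """Sum of the word counts over all corpora in the mapping."""
    return sum(words for words, _ in corpora.values())


def average_words_per_unit(corpora):
    """Average number of words per text (or segment) for each corpus."""
    return {name: words / units for name, (words, units) in corpora.items()}


if __name__ == "__main__":
    for label, corpora in (("monolingual", MONOLINGUAL), ("parallel", PARALLEL)):
        print(label, "total words:", total_words(corpora))
        for name, avg in average_words_per_unit(corpora).items():
            print(f"  {name}: about {avg:.0f} words per unit")
```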
We are already using the above datasets to pre-train BERT-like
language models and to produce linguistically annotated
corpora that will be available through our concordancers.
Next year, the corpora will be upgraded and additional South Slavic
monolingual and parallel corpora will be released, namely for Bosnian,
Serbian and Montenegrin. In the meantime, if you use the corpora in your
research, we would be very happy to hear your feedback, whether positive
or negative.
Best regards,
The CLASSLA Team
CLASSLA: The Knowledge Centre for South Slavic Languages [10]
CLARIN.SI [11]
Jožef Stefan Institute
Jamova cesta 39, Ljubljana
Slovenia
Links:
------
[1] https://macocu.eu/
[2] http://hdl.handle.net/11356/1516
[3] http://hdl.handle.net/11356/1517
[4] http://hdl.handle.net/11356/1512
[5] http://hdl.handle.net/11356/1515
[6] http://hdl.handle.net/11356/1522
[7] http://hdl.handle.net/11356/1523
[8] http://hdl.handle.net/11356/1513
[9] http://hdl.handle.net/11356/1521
[10] https://www.clarin.si/info/k-centre/
[11] http://clarin.si/