<html><head><meta http-equiv="Content-Type" content="text/html; charset=UTF-8" /></head><body style='font-size: 10pt; font-family: Verdana,Geneva,sans-serif'>
<div dir="ltr">
<div dir="ltr">
<div dir="ltr">
<div dir="ltr">
<table border="0" width="640" cellspacing="0" cellpadding="0" align="center">
<tbody>
<tr>
<td>
<table border="0" width="640" cellspacing="0" cellpadding="0" align="center" bgcolor="#ffffff">
<tbody>
<tr>
<td>
<table style="width: 640px; min-width: 640px;" border="0" width="640" cellspacing="0" cellpadding="0" align="center" bgcolor="#ffffff">
<tbody>
<tr>
<td>
<table style="width: 640px; min-width: 640px;" border="0" width="640" cellspacing="0" cellpadding="0" align="center">
<tbody>
<tr>
<td style="padding: 0px 40px;" align="center">
<table border="0" width="100%" cellspacing="0" cellpadding="0" align="center">
<tbody>
<tr>
<td align="left" valign="middle" width="110"><img id="m_925030267947577449gmail-m_6504557075424313283gmail-m_-5089897522223699477logoBlock-4" class="gmail_canned_response_image" style="display: block;" src="https://bucket.mlcdn.com/a/3476/3476114/images/d542a766ebbbc112d5bc5d9e40be271b526a92c6.jpeg" width="110" border="0" /></td>
<td width="20" height="1"> </td>
<td align="right" valign="middle">
<table border="0" width="100%" cellspacing="0" cellpadding="0" align="center">
<tbody>
<tr>
<td style="font-family: Poppins,sans-serif; font-size: 21px; line-height: 31.5px; font-weight: bold; color: #0080ad;" align="right">CLASSLA Mailing List</td>
</tr>
</tbody>
</table>
</td>
</tr>
</tbody>
</table>
</td>
</tr>
</tbody>
</table>
<table style="width: 640px; min-width: 640px;" border="0" width="640" cellspacing="0" cellpadding="0" align="center">
<tbody>
<tr>
<td style="line-height: 10px; min-height: 10px;" height="10"> </td>
</tr>
</tbody>
</table>
</td>
</tr>
</tbody>
</table>
</td>
</tr>
</tbody>
</table>
<table border="0" width="640" cellspacing="0" cellpadding="0" align="center" bgcolor="#ffffff">
<tbody>
<tr>
<td>
<table style="width: 640px; min-width: 640px;" border="0" width="640" cellspacing="0" cellpadding="0" align="center" bgcolor="#ffffff">
<tbody>
<tr>
<td>
<table style="width: 640px; min-width: 640px;" border="0" width="640" cellspacing="0" cellpadding="0" align="center">
<tbody>
<tr>
<td align="center">
<table style="border-top: 3px double #ededf3; border-collapse: initial;" border="0" width="100%" cellspacing="0" cellpadding="0" align="center">
<tbody>
<tr>
<td style="line-height: 0px; min-height: 0px;" height="0"> </td>
</tr>
</tbody>
</table>
</td>
</tr>
</tbody>
</table>
</td>
</tr>
</tbody>
</table>
</td>
</tr>
</tbody>
</table>
<table border="0" width="640" cellspacing="0" cellpadding="0" align="center" bgcolor="#ffffff">
<tbody>
<tr>
<td>
<table style="width: 640px; min-width: 640px;" border="0" width="640" cellspacing="0" cellpadding="0" align="center" bgcolor="#ffffff">
<tbody>
<tr>
<td>
<table style="width: 640px; min-width: 640px;" border="0" width="640" cellspacing="0" cellpadding="0" align="center">
<tbody>
<tr>
<td style="line-height: 10px; min-height: 10px;" height="10"> </td>
</tr>
</tbody>
</table>
<table style="width: 640px; min-width: 640px;" border="0" width="640" cellspacing="0" cellpadding="0" align="center">
<tbody>
<tr>
<td style="padding: 0px 40px;" align="center">
<table style="border-radius: 2px;" border="0" width="560" cellspacing="0" cellpadding="0" align="center">
<tbody>
<tr>
<td style="padding: 0px 40px; border: 1px solid #e6e6e6; border-radius: 2px;" align="center" bgcolor="#FCFCFC">
<table border="0" width="100%" cellspacing="0" cellpadding="0" align="center">
<tbody>
<tr>
<td height="30"> </td>
</tr>
<tr>
<td id="m_925030267947577449gmail-m_6504557075424313283gmail-m_-5089897522223699477bodyText-8" style="font-family: Poppins,sans-serif; font-size: 14px; line-height: 21px; color: #000000;">
<p><span style="font-weight: 400;">Hi all,</span></p>
<p><span style="font-weight: 400;">Happy Friday! We are pleased to announce the release of new high-quality monolingual and parallel web corpora for South Slavic languages. The corpora were created within the </span><a href="https://macocu.eu/" target="_blank" rel="noopener noreferrer"><span style="font-weight: 400;">MaCoCu</span></a><span style="font-weight: 400;"> project, which collects monolingual and parallel data from the Internet for under-resourced European languages, including South Slavic languages.</span></p>
<p><span style="font-weight: 400;">The datasets were built by crawling the national top-level domains and dynamically extending the crawl to other domains. Considerable effort was devoted to cleaning the extracted text to provide high-quality web corpora, including boilerplate removal, identification of near-duplicate paragraphs, discarding of short texts and texts that are not in the target language, and manual checks of some of the corpora. More information on the construction of the corpora, along with links to the freely available tools used for crawling and cleaning, can be found in the resource descriptions published in the CLARIN.SI repository (see the links below).</span></p>
<p><span style="font-weight: 400;">The following new South Slavic corpora are freely available from the CLARIN.SI repository:</span></p>
<ul>
<li style="font-weight: 400;"><a href="http://hdl.handle.net/11356/1516" target="_blank" rel="noopener noreferrer"><span style="font-weight: 400;">Croatian web corpus MaCoCu-hr 1.0</span></a><span style="font-weight: 400;"> with 2.3 billion words in 7 million texts;</span></li>
<li style="font-weight: 400;"><a href="http://hdl.handle.net/11356/1517" target="_blank" rel="noopener noreferrer"><span style="font-weight: 400;">Slovene web corpus MaCoCu-sl 1.0</span></a><span style="font-weight: 400;"> with 1.8 billion words in 5.8 million texts;</span></li>
<li style="font-weight: 400;"><a href="http://hdl.handle.net/11356/1512" target="_blank" rel="noopener noreferrer"><span style="font-weight: 400;">Macedonian web corpus MaCoCu-mk 1.0</span></a><span style="font-weight: 400;"> with 0.5 billion words in 1.96 million texts;</span></li>
<li style="font-weight: 400;"><a href="http://hdl.handle.net/11356/1515" target="_blank" rel="noopener noreferrer"><span style="font-weight: 400;">Bulgarian web corpus MaCoCu-bg 1.0</span></a><span style="font-weight: 400;"> with 3.5 billion words in 10.5 million texts;</span></li>
<li style="font-weight: 400;"><a href="http://hdl.handle.net/11356/1522" target="_blank" rel="noopener noreferrer"><span style="font-weight: 400;">Croatian-English parallel corpus MaCoCu-hr-en 1.0</span></a><span style="font-weight: 400;"> with 135 million words in 3 million segments (sentence pairs);</span></li>
<li style="font-weight: 400;"><a href="http://hdl.handle.net/11356/1523" target="_blank" rel="noopener noreferrer"><span style="font-weight: 400;">Slovene-English parallel corpus MaCoCu-sl-en 1.0</span></a><span style="font-weight: 400;"> with 137 million words in 3 million segments;</span></li>
<li style="font-weight: 400;"><a href="http://hdl.handle.net/11356/1513" target="_blank" rel="noopener noreferrer"><span style="font-weight: 400;">Macedonian-English parallel corpus MaCoCu-mk-en 1.0</span></a><span style="font-weight: 400;"> with 24 million words in 0.48 million segments;</span></li>
<li style="font-weight: 400;"><a href="http://hdl.handle.net/11356/1521" target="_blank" rel="noopener noreferrer"><span style="font-weight: 400;">Bulgarian-English parallel corpus MaCoCu-bg-en 1.0</span></a><span style="font-weight: 400;"> with 159 million words in 3.9 million segments.</span></li>
</ul>
<p><span style="font-weight: 400;">We are already working on using the above datasets to pre-train BERT-like language models and to produce linguistically annotated corpora that will be available through our concordancers.</span></p>
<p><span style="font-weight: 400;">Next year, the corpora will be upgraded and additional South Slavic monolingual and parallel corpora will be released, namely for Bosnian, Serbian and Montenegrin. In the meantime, if you use the corpora in your research, we would be very happy to hear your feedback, whether positive or negative.</span></p>
<p><br /></p>
<p><span style="font-weight: 400;">Best regards,</span></p>
<p><span style="font-weight: 400;">The CLASSLA Team</span></p>
</td>
</tr>
<tr>
<td height="30"> </td>
</tr>
</tbody>
</table>
</td>
</tr>
</tbody>
</table>
</td>
</tr>
</tbody>
</table>
<table style="width: 640px; min-width: 640px;" border="0" width="640" cellspacing="0" cellpadding="0" align="center">
<tbody>
<tr>
<td style="line-height: 10px; min-height: 10px;" height="10"> </td>
</tr>
</tbody>
</table>
</td>
</tr>
</tbody>
</table>
</td>
</tr>
</tbody>
</table>
<table border="0" width="640" cellspacing="0" cellpadding="0" align="center" bgcolor="#e6f4ff">
<tbody>
<tr>
<td>
<table style="width: 640px; min-width: 640px;" border="0" width="640" cellspacing="0" cellpadding="0" align="center" bgcolor="#e6f4ff">
<tbody>
<tr>
<td>
<table style="width: 640px; min-width: 640px;" border="0" width="640" cellspacing="0" cellpadding="0" align="center">
<tbody>
<tr>
<td style="line-height: 20px; min-height: 20px;" height="20"> </td>
</tr>
</tbody>
</table>
<table style="width: 640px; min-width: 640px;" border="0" width="640" cellspacing="0" cellpadding="0" align="center">
<tbody>
<tr>
<td style="padding: 0px 40px;" align="center">
<table border="0" width="100%" cellspacing="0" cellpadding="0" align="center">
<tbody>
<tr>
<td style="font-family: Poppins,sans-serif; font-size: 14px; font-weight: bold; line-height: 21px; color: #111111;" align="left"><a href="https://www.clarin.si/info/k-centre/" target="_blank" rel="noopener noreferrer">CLASSLA: The Knowledge Centre for South Slavic Languages</a></td>
</tr>
</tbody>
</table>
</td>
</tr>
</tbody>
</table>
<table style="width: 640px; min-width: 640px;" border="0" width="640" cellspacing="0" cellpadding="0" align="center">
<tbody>
<tr>
<td height="10"> </td>
</tr>
</tbody>
</table>
<table style="width: 640px; min-width: 640px;" border="0" width="640" cellspacing="0" cellpadding="0" align="center">
<tbody>
<tr>
<td style="padding: 0px 40px;" align="center">
<table border="0" width="100%" cellspacing="0" cellpadding="0" align="center">
<tbody>
<tr>
<td align="center">
<table style="width: 267px; min-width: 267px;" border="0" width="267" cellspacing="0" cellpadding="0" align="left">
<tbody>
<tr>
<td id="m_925030267947577449gmail-m_6504557075424313283gmail-m_-5089897522223699477footerText-10" style="font-family: Poppins,sans-serif; font-size: 12px; line-height: 18px; color: #111111;" align="left">
<p style="margin-top: 0px; margin-bottom: 10px;"><a href="http://clarin.si/" target="_blank" rel="noopener noreferrer">CLARIN.SI</a></p>
<p style="margin-top: 0px; margin-bottom: 10px;">Jožef Stefan Institute</p>
<p style="margin-top: 0px; margin-bottom: 0px;">Jamova cesta 39, Ljubljana<br />Slovenia</p>
</td>
</tr>
<tr>
<td height="25"> </td>
</tr>
</tbody>
</table>
<table style="width: 267px; min-width: 267px;" border="0" width="267" cellspacing="0" cellpadding="0" align="right">
<tbody>
<tr>
<td id="m_925030267947577449gmail-m_6504557075424313283gmail-m_-5089897522223699477footerUnsubscribeText-10" style="font-family: Poppins,sans-serif; font-size: 12px; line-height: 18px; color: #111111;" align="right">
<p style="margin-top: 0px; margin-bottom: 0px;"><br /><span style="font-size: 10px;"></span></p>
</td>
</tr>
<tr>
<td height="10"> </td>
</tr>
<tr>
<td style="font-family: Poppins,sans-serif; font-size: 12px; line-height: 18px; color: #111111;" align="right"> </td>
</tr>
</tbody>
</table>
</td>
</tr>
</tbody>
</table>
</td>
</tr>
</tbody>
</table>
</td>
</tr>
</tbody>
</table>
</td>
</tr>
</tbody>
</table>
</td>
</tr>
</tbody>
</table>
</div>
</div>
</div>
</div>
</body></html>