[CLASSLA] Some last-minute holiday greetings
Nikola Ljubešić
nljubesi at gmail.com
Sun Dec 24 13:58:21 CET 2023
This year has been extremely packed with activities, hence this very
last-minute cheer - we are well on track to become much less of a
less-resourced language family! Here are a few examples that come to mind
first.
- Macedonian has arrived in Universal Dependencies 🥳
<https://universaldependencies.org/treebanks/mk_mtb/index.html>🥳🥳 thanks
to Vladimir Cvetkoski. This may be "only" 155 sentences and 1,360 tokens,
but, hey - it is infinitely more than there was before. Bravo, Vladimir!
- Following Vladimir's great example, CLASSLA decided to publish
SETimes.MK <http://hdl.handle.net/11356/1886> in its current state as
version 0.1 - 570 sentences and 13,310 tokens in size, annotated at the
XPOS, UPOS, FEATS and LEMMA levels - to give additional momentum to the
positively developing situation for Macedonian.
- In Slovenia the PoVeJMo project <https://www.cjvt.si/povejmo/> has
started, focused on adapting an LLM to the Slovenian language in general,
as well as to a series of industrial use cases.
- Andrija Sagić, a multimedia enthusiast, is taking a serious bite of the
speech apple, further fine-tuning the really great whisper-large-v3 model
<https://huggingface.co/Sagicc/whisper-large-v3-sr-cmb> on all the data he
can scrape together for Serbian, which mostly includes our Južne Vesti
dataset <http://hdl.handle.net/11356/1679>. We are now working with Andrija
on improving the dataset (quite a few typos in the human transcripts!) and
are looking forward to jointly publishing a version 2.0. This is the type
of collaboration we very much need!
- The ReLDI team, together with ICEF, Belgrade, has started the
industry-funded (you do not see many of those!) COMtext.SR project
<https://icef-nlp.github.io/COMtext.SR/>, aimed at collecting, curating,
annotating and publicly releasing textual data from various domains of
special interest to industry.
- The JeRTeh society has started publishing transformer models for Serbian,
the first two models being named Vrabac
<https://huggingface.co/jerteh/gpt2-vrabac> and Orao
<https://huggingface.co/jerteh/gpt2-orao>. You can guess which is the
bigger one. :-) We were told that more models are coming from that
direction and we are very much looking forward to those!
- You might have followed on social media the most productive project I
have ever seen - the yugoGPT model
<https://www.linkedin.com/posts/aleksagordic_well-its-official-yugogpt-7b-significantly-activity-7143209223722627072-0s9Y/>
- the work of Aleksa Gordić. We were happy to be able to support Aleksa at
least on the data and discussion fronts. It was not easy to keep up
with that guy! Wow! We really hope this is not Aleksa's (first? and) last
HBS LLM rodeo!
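For readers new to the treebank resources mentioned above: UD treebanks like
mk_mtb and datasets like SETimes.MK are distributed in the tab-separated
CoNLL-U format, where each token line carries the FORM, LEMMA, UPOS, XPOS and
FEATS columns among others. A minimal stdlib-only sketch of reading those
columns (the sample token line is invented for illustration, not taken from
either corpus):

```python
# Parse the annotation columns of a single CoNLL-U token line.
# CoNLL-U has 10 tab-separated columns:
# ID, FORM, LEMMA, UPOS, XPOS, FEATS, HEAD, DEPREL, DEPS, MISC.
# The sample below is a made-up Macedonian token, for illustration only.
sample = "1\tЗдраво\tздраво\tINTJ\tI\t_\t0\troot\t_\t_"

def parse_token(line: str) -> dict:
    """Return the morphosyntactic annotation fields of one token line."""
    cols = line.split("\t")
    return {
        "form": cols[1],
        "lemma": cols[2],
        "upos": cols[3],
        "xpos": cols[4],
        "feats": cols[5],
    }

tok = parse_token(sample)
print(tok["upos"])   # INTJ
print(tok["lemma"])  # здраво
```

Real files also contain comment lines starting with "#" and multiword-token
ranges, which a full reader would need to handle.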
We wish you calm and relaxing holidays!
The CLASSLA team