[CLASSLA] Some last-minute holiday greetings

Aleksa Gordić gordicaleksa at gmail.com
Mon Dec 25 10:44:02 CET 2023


Nikola, it was a huge pleasure to work with you! Thank you for the kind
words - we're just getting started! :))

I'm still in crunch mode, but you might already be able to play with YugoGPT
a bit later today.

Btw, I had a call with Andrija, who's fine-tuning Whisper. His story is
quite interesting: he works in a library and is a philosopher "by trade",
and he picked up ML along the way by going through a Hugging Face course.

This is a really cool report - thanks for including me.

-Aleksa

On Sun, Dec 24, 2023 at 1:58 PM Nikola Ljubešić <nljubesi at gmail.com> wrote:

> This year has been extremely packed with activities, hence this very
> last-minute cheer - we are well on track to becoming much less of a
> less-resourced language family! Here are a few examples that first come
> to mind.
>
> - Macedonian has arrived in Universal Dependencies
> <https://universaldependencies.org/treebanks/mk_mtb/index.html> 🥳🥳🥳
> thanks to Vladimir Cvetkoski. This may be "only" 155 sentences and 1,360
> tokens, but, hey - it is infinitely more than there was before. Bravo,
> Vladimir!
>
> - CLASSLA followed Vladimir's great example and decided to publish
> SETimes.MK <http://hdl.handle.net/11356/1886> in its current state as
> version 0.1, 570 sentences and 13,310 tokens in size, annotated at the
> XPOS, UPOS, FEATS and LEMMA levels, to give additional momentum to the
> positive developments for Macedonian.
>
> - In Slovenia, the PoVeJMo project <https://www.cjvt.si/povejmo/> has
> started, focused on adapting an LLM to the Slovenian language in general,
> as well as to a series of industrial use cases.
>
> - Andrija Sagić, a multimedia enthusiast, is taking a serious bite of the
> speech apple, further fine-tuning the really great whisper-large-v3
> model <https://huggingface.co/Sagicc/whisper-large-v3-sr-cmb> on all the
> data he can scrape together for Serbian, which mostly consists of our
> Južne Vesti dataset <http://hdl.handle.net/11356/1679>. We are now working
> with Andrija on improving the dataset (quite a few typos in the human
> transcripts!) and are looking forward to jointly publishing version 2.0.
> This is the type of collaboration we very much need!
>
> - Together with ICEF, Belgrade, the ReLDI team has started the
> industry-funded (you do not see many of those!) COMtext.SR project
> <https://icef-nlp.github.io/COMtext.SR/>, aimed at collecting, curating,
> annotating and publicly releasing textual data from various domains of
> special interest to industry.
>
> - The JeRTeh society has started publishing transformer models for
> Serbian, the first two being named Vrabac
> <https://huggingface.co/jerteh/gpt2-vrabac> and Orao
> <https://huggingface.co/jerteh/gpt2-orao>. You can guess which one is
> bigger. :-) We were told that additional models will be coming from that
> direction, and we are very much looking forward to those!
>
> - You might have followed on social media the most productive project I
> have ever seen - the YugoGPT model
> <https://www.linkedin.com/posts/aleksagordic_well-its-official-yugogpt-7b-significantly-activity-7143209223722627072-0s9Y/>
> - the work of Aleksa Gordić. We were happy to be able to support Aleksa,
> at least on the data front and in some discussions. It was not easy to
> keep up with that guy! Wow! We really hope this is not Aleksa's first and
> last HBS LLM rodeo!
>
> We wish you calm and relaxing holidays!
>
> The CLASSLA team
>
>

