[CLASSLA] "From Fringe to Infrastructure" Dataset

Nikola Ljubešić nljubesi at gmail.com
Thu Mar 21 08:47:59 CET 2024


Dear CLASSLA list participants,

I must apologize for including the list in the conversation between
Barbara and me. This was a mistake: I meant to cc
helpdesk.classla at clarin.si, since the CLASSLA helpdesk is where we offer
information on language resources and technologies for South
Slavic languages.

Nevertheless, two good things came out of my mistake - Barbara joined the
mailing list (yaay, Barbara, stay, the list will be informative for you),
and everyone on the mailing list had the opportunity to see our helpdesk in
action (together with the necessary glitches from overworked academics).

Thank you all for your understanding,

Nikola

On Fri, Mar 15, 2024 at 12:28 PM Barbara Kovacic <
Barbara.Kovacic at campus.lmu.de> wrote:

> Dear Nikola,
>
> I talked to Mr Thakkar and also to the other group members (we are four
> people now) and we decided to use the MetaLangNEWS-COMMENTS-Hr corpus for
> our project and see what insights we can find.
>
> Nevertheless, I wanted to ask if you could ask Mr Popič (or I can contact
> him directly) for the annotation guidelines described in "From Fringe to
> Infrastructure" (if there are any), as this might be a project I want to
> do later. I really like the annotation scheme he developed and there are no
> other annotation schemes for language attitude from an NLP perspective.
>
> Thank you for your help and recommendations!
>
> Best Regards,
>
> Barbara
>
> > Original message:
> > From: "Nikola Ljubešić" <nljubesi at gmail.com>
> > To: Barbara Kovacic <Barbara.Kovacic at campus.lmu.de>, classla at ijs.si
> > Cc: nikola.ljubesic at ijs.si, darja.fiser at ff.uni-lj.si
> > Date: Thu Mar 14 14:35:32 CET 2024
> >
> > Ok, thanks, now I get the "language attitude" / LA part; I missed it
> > before. Before I add Damjan Popič, the third author, who wrote this part
> > of the chapter, I want you to have a look at the following datasets:
> > http://hdl.handle.net/11356/1369 and http://hdl.handle.net/11356/1370.
> > As I understand it, one contains news articles and the other comments on
> > those news articles, both "language-related".
> >
> > Would this be of use to you? It is an open dataset and, again, any
> > annotations you add to it could be more useful than annotations on the
> > Twitter data, which has not been published (yet, though it could be).
> >
> > There are similar datasets in other languages available as well (Serbian,
> > Macedonian, Slovenian, Bosnian, Montenegrin).
> >
> > Nikola
> >
> > On Thu, Mar 14, 2024 at 12:34 PM Barbara Kovacic <
> > Barbara.Kovacic at campus.lmu.de> wrote:
> >
> > > Dear Nikola,
> > >
> > > Thank you for your answer. I would need the dataset described in Section
> > > 4.2.1 of the paper, where you filtered the dataset described in Section 3
> > > based on the keywords (language, orthography, grammar, dictionary,
> > > Croatian). As we want to focus on the Croatian language, we just need the
> > > Croatian part of this data. As you annotated 750 of these tweets (50 per
> > > keyword), it would be great to get the language attitude (LA) / stance
> > > annotation for them.
> > >
> > > For our course, we need to annotate sentiment for at least 2000 sentences.
> > > Depending on whether a tweet consists of one or more sentences, we would do
> > > the following:
> > >
> > > 1. Sentiment-annotate the 750 LA/stance-annotated tweets at sentence level
> > > 2. Analyse whether there is a correlation between sentiment and the
> > >    LA/stance
> > > 3. Sentiment-annotate other tweets based on keywords (e.g. language) at
> > >    sentence level
> > > 4. Analyse whether there is a correlation between sentiment and keyword
> > > 5. Train and fine-tune a transformer model on the sentiment annotations
> > > 6. Optional: if there is still time left and annotation guidelines are
> > >    available, LA/stance-annotate more tweets
> > >
> > > We are currently a group of two people but hope to find more colleagues
> > > who want to work on the task. As our main interest lies in language
> > > attitude and not specifically in social media data, we would also be open
> > > to non-social-media data that was annotated for LA/stance. It is
> > > just important that we have enough sentences available to annotate.
> > >
> > > Thank you for your help!
> > >
> > > Best Regards,
> > >
> > > Barbara
> > >
> > > > Original message:
> > > > From: "Nikola Ljubešić" <nljubesi at gmail.com>
> > > > To: Barbara Kovacic <Barbara.Kovacic at campus.lmu.de>
> > > > Cc: nikola.ljubesic at ijs.si, darja.fiser at ff.uni-lj.si
> > > > Date: Thu Mar 14 10:50:45 CET 2024
> > > >
> > > > Dear Barbara,
> > > >
> > > > Thanks for reaching out. Can you specify what dataset you are referring
> > > > to? The dataset used to identify stances towards language in the
> > > > different countries? This is the work done by the third co-author, but I
> > > > might have access to the data (would need to dig, probably also include
> > > > him in the discussion).
> > > >
> > > > Beyond that, if you need any social media data, it would be very
> > > > cool if you annotated some of the training datasets mentioned in the
> > > > paper: ReLDI-NormTagNER, either in Serbian (-sr) or Croatian (-hr). These
> > > > are available via the links provided in the paper.
> > > >
> > > > These datasets are also already public, so your annotations might enrich
> > > > a public dataset further. Larger datasets from Twitter are not allowed to
> > > > be published in text form.
> > > >
> > > > Let me know,
> > > >
> > > > Nikola
> > > >
> > > > On Thu, Mar 14, 2024 at 10:43 AM Barbara Kovacic <
> > > > Barbara.Kovacic at campus.lmu.de> wrote:
> > > >
> > > > > Dear Mr Ljubešić, dear Ms Fišer,
> > > > >
> > > > > is there an update on this topic?
> > > > >
> > > > > Best Regards,
> > > > >
> > > > > Barbara Kovačić
> > > > >
> > > > > > Original message:
> > > > > > From: Barbara Kovacic <Barbara.Kovacic at campus.lmu.de>
> > > > > > To: nikola.ljubesic at ijs.si, darja.fiser at ff.uni-lj.si
> > > > > > Cc:
> > > > > > Date: Fri Mar 08 12:42:15 CET 2024
> > > > > >
> > > > > > Dear Mr Ljubešić, dear Ms Fišer,
> > > > > >
> > > > > > I am a computational linguistics student from LMU Munich, currently
> > > > > > doing an Erasmus exchange at the University of Zagreb.
> > > > > >
> > > > > > In the class "Obrada prirodnog jezika" (Natural Language Processing),
> > > > > > taught by Gaurish Thakkar and Nives Mikelić Preradović, the focus is
> > > > > > on sentiment analysis. For our final project we have to annotate a
> > > > > > dataset and fine-tune a transformer model for sentiment analysis.
> > > > > >
> > > > > > As I am currently exploring ways to study language attitudes
> > > > > > computationally, I was thinking of using the dataset you created for
> > > > > > South Slavic user-generated content (UGC), described in the paper
> > > > > > "From Fringe to Infrastructure", annotating it for sentiment, and
> > > > > > using it to fine-tune BERTić or CroSloEngual BERT for sentiment
> > > > > > analysis.
> > > > > >
> > > > > > I tried to find the dataset online as a downloadable resource, but was
> > > > > > not successful. I therefore wanted to ask whether there is a website
> > > > > > where I can download the dataset. If not, would it be possible for you
> > > > > > to give me access to it? As I am also interested in the way you
> > > > > > annotated the language attitude categories described in the paper, I
> > > > > > wanted to ask whether I could get the annotated part of the dataset
> > > > > > and, if available, the annotation guidelines that you used.
> > > > > >
> > > > > > Thanks for your help in advance.
> > > > > >
> > > > > > Best Regards,
> > > > > >
> > > > > > Barbara Kovačić
> > > > >
> > > >
> > >
> > >
> > > Kind regards,
> > >
> > > Barbara Kovačić
> > >
> >
>
>
> Kind regards,
>
> Barbara Kovačić
>