[CLASSLA] "From Fringe to Infrastructure" Dataset

Nikola Ljubešić nljubesi at gmail.com
Thu Mar 14 14:35:16 CET 2024


OK, thanks, now I get the "language attitude" (LA) part; I missed it before.
Before I add Damjan Popič, the third author, who wrote this part of the
chapter, I would like you to have a look at the following datasets:
http://hdl.handle.net/11356/1369 and http://hdl.handle.net/11356/1370. As I
understand it, one consists of news articles and the other of comments on
those articles, both "language-related".

Would these be of use to you? They are open datasets and, again, any
annotations you add to them could be more useful than annotations of the
Twitter data, which has not been published (yet; it could be).

There are similar datasets in other languages available as well (Serbian,
Macedonian, Slovenian, Bosnian, Montenegrin).

Nikola

On Thu, Mar 14, 2024 at 12:34 PM Barbara Kovacic <
Barbara.Kovacic at campus.lmu.de> wrote:

> Dear Nikola,
>
> Thank you for your answer. I would need the dataset described in Section
> 4.2.1 of the paper, where you filtered the dataset described in Section 3
> based on the keywords (language, orthography, grammar, dictionary,
> Croatian). As we want to focus on the Croatian language, we need only the
> Croatian part of this data. As you annotated 750 of these tweets (50 per
> keyword), it would be great to get the language attitude (LA) / stance
> annotation for them.
>
> For our course, we need to annotate sentiment for at least 2000 sentences.
> Depending on whether a tweet consists of one or more sentences, we would do
> the following:
>
> 1. Annotate the 750 LA/stance-annotated tweets for sentiment at sentence
> level
> 2. Analyse whether there is a correlation between sentiment and LA/stance
> 3. Annotate further keyword-filtered tweets (e.g. "language") for sentiment
> at sentence level
> 4. Analyse whether there is a correlation between sentiment and keyword
> 5. Train and fine-tune a transformer model for sentiment
> 6. Optional: if there is still time left and annotation guidelines are
> available, annotate more tweets for LA/stance
>
> We are currently a group of two people but hope to find more colleagues
> who want to work on the task. Since our particular interest lies in
> language attitudes rather than in social media data specifically, we would
> also be open to non-social-media data that has been annotated for
> LA/stance. What matters is that we have enough sentences available to
> annotate.
>
> Thank you for your help!
>
> Best Regards,
>
> Barbara
>
> > Original message:
> > From: "Nikola Ljubešić" <nljubesi at gmail.com>
> > To: Barbara Kovacic <Barbara.Kovacic at campus.lmu.de>
> > Cc: nikola.ljubesic at ijs.si, darja.fiser at ff.uni-lj.si
> > Date: Thu Mar 14 10:50:45 CET 2024
> >
> > Dear Barbara,
> >
> > Thanks for reaching out. Can you specify which dataset you are referring
> > to? The dataset used to identify stances towards language in the
> > different countries? That is the work of the third co-author, but I might
> > have access to the data (I would need to dig, and probably also include
> > him in the discussion).
> >
> > Beyond that, if you need any social media data, it would be very cool if
> > you annotated some of the training datasets mentioned in the paper:
> > ReLDI-NormTagNER, in either Serbian (-sr) or Croatian (-hr). These are
> > available via the links provided in the paper.
> >
> > These datasets are also already public, so your annotations might enrich
> > a public dataset further. Larger datasets from Twitter are not allowed to
> > be published in text form.
> >
> > Let me know,
> >
> > Nikola
> >
> > On Thu, Mar 14, 2024 at 10:43 AM Barbara Kovacic <
> > Barbara.Kovacic at campus.lmu.de> wrote:
> >
> > > Dear Mr Ljubešić, dear Ms Fišer,
> > >
> > > Is there an update on this topic?
> > >
> > > Best Regards,
> > >
> > > Barbara Kovačić
> > >
> > > > Original message:
> > > > From: Barbara Kovacic <Barbara.Kovacic at campus.lmu.de>
> > > > To: nikola.ljubesic at ijs.si, darja.fiser at ff.uni-lj.si
> > > > Cc:
> > > > Date: Fri Mar 08 12:42:15 CET 2024
> > > >
> > > > Dear Mr Ljubešić, dear Ms Fišer,
> > > >
> > > > I am a computational linguistics student from LMU Munich, currently
> > > > doing an Erasmus exchange at the University of Zagreb.
> > > >
> > > > In the class "Obrada prirodnog jezika", taught by Gaurish Thakkar
> > > > and Nives Mikelić Preradović, our focus is on sentiment analysis.
> > > > For our final project we have to annotate a dataset and fine-tune a
> > > > transformer model for sentiment analysis.
> > > >
> > > > As I am currently exploring methods for researching language
> > > > attitudes computationally, I was thinking of using the dataset you
> > > > created for South Slavic UGC, described in the paper "From Fringe to
> > > > Infrastructure", annotating it for sentiment, and using it to
> > > > fine-tune BERTić or CroSloEngual BERT for sentiment analysis.
> > > >
> > > > I tried to find the dataset online as a downloadable resource, but
> > > > was not successful. I therefore wanted to ask whether there is a
> > > > website where I can download it. If not, would it be possible for
> > > > you to give me access to it? As I am also interested in the way you
> > > > annotated the language attitude categories described in the paper,
> > > > I wanted to ask whether I could get the annotated part of the
> > > > dataset and, if available, the annotation guidelines that you used.
> > > >
> > > > Thanks for your help in advance.
> > > >
> > > > Best Regards,
> > > >
> > > > Barbara Kovačić
> > >
> >
>
>
> Kind regards,
>
> Barbara Kovačić
>