[CLASSLA] First CfP: 13th Web-as-Corpus (WaC-13) Workshop @EMNLP2026, Budapest, Hungary, 24-29 Oct, 2026

Nikola Ljubešić Nikola.Ljubesic at ijs.si
Wed May 6 13:21:14 CEST 2026


Dear CLASSLA people, find below a call for papers for a workshop that 
will take place in our vicinity this fall. The workshop is aimed at people 
collecting and processing web texts, using web texts in training LLMs, 
but also at researchers performing linguistic research on web texts. 
Having some papers that use, inter alia, the recent CLASSLA-web corpora 
(https://clarinsi.github.io/classla-web/) would be great!

Best,

Nikola


First Call for Papers

13th Web-as-Corpus (WaC-13) Workshop @EMNLP2026, Budapest, Hungary, 
24-29 Oct, 2026

https://wacky-workshop.github.io/

The World Wide Web has evolved from a resource for building linguistic 
corpora into the central data infrastructure powering modern natural 
language processing and Large Language Models (LLMs). As web-scale data 
increasingly shapes AI systems’ knowledge and capabilities, 
understanding its quality, representativeness, and ethical implications 
has become critical.

At the same time, the “more is better” paradigm is being challenged by 
issues such as machine-generated content, data toxicity, limited 
metadata, and the under-representation of many languages and domains. 
These challenges call for a shift toward Data-Centric AI, focusing on 
the curation, analysis, and responsible use of web-derived data.

The 13th Web-as-Corpus (WaC-13) workshop provides a multidisciplinary 
forum for research addressing the full lifecycle of web data. We invite 
submissions on methods, resources, and applications related to web 
corpora, with special emphasis on multilingual data and less-resourced 
languages.

Topics of interest include (but are not limited to):

* Creation and evaluation of high-quality datasets for foundation models 
(e.g., data collection, filtering, enrichment, language identification)
* Use of web data in empirical linguistic research
* Analysis of web-scale corpora for quality, representativeness, and 
societal insights
* Ethical and legal aspects of collecting, sharing, and using web data

By bringing together researchers from NLP, linguistics, and the social 
sciences, WaC aims to advance best practices for one of the field’s most 
influential data sources.

Important dates

Direct paper submission deadline
7 August, 2026

Pre-reviewed ARR commitment deadline
1 September, 2026

Notification of acceptance
5 September, 2026

Camera-ready paper due
20 September, 2026

Conference dates
24-29 Oct, 2026

Submissions

Submissions will be possible through ARR commitment and through 
openreview.net (more details to follow on 
https://wacky-workshop.github.io/).

Workshop Organizers

Nikola Ljubešić, Jožef Stefan Institute, Slovenia
Yves Scherrer, University of Oslo, Norway
Laurie Burchell, Common Crawl
Veronika Laippala, University of Turku, Finland
Pedro Ortiz Suarez, Common Crawl
Jen English, Common Crawl
Vuk Dinić, Jožef Stefan Institute, Slovenia

