From nljubesi at gmail.com Wed May 6 10:33:52 2026
From: nljubesi at gmail.com (Nikola Ljubešić)
Date: Wed, 6 May 2026 10:33:52 +0200
Subject: [CLASSLA] First CfP: 13th Web-as-Corpus (WaC-13) Workshop @EMNLP2026, Budapest, Hungary, 24-29 Oct, 2026
Message-ID:

First Call for Papers

13th Web-as-Corpus (WaC-13) Workshop @EMNLP2026, Budapest, Hungary, 24-29 Oct, 2026
https://wacky-workshop.github.io/

The World Wide Web has evolved from a resource for building linguistic corpora into the central data infrastructure powering modern natural language processing and Large Language Models (LLMs). As web-scale data increasingly shapes AI systems’ knowledge and capabilities, understanding its quality, representativeness, and ethical implications has become critical. At the same time, the “more is better” paradigm is being challenged by issues such as machine-generated content, data toxicity, limited metadata, and the under-representation of many languages and domains. These challenges call for a shift toward Data-Centric AI, focusing on the curation, analysis, and responsible use of web-derived data.

The 13th Web-as-Corpus (WaC-13) workshop provides a multidisciplinary forum for research addressing the full lifecycle of web data. We invite submissions on methods, resources, and applications related to web corpora, with special emphasis on multilingual data and less-resourced languages.

Topics of interest include (but are not limited to):

* Creation and evaluation of high-quality datasets for foundation models (e.g., data collection, filtering, enrichment, language identification)
* Use of web data in empirical linguistic research
* Analysis of web-scale corpora for quality, representativeness, and societal insights
* Ethical and legal aspects of collecting, sharing, and using web data

By bringing together researchers from NLP, linguistics, and the social sciences, WaC aims to advance best practices for one of the field’s most influential data sources.

Important dates:

* Direct paper submission deadline: 7 August, 2026
* Pre-reviewed ARR commitment deadline: 1 September, 2026
* Notification of acceptance: 5 September, 2026
* Camera-ready paper due: 20 September, 2026
* Conference dates: 24-29 Oct, 2026

Submissions:

Submissions will be possible through ARR commitment and through openreview.net (more details to follow on https://wacky-workshop.github.io/).

Workshop Organizers:

* Nikola Ljubešić, Jožef Stefan Institute, Slovenia
* Yves Scherrer, University of Oslo, Norway
* Laurie Burchell, Common Crawl
* Veronika Laippala, University of Turku, Finland
* Pedro Ortiz Suarez, Common Crawl
* Jen English, Common Crawl
* Vuk Dinić, Jožef Stefan Institute, Slovenia

From Nikola.Ljubesic at ijs.si Wed May 6 13:21:14 2026
From: Nikola.Ljubesic at ijs.si (Nikola Ljubešić)
Date: Wed, 06 May 2026 13:21:14 +0200
Subject: [CLASSLA] First CfP: 13th Web-as-Corpus (WaC-13) Workshop @EMNLP2026, Budapest, Hungary, 24-29 Oct, 2026
Message-ID:

Dear CLASSLA people,

find below a call for papers for a workshop that will take place in our vicinity this fall. The workshop is aimed at people collecting and processing web texts and using web texts in training LLMs, but also at researchers performing linguistic research on web texts.

Having some papers that use, inter alia, the recent CLASSLA-web corpora (https://clarinsi.github.io/classla-web/) would be great!

Best,
Nikola

First Call for Papers

13th Web-as-Corpus (WaC-13) Workshop @EMNLP2026, Budapest, Hungary, 24-29 Oct, 2026
https://wacky-workshop.github.io/

The World Wide Web has evolved from a resource for building linguistic corpora into the central data infrastructure powering modern natural language processing and Large Language Models (LLMs). As web-scale data increasingly shapes AI systems’ knowledge and capabilities, understanding its quality, representativeness, and ethical implications has become critical.
At the same time, the “more is better” paradigm is being challenged by issues such as machine-generated content, data toxicity, limited metadata, and the under-representation of many languages and domains. These challenges call for a shift toward Data-Centric AI, focusing on the curation, analysis, and responsible use of web-derived data.

The 13th Web-as-Corpus (WaC-13) workshop provides a multidisciplinary forum for research addressing the full lifecycle of web data. We invite submissions on methods, resources, and applications related to web corpora, with special emphasis on multilingual data and less-resourced languages.

Topics of interest include (but are not limited to):

* Creation and evaluation of high-quality datasets for foundation models (e.g., data collection, filtering, enrichment, language identification)
* Use of web data in empirical linguistic research
* Analysis of web-scale corpora for quality, representativeness, and societal insights
* Ethical and legal aspects of collecting, sharing, and using web data

By bringing together researchers from NLP, linguistics, and the social sciences, WaC aims to advance best practices for one of the field’s most influential data sources.

Important dates:

* Direct paper submission deadline: 7 August, 2026
* Pre-reviewed ARR commitment deadline: 1 September, 2026
* Notification of acceptance: 5 September, 2026
* Camera-ready paper due: 20 September, 2026
* Conference dates: 24-29 Oct, 2026

Submissions:

Submissions will be possible through ARR commitment and through openreview.net (more details to follow on https://wacky-workshop.github.io/).

Workshop Organizers:

* Nikola Ljubešić, Jožef Stefan Institute, Slovenia
* Yves Scherrer, University of Oslo, Norway
* Laurie Burchell, Common Crawl
* Veronika Laippala, University of Turku, Finland
* Pedro Ortiz Suarez, Common Crawl
* Jen English, Common Crawl
* Vuk Dinić, Jožef Stefan Institute, Slovenia