21st Workshop on Multiword Expressions (MWE 2025)
Colocated with: NAACL-2025, Albuquerque, New Mexico, U.S.A.
Date of the Workshop: May 4, 2025
Organised and sponsored by:
The Special Interest Group on the Lexicon (SIGLEX) of the Association for Computational Linguistics (ACL), SIGLEX’s Multiword Expressions Section (SIGLEX-MWE).
Invited speaker: Nathan Schneider (Georgetown University)
Bio: Nathan Schneider is a computational linguist. As Associate Professor of Linguistics and Computer Science at Georgetown University, he leads the NERT lab, looking for synergies between practical language technologies and the scientific study of language, with an emphasis on how words, grammar, and context conspire to convey meaning. He is the recipient of an NSF CAREER award to study NLP vis-à-vis metalinguistic enterprises like language learning, linguistics, and legal interpretation. Recently, he has weighed in on specific interpretive debates in U.S. law; one of these analyses was cited by U.S. Supreme Court justices in a major firearms case. He is active in the NLP community—especially ACL’s SIGANN and SIGLEX—and the Universal Dependencies project; and cofounded the SOLID forum for empirical research on legal interpretation. Prior to Georgetown, he inhabited UC Berkeley, Carnegie Mellon University, and the University of Edinburgh. Apart from annotation scheming and computational modeling, he enjoys classical music and chocolate chip cookies.
Title: Meaning Construction at the Syntax-Lexis Nexus
Abstract: When words and grammar come into contact, things sometimes get messy: idiosyncratic expressions and patterns disobey ordinary principles of regularity and compositionality. A useful point of reference is the theoretical perspective of Construction Grammar, which exhorts us to view linguistic knowledge in terms of form-function mappings—at all levels of granularity. How can this perspective inform a broad-coverage, multilingual approach to lexicosyntactic conundrums? First, I will discuss implications for corpus annotation: while some multiword expressions and names (e.g. “at least”, “in order to”, “Chapter 1”) test the limits of categorical annotation standards like Universal Dependencies, UD treebanks nevertheless enable empirical investigation of some functionally-defined constructions across languages. Second, I will discuss efforts to interpret the latent representations of constructional form and meaning in transformer language models, with the NPN construction (noun-preposition-noun, as in “face to face”) as a case study.
To attend the workshop (either in person or virtually), please register through NAACL 2025’s registration system. Note that to attend MWE 2025, it is sufficient to select this workshop during registration; you do not have to register for the main conference.
Multiword expressions (MWEs), i.e., word combinations that exhibit lexical, syntactic, semantic, pragmatic, and/or statistical idiosyncrasies (Baldwin and Kim, 2010), such as “by and large”, “hot dog”, “make a decision” and “break one’s leg”, are still a pain in the neck for Natural Language Processing (NLP). The notion encompasses closely related phenomena: idioms, compounds, light-verb constructions, phrasal verbs, rhetorical figures, collocations, institutionalized phrases, etc. Given their irregular nature, MWEs often pose complex problems in linguistic modelling (e.g. annotation), NLP tasks (e.g. parsing), and end-user applications (e.g. natural language understanding and Machine Translation), and hence still represent an open issue for computational linguistics (Constant et al., 2017).
For more than two decades, modelling and processing MWEs for NLP has been the topic of the MWE workshop, organised by the MWE section of SIGLEX in conjunction with major NLP conferences since 2003. Impressive progress has been made in the field, but our understanding of MWEs still requires much research, given their importance and usefulness in NLP applications. This is also relevant to domain-specific NLP pipelines that need to tackle terminology, most often realised as MWEs. Following previous years, for this 21st edition of the workshop, we identified the following topics on which contributions are particularly encouraged:
- MWE processing to enhance end-user applications: MWEs have gained particular attention in end-user applications, including Machine Translation (MT) (Zaninello and Birch, 2020), simplification (Kochmar et al., 2020), language learning and assessment (Paquot et al., 2020), social media mining (Pelosi et al., 2017), and abusive language detection (Zampieri et al., 2020). We believe that it is crucial to extend and deepen these first attempts to integrate and evaluate MWE technology in these and further end-user applications.
- MWE processing and identification in the general language, as well as in specialized languages and domains: Multiword terminology extraction from domain-specific corpora (Lossio-Ventura et al., 2014) is of particular importance to various applications, such as MT (Semmar and Laib, 2017), or the identification and monitoring of neologisms and technical jargon (Chatzitheodorou and Kappatos, 2021).
- MWE processing in low-resource languages: The PARSEME shared tasks (2017, 2018, 2020), among others, have fostered significant progress in MWE identification, providing datasets that include low-resource languages, evaluation measures, and tools that now allow MWE identification to be fully integrated into end-user applications. Efforts in this direction continue (Diaz Hernandez, 2024), and some have also explored methods for the automatic interpretation of MWEs (Bhatia et al., 2018) and their processing in low-resource languages (Eder et al., 2021). Resource creation and sharing should be pursued in parallel with the development of multilingual benchmarks for MWE identification (Savary et al., 2023).
- MWE identification and interpretation in LLMs: Most current MWE processing is limited to identification and detection with pre-trained language models, but we still lack an understanding of how MWEs are represented and dealt with therein (Garcia et al., 2021), and of how to better model their compositionality from a semantic perspective (Phelps et al., 2024). Now that NLP has shifted towards end-to-end neural models like BERT, capable of solving complex tasks with little or no intermediary linguistic symbols, questions arise about the extent to which MWEs should be implicitly or explicitly modelled (Shwartz and Dagan, 2019).
- New and enhanced representation of MWEs in language resources and computational models of compositionality as gold standards for formative intrinsic evaluation.
Through this workshop, we will bring together and encourage researchers in various NLP subfields to submit their MWE-related research. We also intend to consolidate the converging results of the previous joint workshops LAW-MWE-CxG 2018, MWE-WN 2019 and MWE-LEX 2020, the joint MWE-WOAH panel in 2021, the MWE-SIGUL 2022 joint session, and MWE-UD 2024, extending our scope to MWEs in e-lexicons and WordNets, MWE annotation, as well as grammatical constructions. Correspondingly, we call for papers on research related (but not limited) to MWEs and constructions in:
- Computationally-applicable theoretical work in psycholinguistics and corpus linguistics;
- Annotation (expert, crowdsourcing, automatic) and representation in resources such as corpora, treebanks, e-lexicons, WordNets, constructions (also for low-resource languages);
- Processing in syntactic and semantic frameworks (e.g. CCG, CxG, HPSG, LFG, TAG, UD, etc.);
- Discovery and identification methods, including for specialized languages and domains such as clinical or biomedical NLP;
- Interpretation of MWEs and understanding of text containing them;
- Language acquisition, language learning, and non-standard language (e.g. tweets, speech);
- Evaluation of annotation and processing techniques;
- Retrospective comparative analyses from the PARSEME shared tasks;
- Processing for end-user applications (e.g. MT, NLU, summarisation, language learning, etc.);
- Implicit and explicit representation in pre-trained language models and end-user applications;
- Evaluation and probing of pre-trained language models;
- Resources and tools (e.g. lexicons, identifiers) and their integration into end-user applications;
- Multiword terminology extraction;
- Adaptation and transfer of annotations and related resources to new languages and domains including low-resource ones.
The workshop invites two types of submissions:
- Archival submissions that present substantially original research, in either long paper format (8 pages + references) or short paper format (4 pages + references).
- Non-archival submissions of abstracts describing relevant research presented or published elsewhere, which will not be included in the MWE 2025 proceedings.
Papers should be submitted via the OpenReview submission page. Please choose the appropriate submission format (archival/non-archival). Archival papers with existing reviews will also be accepted through ACL Rolling Review (ARR). Submissions must follow the ACL stylesheet. For further information on this initiative, please refer to NAACL 2025.
Papers that have already been reviewed through ARR can be committed here.
| What | When |
| --- | --- |
| Paper submission deadline | February 13, 2025 |
| ARR commitment deadline | February 27, 2025 |
| Notification of acceptance | March 8, 2025 |
| Camera-ready papers due | March 17, 2025 |
| Underline upload deadline | April 8, 2025 |
| Workshop | May 4, 2025 |
All deadlines are at 23:59 UTC-12 (Anywhere on Earth).
Organizing Committee
A. Seza Doğruöz | Ghent University, Belgium
Alexandre Rademaker | FGV/EMA, Brazil
Atul Kr. Ojha | Insight Research Ireland Centre for Data Analytics, University of Galway
Gražina Korvel | VU Institute of Data Science and Digital Technologies
Mathieu Constant | Université de Lorraine
Verginica Barbu Mititelu | Romanian Academy Research Institute for Artificial Intelligence
Voula Giouli | Institute for Language & Speech Processing, ATHENA RC, Greece
Program Committee
Agata Savary | Université Paris-Saclay
Beata Trawinski | Leibniz Institute for the German Language
Carlos Ramisch | LIS - Laboratoire d’Informatique et Systèmes
Chikara Hashimoto | Rakuten Institute of Technology
Cvetana Krstev | University of Belgrade, Faculty of Philology
Eric G C Laporte | Université Gustave Eiffel
Francis Bond | Palacký University Olomouc
Gaël Dias | University of Caen Normandy
Gražina Korvel | Vilnius University
Irina Lobzhanidze | Ilia Chavchavadze State University
Ismail El Maarouf | Imprevicible
Ivelina Stoyanova | Deaf Studies Institute
Jan Odijk | Utrecht University
John Philip McCrae | National University of Ireland Galway
Kenneth Church | Northeastern University
Manfred Sailer | Johann Wolfgang Goethe Universität Frankfurt am Main
Mathieu Constant | Université de Lorraine, CNRS, ATILF
Matthew Shardlow | The Manchester Metropolitan University
Meghdad Farahmand | University of Genoa
Miriam Butt | Universität Konstanz
Paul Cook | University of New Brunswick
Pavel Pecina | Charles University
Petya Osenova | Sofia University “St. Kliment Ohridski”
Ranka Stanković | University of Belgrade
Sabine Schulte im Walde | University of Stuttgart
Shiva Taslimipoor | University of Cambridge
Stan Szpakowicz | University of Ottawa
Stella Markantonatou | ATHENA RIC
Tiberiu Boros | Adobe Systems
Tunga Gungor | Bogazici University
The workshop follows the ACL anti-harassment policy.
For any inquiries regarding the workshop, please send an email to the Organizing Committee at mwe2025workshop@gmail.com.
Please register with SIGLEX and check the “MWE Section” box to be added to our mailing list.