Joint Workshop on Multiword Expressions and Universal Dependencies (MWE-UD 2024)

Co-located with: LREC-COLING 2024 (Torino, Italia)

Date of the Workshop: May 25, 2024

Organised and sponsored by:
The Special Interest Group on the Lexicon (SIGLEX) of the Association for Computational Linguistics (ACL), SIGLEX’s Multiword Expressions Section (SIGLEX-MWE), Universal Dependencies (UD), and the UniDive COST Action CA21167.



Keynote Speakers

Natalia Levshina: Using Universal Dependencies for testing hypotheses about communicative efficiency

Abstract: There is abundant evidence that language structure and use are influenced by language users’ tendency to be efficient, trying to minimize the cost-to-benefit ratio of communication (e.g., Hawkins, 2004; Gibson et al., 2019; Levshina, 2022). In my talk I will show how data from corpora annotated with Universal Dependencies can be used to test hypotheses about the role of communicative efficiency in shaping language structure and use. The hypotheses are as follows:

  1. As discussed by typologists (Sapir, 1921; Sinnemäki, 2008), rigid word order can compensate for the lack of formal marking of core arguments. The hypothesis is then that there are positive correlations between the entropy of subject and object order in transitive clauses in a corpus and the relative frequency of disambiguating case forms or verb forms. Such a trade-off is expected to minimize the articulatory effort involved in the use of argument flags or indices.

  2. There is a positive correlation between semantic tightness (Hawkins, 1986), operationalized as Mutual Information between lexemes and syntactic roles, and the relative frequency of verb-final clauses in a corpus. Strong associations between lexemes and roles help to avoid the costs of reanalysis in verb-final languages.

  3. There is a negative correlation between the relative frequency of verb-final clauses in a corpus and the average number of overt core arguments, which helps to save the processing costs required for keeping longer dependencies in mind (cf. Ueno & Polinsky, 2009).

These hypotheses will be tested on corpus data annotated with Universal Dependencies, with the help of mixed-effects models with genealogical and geographic information as random effects.
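
To make the key measurement in the first hypothesis concrete, here is a minimal sketch (in Python; an illustration, not the speaker’s actual pipeline) of how the entropy of subject–object order in transitive clauses can be estimated from a UD treebank in CoNLL-U format. The file name `treebank.conllu` is a placeholder for any UD treebank.

```python
# Minimal sketch (not the speaker's actual code): estimating the entropy of
# subject-object order in transitive clauses from a CoNLL-U treebank.
import math
from collections import Counter

def sentences(path):
    """Yield each sentence as a list of 10-column CoNLL-U token rows."""
    with open(path, encoding="utf-8") as f:
        rows = []
        for line in f:
            line = line.rstrip("\n")
            if not line:
                if rows:
                    yield rows
                rows = []
            elif not line.startswith("#"):
                cols = line.split("\t")
                if cols[0].isdigit():  # skip multiword-token and empty-node lines
                    rows.append(cols)
        if rows:
            yield rows

def order_counts(path):
    """Count subject-before-object (SO) vs. object-before-subject (OS) clauses."""
    counts = Counter()
    for rows in sentences(path):
        args = {}  # verb id -> {"nsubj": token id, "obj": token id}
        for cols in rows:
            tid, head, deprel = int(cols[0]), cols[6], cols[7]
            if deprel in ("nsubj", "obj") and head.isdigit():
                args.setdefault(int(head), {})[deprel] = tid
        for a in args.values():
            if "nsubj" in a and "obj" in a:
                counts["SO" if a["nsubj"] < a["obj"] else "OS"] += 1
    return counts

def entropy(counts):
    """Shannon entropy (in bits) of the empirical order distribution."""
    total = sum(counts.values())
    return -sum(c / total * math.log2(c / total) for c in counts.values())

counts = order_counts("treebank.conllu")  # placeholder path to any UD treebank
print(counts, f"entropy = {entropy(counts):.3f} bits")
```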

References

Gibson, Edward, Richard Futrell, Steven P. Piantadosi, Isabelle Dautriche, Kyle Mahowald, Leon Bergen & Roger Levy. 2019. How efficiency shapes human language. *Trends in Cognitive Sciences* 23(5): 389–407. https://doi.org/10.1016/j.tics.2019.02.003

Hawkins, John A. 1986. *A Comparative Typology of English and German: Unifying the Contrasts*. London: Croom Helm.

Hawkins, John A. 2004. *Efficiency and Complexity in Grammars*. Oxford: Oxford University Press.

Levshina, Natalia. 2022. *Communicative Efficiency: Language Structure and Use*. Cambridge: Cambridge University Press.

Sapir, Edward. 1921. *Language: An Introduction to the Study of Speech*. New York: Harcourt.

Sinnemäki, Kaius. 2008. Complexity trade-offs in core argument marking. In: Matti Miestamo, Kaius Sinnemäki and Fred Karlsson (eds.), *Language Complexity: Typology, Contact, Change*, 67–88. Amsterdam: John Benjamins.

Ueno, Mieko & Maria Polinsky. 2009. Does headedness affect processing? A new look at the VO-OV contrast. *Journal of Linguistics* 45: 675–710.

Bio:

Harish Tayyar Madabushi: Every Time We Hire an LLM, the Reasoning Performance of the Linguists Goes Up

Abstract: Pre-trained language models (PLMs), trained on the cloze-like task of masked language modelling, have demonstrated access to a broad range of linguistic information, including both syntax and semantics. Given this access, coupled with their data-driven foundations, which align with usage-based theories, it is valuable to examine the constructional information they encode. Early work confirmed that these models have access to a substantial amount of constructional information. However, more recent research focusing on the types of constructions PLMs can accurately interpret, and those they find challenging, suggests that an increase in schematicity correlates with a decline in model proficiency. Crucially, schematicity (the extent to which constructional slots are fixed or instead admit a range of elements satisfying a particular semantic role associated with the slot) correlates with the extent of “reasoning” needed to interpret constructions, a task that poses significant challenges for language models. In this talk, I will begin by reviewing the constructional information encoded in both earlier models and more recent large language models. I will explore how these aspects are intertwined with the models’ reasoning abilities and introduce promising new approaches that could integrate theoretical insights from linguistics with the practical, data-driven strengths of PLMs.
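
As a concrete illustration of the cloze-like masked language modelling task mentioned above, the snippet below (an illustrative probe, not the speaker’s experimental setup) asks a masked language model for its top candidates for the open slot of a largely fixed idiom, using the Hugging Face `transformers` library and the `bert-base-uncased` checkpoint:

```python
# Illustrative cloze-style probe; not the speaker's experimental setup.
# Requires: pip install transformers torch
from transformers import pipeline

unmasker = pipeline("fill-mask", model="bert-base-uncased")

# Print the model's top candidates for the masked idiom slot, with scores.
for pred in unmasker("She let the cat out of the [MASK]."):
    print(f"{pred['token_str']:>12}  {pred['score']:.3f}")
```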

Bio: Dr. Tayyar Madabushi’s research focuses on understanding the fundamental mechanisms that underpin the performance and functioning of Large Language Models. His work on LLMs was included in the discussion paper on the Capabilities and Risks of Frontier AI, which was used as one of the foundational research works for discussions at the UK AI Safety Summit held at Bletchley Park. His research on the constructional information encoded in language models has been influential in bringing together the fields of construction grammar and pre-trained language models. In addition, his work on language models includes collaborative industrial research aimed at rectifying biases in speech-to-text systems widely utilised across the UK. Before starting his PhD in automated question answering at the University of Birmingham, Dr. Tayyar Madabushi founded and headed a social media data analytics company based in Singapore.


Accepted papers

Long Papers

MWE

Assessing BERT’s sensitivity to idiomaticity
Li Liu and Francois Lareau

Identification and Annotation of Body Part Multiword Expressions in Old Egyptian
Roberto Díaz Hernández

Lexicons Gain the Upper Hand in Arabic MWE Identification
Najet Hadj Mohamed, Agata Savary, Cherifa Ben Khelil, Jean-Yves Antoine, Iskandar Keskes and Lamia Hadrich-Belguith

Revisiting VMWEs in Hindi: Annotating Layers of Predication
Kanishka Jain and Ashwini Vaidya

Towards the semantic annotation of SR-ELEXIS corpus: Insights into Multiword Expressions and Named Entities
Cvetana Krstev, Ranka Stanković, Aleksandra M. Marković and Teodora Sofija Mihajlov

To Leave No Stone Unturned: Annotating Verbal Idioms in the Parallel Meaning Bank
Rafael Ehren, Kilian Evang and Laura Kallmeyer

Sign of the Times: Evaluating the use of Large Language Models for Idiomaticity Detection
Dylan Phelps, Thomas M. R. Pickard, Maggie Mi, Edward Gow-Smith and Aline Villavicencio

BERT-based Idiom Identification using Language Translation and Word Cohesion
Arnav Yayavaram, Siddharth Yayavaram, Prajna Devi Upadhyay and Apurba Das

UD

Automatic Manipulation of Training Corpora to Make Parsers Accept Real-world Text
Hiroshi Kanayama, Ran Iwamoto, Masayasu Muraoka, Takuya Ohko and Kohtaroh Miyamoto

Universal Feature-based Morphological Trees
Federica Gamba, Abishek Stephen and Zdeněk Žabokrtský

Universal Dependencies for Saraiki
Meesum Alam, Francis Tyers, Emily Hanink and Sandra Kübler

Strategies for the Annotation of Pronominalised Locatives in Turkic Universal Dependency Treebanks
Jonathan Washington, Çağrı Çöltekin, Furkan Akkurt, Bermet Chontaeva, Soudabeh Eslami, Gulnura Jumalieva, Aida Kasieva, Aslı Kuzgun, Büşra Marşan and Chihiro Taguchi

MWE+UD

Fitting Fixed Expressions into the UD Mould: Swedish as a Use Case
Lars Ahrenberg

Part-of-Speech Tagging for Northern Kurdish
Peshmerge Morad, Sina Ahmadi and Lorenzo Gatti

Combining Grammatical and Relational Approaches. A Hybrid Method for the Identification of Candidate Collocations from Corpora
Damiano Perri, Irene Fioravanti, Osvaldo Gervasi and Stefania Spina

Annotation of Multiword Expressions in the SUK 1.0 Training Corpus of Slovene: Lessons Learned and Future Steps
Jaka Čibej, Polona Gantar and Mija Bon

Light Verb Constructions in Universal Dependencies for South Asian Languages
Abishek Stephen and Daniel Zeman

Ad Hoc Compounds for Stance Detection
Qi Yu, Fabian Schlotterbeck, Henning Wang, Naomi Reichmann, Britta Stolterfoht, Regine Eckardt and Miriam Butt

Short Papers

MWE

A demonstration of MWE-Finder and MWE-Annotator
Jan Odijk, Martin Kroon, Tijmen Baarda, Ben Bonfil and Sheean Spoel

Annotating Compositionality Scores for Irish Noun Compounds is Hard Work
Abigail Walsh, Teresa Clifford, Emma Daly, Jane Dunne, Brian Davis and Gearóid Ó Cleircín

UD

Synthetic-Error Augmented Parsing of Swedish as a Second Language: Experiments with Word Order
Arianna Masciolini, Emilie Francis and Maria Irena Szawerna

A Universal Dependencies Treebank for Gujarati
Mayank Jobanputra, Maitrey Mehta and Çağrı Çöltekin

Overcoming Early Saturation on Low-Resource Languages in Multilingual Dependency Parsing
Jiannan Mao, Chenchen Ding, Hour Kaing, Hideki Tanaka, Masao Utiyama and Tadahiro Matsumoto

Domain-Weighted Batch Sampling for Neural Dependency Parsing
Jacob Striebel, Daniel Dakota and Sandra Kübler

Redefining Syntactic and Morphological Tasks for Typologically Diverse Languages
Omer Goldman, Leonie Weissweiler and Reut Tsarfaty

MWE+UD

The Vedic Compound Dataset
Sven Sellmer and Oliver Hellwig

Diachronic Analysis of Multi-word Expression Functional Categories in Scientific English
Diego Alves, Stefania Degaetano-Ortlieb, Elena Schmidt and Elke Teich

Multiword Expressions between the Corpus and the Lexicon: Universality, Idiosyncrasy, and the Lexicon-Corpus Interface
Verginica Barbu Mititelu, Voula Giouli, Kilian Evang, Daniel Zeman, Petya Osenova, Carole Tiberius, Simon Krek, Stella Markantonatou, Ivelina Stoyanova, Ranka Stanković and Christian Chiarcos

Nonarchival Presentations

These papers are either already published elsewhere or work in progress. They have not undergone MWE-UD 2024’s formal peer review process and are not included in the proceedings, only listed in the programme. Some nonarchival presentations were added to the programme after the proceedings were finalized and are therefore not listed in the version of the programme included in the proceedings.

MWE

MultiMWE: Building a Multi-lingual Multi-Word Expression (MWE) Parallel Corpora
Lifeng Han

AlphaMWE-Arabic: Arabic Edition of Multilingual Parallel Corpora with Multiword Expression Annotations
Najet Hadj Mohamed, Malak Rassem, Lifeng Han and Goran Nenadic

A demonstration of MWE-Finder and MWE-Annotator
Jan Odijk, Martin Kroon, Tijmen Baarda, Ben Bonfil and Sheean Spoel

Is Less More? Quality, Quantity and Context in Idiom Processing with Natural Language Models
Agne Knietaite, Adam Allsebrook, Anton Minkov, Adam Tomaszewski, Norbert Slinko, Richard Johnson, Thomas M. R. Pickard and Aline Villavicencio

Annotating Compositionality Scores for Irish Noun Compounds is Hard Work
Abigail Walsh, Teresa Clifford, Emma Daly, Jane Dunne, Brian Davis and Gearóid Ó Cleircín

UD

MaiBaam: A Multi-Dialectal Bavarian Universal Dependency Treebank
Verena Blaschke, Barbara Kovačić, Siyao Peng, Hinrich Schütze and Barbara Plank

Redefining Syntactic and Morphological Tasks for Typologically Diverse Languages
Omer Goldman, Leonie Weissweiler and Reut Tsarfaty

Sparse Logistic Regression with High-order Features for Automatic Grammar Rule Extraction from Treebanks
Santiago Herrera, Caio Corro and Sylvain Kahane

Joint Annotation of Morphology and Syntax in Dependency Treebanks
Bruno Guillaume, Kim Gerdes, Kirian Guiller, Sylvain Kahane and Yixuan Li

MWE+UD

A Corpus of Persian Sentences Annotated with Verbal Multiword Expressions: Development and Guidelines
Vahide Tajalli, Yaldasadat Yarandi, Mahtab Sarlak, Mehrnoush Shamsfard and Arezoo Haghbin

UCxn: Typologically Informed Annotation of Constructions Atop Universal Dependencies
Leonie Weissweiler, Nina Böbel, Kirian Guiller, Santiago Herrera, Wesley Scivetti, Arthur Lorenzi, Nurit Melnik, Archna Bhatia, Hinrich Schütze, Lori Levin, Amir Zeldes, Joakim Nivre, William Croft and Nathan Schneider


Description

Multiword expressions (MWEs) are word combinations that exhibit lexical, syntactic, semantic, pragmatic, and/or statistical idiosyncrasies (Baldwin and Kim, 2010), such as *by and large*, *hot dog*, *pay a visit* and *pull someone’s leg*. The notion encompasses closely related phenomena: idioms, compounds, light-verb constructions, phrasal verbs, rhetorical figures, collocations, institutionalized phrases, etc. Their behavior is often unpredictable; for example, their meaning often does not result from the direct combination of the meanings of their parts. Given their irregular nature, MWEs often pose complex problems in linguistic modeling (e.g. annotation), NLP tasks (e.g. parsing), and end-user applications (e.g. natural language understanding and MT), and thus remain an open issue for computational linguistics (Constant et al., 2017).

Universal Dependencies (UD; De Marneffe et al., 2021) is a framework for cross-linguistically consistent treebank annotation that has so far been applied to over 100 languages. The framework aims to capture similarities as well as idiosyncrasies among typologically different languages (e.g., morphologically rich languages, pro-drop languages, and languages featuring clitic doubling). The goal in developing UD was not only to support comparative evaluation and cross-lingual learning but also to facilitate multilingual natural language processing and enable comparative linguistic studies.

After independently running successful workshop series, the MWE and UD communities are now joining forces to organize a joint workshop. This is a timely collaboration because the two communities clearly have overlapping interests. For instance, while UD has several dependency relations that can be used to annotate MWEs, both the annotation guidelines (e.g., should syntactic irregularity and inflexibility or semantic non-compositionality be the leading criterion?) and annotation practice (both across treebanks for a single language and across languages) for these relations can be improved (Schneider and Zeldes, 2021). The PARSEME MWE-annotated corpora for 26 languages build on UD-annotated corpora (Savary et al., 2023). Both communities share an interest in developing guidelines, datasets, and tools that can be applied to a wide range of typologically diverse languages, raising fundamental questions about tokenization, lemmatization, and the morphological decomposition of tokens. Proposals for harmonizing annotation practice between what has been achieved in PARSEME and UD, and for expanding PARSEME MWE annotation to non-verbal MWEs, are also central to the recently started UniDive COST Action (CA21167).
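
As a concrete illustration, UD attaches the tokens of a fixed grammaticized expression to its first token via the `fixed` relation. The hand-constructed, simplified CoNLL-U fragment below (only the ID, FORM, UPOS, HEAD, and DEPREL columns are shown; exact tags and attachments may differ across treebanks) sketches this for *by and large*:

```
# text = By and large, the results are consistent.
1   By          ADP     8   advmod
2   and         CCONJ   1   fixed
3   large       ADJ     1   fixed
4   ,           PUNCT   8   punct
5   the         DET     6   det
6   results     NOUN    8   nsubj
7   are         AUX     8   cop
8   consistent  ADJ     0   root
9   .           PUNCT   8   punct
```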

The workshop invites submissions of original research on MWEs, UD, and the interplay between the two. The following topics are especially relevant:


Submission Formats

The workshop invites two types of submissions:



Paper Submission and Templates

Papers should be submitted via the workshop’s START submission page. Please choose the appropriate submission format (archival/non-archival). Submissions must follow the LREC-COLING 2024 stylesheet.

When submitting a paper from the START page, authors will be asked to provide essential information about resources (in a broad sense, i.e. also technologies, standards, evaluation kits, etc.) that were used for the research described in the paper or are a new result of that research. Moreover, ELRA encourages all LREC-COLING authors to share the described LRs (data, tools, services, etc.) to enable their reuse and the replicability of experiments (including evaluation experiments).

Archival papers with existing reviews from ACL Rolling Review (ARR) will also be considered. Note that a paper may not be under review through ARR and MWE-UD simultaneously, and a paper that has received or will receive reviews through ARR may not be submitted to MWE-UD’s own review process.


Best Paper Award and Travel Grants


Important Dates

What                          When
Paper submission deadline     March 3, 2024
ARR commitment deadline       March 25, 2024
Notification of acceptance    April 1, 2024
Camera-ready papers due       April 8, 2024
Underline upload deadline     TBD
Workshop                      May 25, 2024

All deadlines are at 23:59 UTC-12 (Anywhere on Earth).


Organizing Committee

Archna Bhatia Institute for Human and Machine Cognition, USA
Gosse Bouma University of Groningen, Netherlands
A. Seza Doğruöz Ghent University, Belgium
Kilian Evang Heinrich Heine University Düsseldorf, Germany
Marcos Garcia University of Santiago de Compostela, Galiza, Spain
Voula Giouli Institute for Language & Speech Processing, ATHENA RC, Greece
Lifeng Han University of Manchester, UK
Joakim Nivre Uppsala University and Research Institutes of Sweden, Sweden
Alexandre Rademaker IBM Research, Brazil

Program Committee

Verginica Barbu Mititelu Romanian Academy
Cherifa Ben Khelil University of Tours
Philippe Blache Aix-Marseille University
Francis Bond Palacký University
Claire Bonial U.S. Army Research Laboratory
Julia Bonn University of Colorado Boulder
Tiberiu Boroș Adobe
Marie Candito Université Paris Cité
Giuseppe G. A. Celano Leipzig University
Kenneth Church Baidu
Çağrı Çöltekin University of Tübingen
Mathieu Constant Université de Lorraine
Monika Czerepowicka University of Warmia and Mazury
Daniel Dakota Indiana University
Miryam de Lhoneux KU Leuven
Marie-Catherine de Marneffe UC Louvain
Valeria de Paiva Nuance
Gaël Dias University of Caen Basse-Normandie
Kaja Dobrovoljc University of Ljubljana
Rafael Ehren Heinrich Heine University Düsseldorf
Gülşen Eryiğit Istanbul Technical University
Meghdad Farahmand Berlin, Germany
Christiane Fellbaum Princeton University
Jennifer Foster Dublin City University
Aggeliki Fotopoulou Institute for Language and Speech Processing, ATHENA RC
Stefan Th. Gries UC Santa Barbara & JLU Giessen
Bruno Guillaume Université de Lorraine
Tunga Güngör Boğaziçi University
Eleonora Guzzi Universidade da Coruña
Laura Kallmeyer Heinrich Heine University Düsseldorf
Cvetana Krstev University of Belgrade
Timm Lichte University of Tübingen
Irina Lobzhanidze Ilia State University
Teresa Lynn ADAPT Centre
Stella Markantonatou Institute for Language & Speech Processing, ATHENA RC
John P. McCrae National University of Ireland, Galway
Nurit Melnik The Open University of Israel
Johanna Monti “L’Orientale” University of Naples
Dmitry Nikolaev University of Manchester
Jan Odijk University of Utrecht
Petya Osenova Bulgarian Academy of Sciences
Yannick Parmentier University of Lorraine
Agnieszka Patejuk University of Oxford and Institute of Computer Science, Polish Academy of Sciences
Pavel Pecina Charles University
Ted Pedersen University of Minnesota
Prokopis Prokopidis Institute for Language and Speech Processing, ATHENA RC
Manfred Sailer Goethe-Universität Frankfurt am Main
Tanja Samardžić University of Zurich
Agata Savary Université Paris-Saclay
Nathan Schneider Georgetown University
Sabine Schulte im Walde University of Stuttgart
Sebastian Schuster Saarland University
Matthew Shardlow University of Manchester
Joaquim Silva Universidade NOVA de Lisboa
Maria Simi Università di Pisa
Ranka Stanković University of Belgrade
Ivelina Stoyanova Bulgarian Academy of Sciences
Stan Szpakowicz University of Ottawa
Shiva Taslimipoor University of Cambridge
Beata Trawinski Leibniz Institute for the German Language
Ashwini Vaidya Indian Institute of Technology
Marion Di Marco Ludwig Maximilian University of Munich
Amir Zeldes Georgetown University
Daniel Zeman Charles University

Sponsors and Support

ACL SIGLEX
UniDive
UD
LREC-COLING 2024
COST Action
EU

Anti-harassment Policy

The workshop follows the LREC-COLING 2024 anti-harassment policy.


Contact

For any inquiries regarding the workshop, please send an email to the Organizing Committee at mweud2024-organizers@uni-duesseldorf.de.

Please register with SIGLEX and check the “MWE Section” box to join our mailing list.