EACL 2021

Human Evaluation of NLP Systems (HumEval)

Proceedings of the Workshop

April 19, 2021
Online

©2021 The Association for Computational Linguistics

Order copies of this and other ACL proceedings from:
Association for Computational Linguistics (ACL)
209 N. Eighth Street
Stroudsburg, PA 18360
USA
Tel: +1-570-476-8006
Fax: +1-570-476-0860
[email protected]

ISBN 978-1-954085-10-7

Introduction

Welcome to HumEval 2021! We are pleased to present the first workshop on Human Evaluation of NLP Systems (HumEval), taking place virtually as part of the 16th Conference of the European Chapter of the Association for Computational Linguistics (EACL 2021).

Human evaluation plays an important role in NLP, from large-scale crowd-sourced evaluations to the much smaller experiments routinely reported in conference papers. With this workshop we wish to create a forum for current human evaluation research: a space for researchers working with human evaluations to exchange ideas and begin to address the issues that human evaluation in NLP currently faces, including aspects of experimental design, reporting standards, meta-evaluation and reproducibility.

The HumEval workshop accepted 9 submissions as long papers and 6 as short papers. The accepted papers cover a broad range of NLP areas where human evaluation is used: natural language generation, machine translation, summarisation, dialogue, and word embeddings. There are also papers dealing with evaluation practices and methodology in NLP.

This workshop would not have been possible without the hard work of the programme committee. We would like to express our gratitude to them for writing detailed and thoughtful reviews in a very constrained span of time. We also thank our invited speakers, Lucia Specia and Margaret Mitchell, for their contribution to our programme. As the workshop is part of EACL, we appreciated the help of the EACL Workshop Chairs, Jonathan Berant and Angeliki Lazaridou, and of the EACL Publication Chairs, Valerio Basile and Tommaso Caselli, and we are grateful to all the people involved in setting up the virtual infrastructure.

You can find more details about the workshop on its website: https://humeval.github.io/.

Anya, Shubham, Yvette, Ehud, Anastasia

Organising Committee:

Anya Belz, University of Brighton, UK
Shubham Agarwal, Heriot-Watt University, UK
Yvette Graham, Trinity College Dublin, Ireland
Ehud Reiter, University of Aberdeen, UK
Anastasia Shimorina, Université de Lorraine / LORIA, Nancy, France

Programme Committee:

Mohit Bansal, UNC Chapel Hill, US
Jackie Chi Kit Cheung, McGill University, Canada
Kees van Deemter, Utrecht University, the Netherlands
Ondřej Dušek, Charles University, Czechia
Anette Frank, University of Heidelberg, Germany
Albert Gatt, Malta University, Malta
Dimitra Gkatzia, Edinburgh Napier University, UK
Helen Hastie, Heriot-Watt University, UK
Behnam Hedayatnia, Amazon, US
David M. Howcroft, Heriot-Watt University, UK
Samuel Läubli, University of Zurich, Switzerland
Chris van der Lee, Tilburg University, the Netherlands
Qun Liu, Huawei Noah’s Ark Lab, China
Saad Mahamood, Trivago, Germany
Nitika Mathur, University of Melbourne, Australia
Margot Mieskes, University of Applied Sciences, Darmstadt, Germany
Emiel van Miltenburg, Tilburg University, the Netherlands
Mathias Müller, University of Zurich, Switzerland
Malvina Nissim, Groningen University, the Netherlands
Juri Opitz, University of Heidelberg, Germany
Ramakanth Pasunuru, UNC Chapel Hill, US
Maxime Peyrard, EPFL, Switzerland
Inioluwa Deborah Raji, Mozilla Foundation, US
Samira Shaikh, UNC Charlotte, US
Wei Zhao, TU Darmstadt, Germany

Secondary Reviewers:

Antonio Toral, University of Groningen, the Netherlands

Invited Speakers:

Margaret Mitchell
Lucia Specia, Imperial College London

Invited Speaker: Lucia Specia, Imperial College London

Disagreement in Human Evaluation: Blame the Task not the Annotators

Abstract: It is well known that human evaluators are prone to disagreement and that this is a problem for the reliability and reproducibility of evaluation experiments. The reasons for disagreement can fall into two broad categories: (1) the human evaluator, including under-trained, under-incentivised, lacking expertise, or ill-intended individuals, e.g., cheaters; and (2) the task, including ill-definition, poor guidelines, suboptimal setup, or inherent subjectivity. While in an ideal evaluation experiment many of these elements will be controlled for, I argue that task subjectivity is a much harder issue. In this talk I will cover a number of evaluation experiments on tasks with variable degrees of subjectivity, discuss their levels of disagreement along with other issues, and cover a few practical approaches to address them. I hope this will lead to an open discussion on possible strategies and directions to alleviate this problem.

Invited Speaker: Margaret Mitchell

The Ins and Outs of Ethics-Informed Evaluation

Abstract: The modern train/test paradigm in Artificial Intelligence (AI) and Machine Learning (ML) narrows what we can understand about AI models, and skews our understanding of models’ robustness in different environments. In this talk, I will work through the different factors involved in ethics-informed AI evaluation, including connections to ML training and ML fairness, and present an overarching evaluation protocol that addresses a multitude of considerations in developing ethical AI.

Table of Contents

It’s Commonsense, isn’t it? Demystifying Human Evaluations in Commonsense-Enhanced NLG Systems
Miruna-Adriana Clinciu, Dimitra Gkatzia and Saad Mahamood

Estimating Subjective Crowd-Evaluations as an Additional Objective to Improve Natural Language Generation
Jakob Nyberg, Maike Paetzel and Ramesh Manuvinakurike

Trading Off Diversity and Quality in Natural Language Generation
Hugh Zhang, Daniel Duckworth, Daphne Ippolito and Arvind Neelakantan

Towards Document-Level Human MT Evaluation: On the Issues of Annotator Agreement, Effort and Misevaluation
Sheila Castilho
Is This Translation Error Critical?: Classification-Based Human and Automatic Machine Translation Evaluation Focusing on Critical Errors
Katsuhito Sudoh, Kosuke Takahashi and Satoshi Nakamura

Towards Objectively Evaluating the Quality of Generated Medical Summaries
Francesco Moramarco, Damir Juric, Aleksandar Savkov and Ehud Reiter

A Preliminary Study on Evaluating Consultation Notes With Post-Editing
Francesco Moramarco, Alex Papadopoulos Korfiatis, Aleksandar Savkov and Ehud Reiter

The Great Misalignment Problem in Human Evaluation of NLP Methods
Mika Hämäläinen and Khalid Alnajjar

A View From the Crowd: Evaluation Challenges for Time-Offset Interaction Applications
Alberto Chierici and Nizar Habash

Reliability of Human Evaluation for Text Summarization: Lessons Learned and Challenges Ahead
Neslihan Iskender, Tim Polzehl and Sebastian Möller

On User Interfaces for Large-Scale Document-Level Human Evaluation of Machine Translation Outputs
Roman Grundkiewicz, Marcin Junczys-Dowmunt, Christian Federmann and Tom Kocmi

Eliciting Explicit Knowledge From Domain Experts in Direct Intrinsic Evaluation of Word Embeddings for Specialized Domains
Goya van Boven and Jelke Bloem

Detecting Post-Edited References and Their Effect on Human Evaluation
Věra Kloudová, Ondřej Bojar and Martin Popel

A Case Study of Efficacy and Challenges in Practical Human-in-Loop Evaluation of NLP Systems Using Checklist
Shaily Bhatt, Rahul Jain, Sandipan Dandapat and Sunayana Sitaram

Interrater Disagreement Resolution: A Systematic Procedure to Reach Consensus in Annotation Tasks
Yvette Oortwijn, Thijs Ossenkoppele and Arianna Betti

Workshop Program

Monday, April 19, 2021

9:00–9:10 Opening
Anya Belz

9:10–10:00 Invited Talk: Lucia Specia

10:00–11:00 Oral Session 1: NLG

10:00–10:20 It’s Commonsense, isn’t it? Demystifying Human Evaluations in Commonsense-Enhanced NLG Systems
Miruna-Adriana Clinciu, Dimitra Gkatzia and Saad Mahamood

10:20–10:40 Estimating Subjective Crowd-Evaluations as an Additional Objective to Improve Natural Language Generation
Jakob Nyberg, Maike Paetzel and Ramesh Manuvinakurike

10:40–11:00 Trading Off Diversity and Quality in Natural Language Generation
Hugh Zhang, Daniel Duckworth, Daphne Ippolito and Arvind Neelakantan

11:00–11:30 Break

11:30–12:10 Oral Session 2: MT

11:30–11:50 Towards Document-Level Human MT Evaluation: On the Issues of Annotator Agreement, Effort and Misevaluation
Sheila Castilho

11:50–12:10 Is This Translation Error Critical?: Classification-Based Human and Automatic Machine Translation Evaluation Focusing on Critical Errors
Katsuhito Sudoh, Kosuke Takahashi and Satoshi Nakamura

12:10–13:30 Poster Session

Towards Objectively Evaluating the Quality of Generated Medical Summaries
Francesco Moramarco, Damir Juric, Aleksandar Savkov and Ehud Reiter

A Preliminary Study on Evaluating Consultation Notes With Post-Editing
Francesco Moramarco, Alex Papadopoulos Korfiatis, Aleksandar Savkov and Ehud Reiter

The Great Misalignment Problem in Human Evaluation of NLP Methods
Mika Hämäläinen and Khalid Alnajjar

A View From the Crowd: Evaluation Challenges for Time-Offset Interaction Applications
Alberto Chierici and Nizar Habash

Reliability of Human Evaluation for Text Summarization: Lessons Learned and Challenges Ahead
Neslihan Iskender, Tim Polzehl and Sebastian Möller

On User Interfaces for Large-Scale Document-Level Human Evaluation of Machine Translation Outputs
Roman Grundkiewicz, Marcin Junczys-Dowmunt, Christian Federmann and Tom Kocmi

Eliciting Explicit Knowledge From Domain Experts in Direct Intrinsic Evaluation of Word Embeddings for Specialized Domains
Goya van Boven and Jelke Bloem

Detecting Post-Edited References and Their Effect on Human Evaluation
Věra Kloudová, Ondřej Bojar and Martin Popel

13:30–15:00 Lunch

15:00–15:40 Oral Session 3

15:00–15:20 A Case Study of Efficacy and Challenges in Practical Human-in-Loop Evaluation of NLP Systems Using Checklist
Shaily Bhatt, Rahul Jain, Sandipan Dandapat and Sunayana Sitaram

15:20–15:40 Interrater Disagreement Resolution: A Systematic Procedure to Reach Consensus in Annotation Tasks
Yvette Oortwijn, Thijs Ossenkoppele and Arianna Betti

15:40–16:40 Discussion Panel
Ehud Reiter

16:40–17:00 Break

17:00–17:50 Invited Talk: Margaret Mitchell

17:50–18:00 Closing
Yvette Graham

It’s Common Sense, isn’t it? Demystifying Human Evaluations in Commonsense-enhanced NLG systems

Miruna Clinciu1*, Dimitra Gkatzia2*, and Saad Mahamood3*
1 Heriot-Watt University, Edinburgh, Scotland, UK
2 Edinburgh Napier University, Edinburgh, Scotland, UK
3 trivago N.V., Düsseldorf, Germany
Corresponding author: [email protected]
* Equal Contribution

Abstract

Common sense is an integral part of human cognition which allows us to make sound decisions, communicate effectively with others and interpret situations and utterances. Endowing AI systems with commonsense knowledge capabilities will help us get closer to creating systems that exhibit human intelligence.
Recent efforts in Natural Language Generation (NLG) have focused on incorporating commonsense knowledge through large-scale pre-trained language models or by incorporating external knowledge bases. Such systems exhibit reasoning capabilities without common sense being explicitly encoded in the training set. These systems require careful evaluation, as they incorporate additional resources during training, which adds additional sources of error. Additionally, human evaluation of such systems can have significant variation, making it impossible to compare different systems and define baselines. This paper aims to demystify human evaluations of commonsense-enhanced NLG systems by proposing the Commonsense Evaluation Card (CEC), a set of recommendations for evaluation reporting of commonsense-enhanced NLG systems, underpinned by an extensive analysis of human evaluations reported in the recent literature.

1 Introduction

Commonsense knowledge is vital for human communication, as it helps us make inferences without explicitly mentioning the context. Recently, there has been an interest in developing Natural Language Generation (NLG) systems that exhibit commonsense abilities (e.g. Lin et al., 2020). Although everyone understands what common sense is, defining it remains a challenge as it is highly context-dependent. Common sense can be defined as “simple wisdom” (Oxford English Dictionary online), “the ability to use good judgment in making decisions and to live in a reasonable and safe way” (Cambridge Dictionary), or as a “sound and prudent judgment based on a simple perception of the situation or facts” (Merriam-Webster). Common sense involves language understanding and reasoning abilities, representing a key factor for establishing effective interactions between humans and machines (Minsky, 1991). In his pioneering work, McCarthy (1959) proposes that “a program has common sense if it automatically deduces for itself a sufficiently wide class of immediate consequences of anything it is told and what it already knows”.

Traditionally, commonsense knowledge has been injected into NLG systems either implicitly in the form of rules and/or explicitly with semantic representations in the form of external knowledge bases or ontologies. For instance, expert domain NLG systems (such as the BabyTalk system (Portet et al., 2008)) have incorporated external knowledge in the form of a clinical ontology. In these expert domain NLG systems, knowledge (which might include procedural knowledge) is represented in rules that are built into the system and have been acquired from experts via interviews, observations or other approaches (Reiter et al., 2003). Most recent challenges have focused on injecting commonsense knowledge into neural NLG models in two ways: through pre-trained models and through utilising commonsense graphs or knowledge bases. The former assumes that pre-trained models already contain commonsense knowledge (Petroni et al., 2019). The latter incorporates entity relationships derived from semantic graphs (e.g. ConceptNet (Speer et al., 2016)) or knowledge bases (e.g. Sydorova et al., 2019).

It is clear that the incorporation of external knowledge of some form has always been at the heart of NLG system development. In this paper,