A Method for Automatic Evaluation of NLP Test Cases - AEON

2022 • 13 Pages • 870.43 KB • English
Posted June 30, 2022 • Submitted by pdf.user

Summary of A Method for Automatic Evaluation of NLP Test Cases - AEON

AEON: A Method for Automatic Evaluation of NLP Test Cases

Jen-tse Huang, The Chinese University of Hong Kong, Hong Kong, China
Jianping Zhang, The Chinese University of Hong Kong, Hong Kong, China
Wenxuan Wang, The Chinese University of Hong Kong, Hong Kong, China
Pinjia He∗, The Chinese University of Hong Kong, Shenzhen, Shenzhen, China
Yuxin Su, Sun Yat-sen University, China
Michael R. Lyu, The Chinese University of Hong Kong, Hong Kong, China

ABSTRACT
Due to the labor-intensive nature of manual test oracle construction, various automated testing techniques have been proposed to enhance the reliability of Natural Language Processing (NLP) software. In theory, these techniques mutate an existing test case (e.g., a sentence with its label) and assume the generated one preserves an equivalent or similar semantic meaning and thus the same label. However, in practice, many of the generated test cases fail to preserve similar semantic meaning and are unnatural (e.g., they contain grammar errors), which leads to a high false alarm rate and unnatural test cases. Our evaluation study finds that 44% of the test cases generated by the state-of-the-art (SOTA) approaches are false alarms. These test cases require extensive manual checking effort, and instead of improving NLP software, they can even degrade NLP software when utilized in model training. To address this problem, we propose AEON for Automatic Evaluation Of NLP test cases. For each generated test case, it outputs scores based on semantic similarity and language naturalness. We employ AEON to evaluate test cases generated by four popular testing techniques on five datasets across three typical NLP tasks. The results show that AEON aligns the best with human judgment. In particular, AEON achieves the best average precision in detecting semantically inconsistent test cases, outperforming the best baseline metric by 10%.
In addition, AEON also has the highest average precision in finding unnatural test cases, surpassing the baselines by more than 15%. Moreover, model training with test cases prioritized by AEON leads to models that are more accurate and robust, demonstrating AEON's potential in improving NLP software.

CCS CONCEPTS
• Software and its engineering → Software testing and debugging.

∗Corresponding author.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.
ISSTA 2022, 18-22 July, 2022, Daejeon, South Korea
© 2022 Association for Computing Machinery.
ACM ISBN 978-1-4503-XXXX-X/18/06...$15.00
https://doi.org/10.1145/1122445.1122456

KEYWORDS
NLP software testing, test case quality

ACM Reference Format:
Jen-tse Huang, Jianping Zhang, Wenxuan Wang, Pinjia He, Yuxin Su, and Michael R. Lyu. 2022. AEON: A Method for Automatic Evaluation of NLP Test Cases. In ISSTA '22: ACM SIGSOFT International Symposium on Software Testing and Analysis, 18-22 July, 2022, Daejeon, South Korea. ACM, New York, NY, USA, 13 pages. https://doi.org/10.1145/1122445.1122456

1 INTRODUCTION
NLP software has become increasingly popular in our daily lives. For example, NLP virtual assistant software, such as Siri and Alexa, receives billions of requests [48, 64] while the Google Translate App translates more than 100 billion words per day [69]. With the development of Deep Neural Networks (DNNs), the performance of NLP software has been largely boosted.
Equipped with the SOTA model [70], Microsoft's question answering robot surpasses humans on the conversational question answering task. In addition, the performance of machine comprehension [19], text generation [41] and machine translation [73] has been significantly improved. However, NLP software can produce erroneous results, leading to misunderstanding, financial loss, threats to personal safety, and political conflicts [53, 54].

To discover erroneous behaviors in NLP software, researchers have designed various software testing techniques [10, 25, 40, 63, 81]. A test case for NLP software is in the form of a text (e.g., a sentence) and its label, where the label is the expected correct output of the NLP software. In theory, most of these testing techniques modify part(s) of the input text (e.g., a word/character substitution/insertion/deletion) under the assumption that the generated test case preserves an equivalent or similar semantic meaning. Typically, these techniques take labeled texts as inputs and output the mutated texts and the corresponding labels.

However, it is still challenging for current testing techniques to produce practical test cases of high quality. Specifically, a tiny modification to a text can change its semantic meaning, which invalidates the common assumption that the semantic meaning of the original text and that of the generated text remain equivalent or similar, and may in turn change the corresponding label [49, 51]. For example, removing "not" from the text "I do not like the movie" changes its semantic meaning and further changes its label for a sentiment analysis task from "negative" to "positive", resulting in a test case with an incorrect label and further a false alarm. Moreover, existing testing approaches cannot guarantee the fluency and naturalness of the generated test cases. Many word-level testing approaches introduce grammar errors and punctuation errors, and sometimes they introduce words that do not exist or are rarely used [49]. Although these test cases may trigger "software errors" (e.g., unexpected software behaviors), it is important to first ensure the quality of the test cases in terms of semantic consistency and naturalness before finding more errors.

arXiv:2205.06439v1 [cs.SE] 13 May 2022

Table 1: Examples of high-quality, inconsistent, and unnatural test cases generated by existing testing techniques on different datasets. NLP tasks include Sentiment Analysis (SA), Natural Language Inference (NLI), and Semantic Equivalence (SE). Mutated words are marked in red.
Original text | Task | Technique | Generated test case | Dataset | Issue
A man under a running shower with shampoo in his hair. ⇒ A man is taking a shower. | NLI | BAE [20] | A man under a running shower with shampoo in his hair. ⇒ A man is taking a bath. | SNLI [3] | None
Ultimately this is a frustrating patchwork. | SA | PSO [81] | Ultimately this is a sparkling patchwork. | MR [55] | Inconsistent
British action wouldn't have mattered. ⇒ British action would have made a big difference. | NLI | BAE [20] | Welsh action wouldn't have mattered. ⇒ British action would have made a big difference. | MNLI [74] | Inconsistent
What are some good topics to be bookmarked on Quora? | SE | Textfooler [33] | What are some good topics to es bookmarked on Quora? | QQP [71] | Unnatural
I went to Danny's this weekend to get an oil change and car wash and I paid for a VIP car wash. | SA | BAE [20] | My gone to work this weekend to do an oil change and car wash and my hired for a VIP car wash. | Yelp [88] | Unnatural

According to our user study, many of the NLP test cases generated by existing approaches are of low quality because of the following two issues: the Inconsistent issue and the Unnatural issue.
These issues can lead to false alarms in testing and unnaturalness in language. In this paper, we say an NLP test case is of high quality if it does not have any of these issues. As shown in Table 1, a high-quality test case preserves the semantics of the original text and reads smoothly. The first Inconsistent case changes the semantics to the opposite while the second one changes the subject. The two Unnatural cases hurt the fluency and naturalness of natural language by introducing either non-existing words or wrong grammar. It is unlikely that these low-quality test cases can contribute to improving NLP software in practice.

Hence, an automatic quality evaluation metric that can help filter out low-quality test cases generated by the existing testing techniques is in high demand. Nevertheless, designing an automatic quality evaluation metric for NLP test cases is highly challenging. First, existing testing criteria are mainly based on coverage metrics, such as code coverage for traditional software [8] and neuron coverage for deep neural networks [57], which cannot be directly leveraged to detect false alarms and evaluate the quality of a natural language test case. Second, general semantic similarity evaluation metrics fail to detect Inconsistent issues under this scenario. Specifically, (1) most of the words in the original text and the generated text are the same while existing metrics evaluate the semantic similarity based on all the words in the text, and thus the impact of the mutated word(s) easily vanishes; (2) a word may have different meanings in different contexts, making it difficult to compare only the mutated word(s). Third, existing work on naturalness evaluation relies on either human evaluation [51] or qualitative analysis (e.g., part-of-speech checking [20]), while we need an automatic and quantitative naturalness evaluation metric.
To address these problems, we introduce AEON, a method for Automatic Evaluation Of NLP test cases. AEON takes a text pair <original text, generated text> as input and outputs scores regarding semantic similarity and syntactic correctness, aiming to detect Inconsistent and Unnatural issues, respectively. We use AEON to analyze the quality of NLP test cases generated by four popular testing techniques [20, 32, 63, 81] on five datasets [3, 55, 71, 74, 88] which cover three typical NLP tasks, namely natural language inference, sentiment analysis, and semantic equivalence. We conduct a comprehensive human evaluation of the semantic similarity and language naturalness between the original texts and the generated test cases, and we check whether AEON's score aligns with human evaluation. The results show that AEON achieves Average Precision (AP), Area Under Curve (AUC), and Pearson Correlation Coefficient (PCC) scores of 0.688, 0.742, and 0.922, outperforming the best baseline metric by 10%, 8.1%, and 7.8%, respectively. On the evaluation of human judgment of language naturalness, AEON also surpasses all baselines and achieves average AP, AUC, and PCC scores of 0.69, 0.63, and 0.82. These results demonstrate the effectiveness of AEON in detecting false alarms and evaluating the language naturalness of NLP test cases. We also show that the high-quality test cases selected by AEON can significantly improve the accuracy and robustness of NLP software via model training. Our contributions can be summarized as:
• We conduct a comprehensive user study on the test cases generated by existing NLP software testing techniques and find that 85% of them suffer from two issues: Inconsistent and Unnatural, resulting in a false alarm rate of 44%.
• We introduce AEON, the first approach to quantitatively evaluate the quality of NLP test cases in terms of semantics and language naturalness, addressing the two main quality issues of NLP test cases mentioned above.
• AEON is employed to evaluate the test cases generated by four testing techniques on five widely-used datasets, which shows that AEON achieves the best performance in terms of average AP, AUC, and PCC on all datasets.
• The implementation of AEON, the raw experimental results, and the human annotation on the test case quality are available on GitHub¹.

Table 2: Details of the selected testing techniques.
Technique | Selection | Substitution | Constraints
Generative Algorithm (GA) [32] | k-nearest neighbors | Random combination | Percentage of modified words; Euclidean distance; LM grammar checking
BERT-base Adversarial Examples (BAE) [20] | Word importance | PLM mask prediction | Euclidean distance; Part-of-speech checking
Particle Swarm Optimization (PSO) [81] | Optimization [35] | Knowledge graph [14] | None
Checklist [63] | Random selection | Transformations: Contraction; Extension; Changing entities | Numbers of transformation

2 PRELIMINARIES
2.1 Testing Techniques for NLP Software
Though many papers have proposed testing techniques for Computer Vision (CV) software (e.g., face recognition systems) [5, 57, 68, 79], the characteristics of natural language distinguish NLP software testing from CV software testing. The most significant difference between NLP test cases and CV test cases is that the input space of textual data is not as continuous as that of images, making every mutation in the original text perceptible. In addition, in natural language, mutating a single word can cause considerable semantic differences, which further leads to the risk of changing the correct label of the text. Therefore, when NLP testing techniques assign the label of the original text to the generated test cases, many false alarms occur.

Current testing techniques² for NLP software can be roughly divided into four categories: character-level, word-level, sentence-level, and multi-level [87].
Character-level techniques [40] mutate a few characters that do not affect human reading comprehension. Word-level techniques [62, 63] are based on word substitution, usually using synonym sets or Pre-trained Language Models (PLMs). Sentence-level techniques [26] change the whole structure of the sentences either by adding a sentence to the original texts or by transforming the entire texts into another semantically similar format. Those combining different levels of techniques [42] can be categorized as multi-level techniques. In particular, word-level techniques significantly outperform others in terms of efficiency [81], applicability, and usefulness in robust training [62, 81]. However, this kind of technique suffers more from low-quality test cases [51]. Thus, we focus on test cases generated by word-level testing techniques.

¹ https://github.com/CUHK-ARISE/AEON
² In this paper, we consider papers on attacking NLP models as a line of research on testing NLP software because the adversarial examples generated by these techniques can be regarded as test cases for NLP software.

From the perspective of combinatorial optimization, generating test cases with word-level techniques can be formulated as a search problem, where we substitute each word in the original text with other words in our vocabulary. The size of the whole search space is the number of words in the original text $N$ (where we substitute) times the vocabulary size $V$ (word candidates). In general, these techniques include diverse modules to prune the search space, which can be classified into three components: target word selection, word substitution, and generation constraints [52, 87]. Table 2 presents the modules of the four selected testing techniques in terms of the three components. A suitable target word selection method can decrease $N$ while a proper word substitution method can cut back $V$.
Constraints are commonly applied to ensure that the synthesized texts preserve semantic meaning and are syntactically correct.

2.2 Problem Definition
Given NLP software $\mathcal{F}: \mathcal{X} \to \mathcal{Y}$, which takes a text $x$ in the text space $\mathcal{X}$ as input and outputs its prediction $y \in \mathcal{Y}$, a word-level test case $\hat{x}$ is synthesized from a seed datum $x$ whose ground truth (i.e., label) is $y$ (denoted $gt(x) = y$). It also needs to satisfy $Sim(\hat{x}, x) \geq c$, where $Sim: \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ is a similarity metric. Intuitively, this means that $\hat{x}$ and $x$ have equivalent or similar semantics. $c$ is a task-specific constant that trades off semantic similarity against generation diversity. Most NLP testing techniques assume that the generated test case and the original text have the same label, i.e., $gt(\hat{x}) = gt(x) = y$.

Given a set of generated test cases $\hat{X}$, the similarity metric used by a testing technique may fail to detect some inconsistencies, in which case Inconsistent test cases occur. A test case $\hat{x} \in \hat{X}$ is Inconsistent when $Sim'(\hat{x}, x) \leq t_s$. Here $t_s$ is a threshold, and $Sim'$ is a trustworthy and robust similarity metric, for example, human judgment. There is a high chance that Inconsistent test cases are false alarms, which satisfy $gt(\hat{x}) \neq gt(x)$ and $\mathcal{F}(\hat{x}) = gt(\hat{x})$. In other words, the software behaves correctly (produces label $gt(\hat{x})$), but we wrongly assume it needs to produce label $gt(x)$.

Whether a test case is Unnatural or not depends on human evaluation. It is defined as $Nat'(\hat{x}) \leq t_n$, where $Nat': \mathcal{X} \to \mathbb{R}$ is human judgment on the language naturalness of the textual test case and $t_n$ is a threshold.
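The false-alarm condition in this section can be read as a small decision rule. The following is a minimal sketch, not the authors' implementation: the `Outcome` type, its field names, and the category strings are hypothetical names introduced only for illustration. It assumes a test case is reported whenever the prediction differs from the label inherited from the seed text.

```python
from dataclasses import dataclass


@dataclass
class Outcome:
    """Hypothetical record for one generated test case."""
    assumed_label: str  # gt(x): label of the seed text, assumed to carry over
    true_label: str     # gt(x_hat): label a trusted judge gives the mutant
    predicted: str      # F(x_hat): output of the software under test


def classify(outcome: Outcome) -> str:
    """Sort a test case into pass / false alarm / genuine error."""
    # The testing technique reports an error when F(x_hat) != gt(x).
    if outcome.predicted == outcome.assumed_label:
        return "pass"
    # False alarm: the mutation changed the label and the software is right.
    label_changed = outcome.true_label != outcome.assumed_label
    if label_changed and outcome.predicted == outcome.true_label:
        return "false alarm"
    # Otherwise the prediction disagrees with the true label as well.
    return "genuine error"


# Removing "not" from "I do not like the movie" flips the true sentiment,
# so a "positive" prediction is correct and the report is a false alarm.
print(classify(Outcome("negative", "positive", "positive")))
```

The sketch makes explicit that a false alarm requires both conditions at once: a changed ground truth and a prediction that matches the new ground truth.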
The task of this paper is to design an automatic evaluation metric that can reflect test case quality in terms of semantic consistency and naturalness, which facilitates the detection of Inconsistent (false alarms) and Unnatural test cases.

3 APPROACHES AND IMPLEMENTATION
This section introduces the details of AEON, whose input is a text pair <original text, generated text> and whose outputs are a semantic score and a syntactic score. AEON consists of two parts: SemEval (Semantic Evaluator), which captures the semantic difference between the input text pair, and SynEval (Syntactic Evaluator), which assesses how likely the generated test case is to be used (i.e., written or typed) by real users. These two components aim to address Inconsistent and Unnatural issues, respectively. In the rest of this section, we introduce the details of the key components of the two evaluators.

3.1 SemEval

Algorithm 1 Algorithm for SemEval
Input: Original text $x$; Generated test case $\hat{x}$
Output: Semantic similarity score $Sim(x, \hat{x}) \in [0, 1]$
 1: $x \leftarrow$ Tokenize($x$)
 2: $\hat{x} \leftarrow$ Tokenize($\hat{x}$)
 3: $diff\_indices \leftarrow$ LevenshteinDistance($x$, $\hat{x}$)
 4: $emb(x) \leftarrow$ GetWordEmbedding($x$)
 5: $emb(\hat{x}) \leftarrow$ GetWordEmbedding($\hat{x}$)
 6: $patch\_sim \leftarrow$ a list
 7: for each $i \in diff\_indices$ do
 8:     $patch\_x \leftarrow emb(x)[i-2 : i+2]$
 9:     $patch\_\hat{x} \leftarrow emb(\hat{x})[i-2 : i+2]$
10:     Append $\frac{patch\_x^{T} \, patch\_\hat{x}}{\|patch\_x\| \cdot \|patch\_\hat{x}\|}$ to $patch\_sim$
11: end for
12: $text\_sim \leftarrow \frac{emb(x)^{T} \, emb(\hat{x})}{\|emb(x)\| \cdot \|emb(\hat{x})\|}$
13: $min\_sim \leftarrow \min(patch\_sim)$
14: $avg\_sim \leftarrow$ average($patch\_sim$)
15: $Sim(x, \hat{x}) \leftarrow \lambda_1 \, min\_sim + \lambda_2 \, avg\_sim + (1 - \lambda_1 - \lambda_2) \, text\_sim$

SemEval aims to solve the two challenges mentioned above. (1) The influence of the mutated position can easily vanish when taking the average, since most words in the original text and the generated test case are the same. (2) Metrics comparing words without contexts can neglect their alternative meanings (i.e., polysemy).
To this end, we propose to combine Levenshtein distance [38] and a sentence embedding model to evaluate the semantic similarity in the NLP testing scenario. The approach, shown in Alg. 1, is surprisingly effective considering its simplicity.

Algorithm 2 Algorithm for SynEval
Input: Generated test case $\hat{x}$
Output: Language naturalness score $Nat(\hat{x}) \in (0, 1]$
 1: $\hat{x} \leftarrow$ Tokenize($\hat{x}$)
 2: $perplexity \leftarrow$ a list
 3: for each $token \in \hat{x}$ do
 4:     $masked\_text \leftarrow$ replace $token$ with [MASK] in $\hat{x}$
 5:     $prob \leftarrow$ GetMaskPrediction($masked\_text$)
 6:     $prob\_token \leftarrow prob[token]$
 7:     Append $prob\_token$ to $perplexity$
 8: end for
 9: $min\_nat \leftarrow \min(perplexity)$
10: $avg\_nat \leftarrow$ average($perplexity$)
11: $Nat(\hat{x}) \leftarrow \phi \, min\_nat + (1 - \phi) \, avg\_nat$

After tokenizing the input texts (lines 1-2), which converts all words and punctuation into individual tokens, SemEval extracts small patches of text where the two inputs differ using Levenshtein distance (line 3). With the help of Levenshtein distance, we can find all mutated positions in linear time. Next, it applies a PLM to obtain the embeddings of all tokens in the two inputs (lines 4-5). Current PLMs [11, 13, 18, 43] can be leveraged to project tokens to the embedding space, so this module can be replaced easily when more powerful PLMs are proposed. Then, for all patches and the whole text, we compute the cosine similarity defined by $\frac{a^{T} b}{\|a\| \cdot \|b\|}$ in lines 7-11 and line 12, respectively. Note that we extract five tokens in total as the patch for a mutation at position $i$, namely $[i-2 : i+2]$. For a mutation at the beginning or the end of a sentence, we extract the first or last three tokens as our patch. Finally, we compute the minimum and the average of all patch similarities (lines 13-14) and combine them with the text similarity using two hyper-parameters, $\lambda_1$ and $\lambda_2$ (line 15). After this convex combination, we obtain the output of SemEval, namely $Sim(x, \hat{x})$. We tackle challenge (1) by considering the minimum and average patch similarities.
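The steps of Alg. 1 can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the released AEON code: Python's `difflib.SequenceMatcher` stands in for the Levenshtein alignment, patch and text embeddings are mean-pooled before taking the cosine (the pooling is not specified in this section), `toy_embed` is a context-free placeholder for a PLM, and the values of `lam1` and `lam2` are arbitrary placeholders rather than the hyper-parameters used in the paper.

```python
import difflib
import math
from typing import Callable, List, Sequence, Tuple

Vector = List[float]
Embedder = Callable[[List[str]], List[Vector]]  # hypothetical: one vector per token


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(p * q for p, q in zip(a, b))
    norm = math.sqrt(sum(p * p for p in a)) * math.sqrt(sum(q * q for q in b))
    return dot / norm if norm else 0.0


def mean_pool(vectors: Sequence[Sequence[float]]) -> Vector:
    # Assumption: a patch (or a text) is summarized by its average token vector.
    return [sum(col) / len(vectors) for col in zip(*vectors)]


def mutated_spans(x: List[str], x_hat: List[str]) -> List[Tuple[int, int, int, int]]:
    # difflib stands in for the Levenshtein alignment (Alg. 1, line 3).
    matcher = difflib.SequenceMatcher(a=x, b=x_hat, autojunk=False)
    return [(i1, i2, j1, j2)
            for tag, i1, i2, j1, j2 in matcher.get_opcodes() if tag != "equal"]


def sem_eval(x: List[str], x_hat: List[str], embed: Embedder,
             lam1: float = 0.4, lam2: float = 0.3, context: int = 2) -> float:
    emb_x, emb_hat = embed(x), embed(x_hat)                      # lines 4-5
    text_sim = cosine(mean_pool(emb_x), mean_pool(emb_hat))      # line 12
    patch_sims = []
    for i1, i2, j1, j2 in mutated_spans(x, x_hat):               # lines 7-11
        # Two context tokens on each side; slices are clipped at the text edges.
        patch_x = emb_x[max(0, i1 - context): i2 + context]
        patch_hat = emb_hat[max(0, j1 - context): j2 + context]
        if patch_x and patch_hat:
            patch_sims.append(cosine(mean_pool(patch_x), mean_pool(patch_hat)))
    if not patch_sims:  # assumption: with no mutation, fall back to text similarity
        return text_sim
    min_sim = min(patch_sims)                                    # line 13
    avg_sim = sum(patch_sims) / len(patch_sims)                  # line 14
    return lam1 * min_sim + lam2 * avg_sim + (1.0 - lam1 - lam2) * text_sim


def toy_embed(tokens: List[str]) -> List[Vector]:
    # Context-free stand-in for a PLM, for illustration only.
    table = {"frustrating": [1.0, 0.0, 0.0], "sparkling": [0.0, 1.0, 0.0]}
    return [table.get(t, [0.0, 0.0, 1.0]) for t in tokens]


original = "Ultimately this is a frustrating patchwork .".split()
mutated = "Ultimately this is a sparkling patchwork .".split()
score = sem_eval(original, mutated, toy_embed)
text_only = cosine(mean_pool(toy_embed(original)), mean_pool(toy_embed(mutated)))
print(score < text_only)
```

Even with the toy embedding, the combined score is lower than the whole-text similarity, because the patch terms concentrate on the mutated region instead of averaging it away.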
For challenge (2), AEON extracts the mutated position along with its context, which can improve its ability to understand semantics. Consider an example:

Case Study 1
Task/Dataset | SE/QQP
Technique | BAE
Original Text | Is it OK to leave an iPhone plugged into the charger after 100% charged?
Generated Text | Is it OK to leave an iPhone plugged into the charger after 100% indicted?

If we only consider the mutated words "charged" and "indicted", the similarity is high since they are synonyms in the meaning of "being accused". However, "charged" here means "to put electricity into an electrical device". This kind of relationship can be captured by its context, which is modeled in the PLMs.

3.2 SynEval
Since synthesized test cases may include grammar errors or punctuation errors, or contain rarely used words and phrases, it is vital to use an automatic and quantitative metric to filter out these Unnatural test cases. Note that such sentences rarely appear in real-world natural language, hence they are treated as noise and ignored during the training process of PLMs [13, 37, 43]. Intuitively, how natural a sentence is can be reflected by the probability that the sentence follows the same distribution as the PLM's training data, which can be estimated by PLMs. Therefore, SynEval is designed to measure naturalness through the perplexity of PLMs. Perplexity, in its formal definition, is the exponential form of the cross entropy of the given sentence [31], having the form of:

$$perplexity(x) = \sqrt[N]{\prod_{i=1}^{N} \frac{1}{P(x_i \mid x_{1:i-1})}}, \qquad (1)$$

where $x_i$ is the $i$-th word in the sentence and $x_{1:i-1}$ denotes the first to the $(i-1)$-th words in the sentence. $perplexity: \mathcal{X} \to [1, \infty)$ measures how confused the PLM is when it sees $x_i$ given $x_{1:i-1}$; the greater the value, the more confused (i.e., worse).

The recently proposed BERT-like models, including BERT [13], RoBERTa [43], and ALBERT [37], which are trained on billions of sentences, are powerful PLMs for modeling this probability. However, BERT and its variants are bi-directional, taking not only $x_{1:i-1}$ but also $x_{i+1:N}$ as input. Therefore, we need to replace $x_{1:i-1}$ with $x_{\backslash i}$ in Eq. 1, where $x_{\backslash i}$ denotes the input sentence with its $i$-th word replaced by [MASK]. Since our semantic evaluator outputs similarity scores in $(0, 1]$ (the greater, the more similar), we adopt $\sqrt[N]{\prod_{i=1}^{N} P(x_i \mid x_{\backslash i})}$ for SynEval, which has the same value range of $(0, 1]$ (the greater, the better).

Table 3: Details of the selected datasets. NLP tasks include Sentiment Analysis (SA), Natural Language Inference (NLI), and Semantic Equivalence (SE).
Dataset | Task | Classes | Description
Rotten Tomatoes Movie Review (MR) [55] | SA | 2 | Short sentences or phrases of movie reviews
Yelp Restaurant Review (Yelp) [88] | SA | 2 | Long sentences or paragraphs of restaurant reviews
Stanford Natural Language Inference (SNLI) [3] | NLI | 3 | Short texts with simple contexts
Multi-Natural Language Inference (MNLI) [74] | NLI | 3 | Multi-genre, multi-length texts with complicated contexts
Quora Question Pairs (QQP) [71] | SE | 2 | Two similar questions from Quora

Alg. 2 illustrates the implementation of SynEval. First, we tokenize the input (line 1). Then, for each token in the input, we replace it with the special token [MASK] (line 4). Feeding the masked text to the PLM, we obtain the prediction for the masked position, which is a probability distribution over the entire vocabulary (line 5).
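As a minimal sketch of the scoring step only, the code below assumes the per-token probabilities $P(x_i \mid x_{\backslash i})$ have already been obtained from a masked language model and are passed in as a plain list. The function names and the default `phi` are hypothetical placeholders, not values taken from the paper. It shows the geometric-mean score adopted above, its reciprocal relation to the perplexity of Eq. 1, and the convex combination of minimum and average used in Alg. 2 (lines 9-11).

```python
import math
from typing import Sequence


def geometric_mean_score(token_probs: Sequence[float]) -> float:
    """N-th root of the product of the per-token probabilities, in log space."""
    if not token_probs or min(token_probs) <= 0.0:
        raise ValueError("probabilities must be non-empty and strictly positive")
    log_sum = sum(math.log(p) for p in token_probs)
    return math.exp(log_sum / len(token_probs))


def pseudo_perplexity(token_probs: Sequence[float]) -> float:
    """Eq. 1 with masked-token probabilities: the reciprocal of the score above."""
    return 1.0 / geometric_mean_score(token_probs)


def syn_eval(token_probs: Sequence[float], phi: float = 0.5) -> float:
    """Convex combination of the minimum and average probability (Alg. 2)."""
    min_nat = min(token_probs)
    avg_nat = sum(token_probs) / len(token_probs)
    return phi * min_nat + (1.0 - phi) * avg_nat


# Made-up probabilities: one very unlikely token drags both scores down.
natural = [0.9, 0.8, 0.9]
unnatural = [0.9, 0.9, 0.01]
print(syn_eval(natural) > syn_eval(unnatural))
print(geometric_mean_score(natural) > geometric_mean_score(unnatural))
```

Working in log space avoids underflow for long inputs, and the minimum term makes the score sensitive to a single implausible token even when the rest of the sentence is fluent.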
Next, we find the probability that the PLM assigns to filling the masked position with the original token and record it as the perplexity of this token (lines 6-7). Finally, we compute the minimum and the average of all perplexities and combine them using a hyper-parameter $\phi$. The score after this convex combination is the output of SynEval, namely $Nat(\hat{x})$.

4 EXPERIMENTAL DESIGN AND SETTINGS
In this paper, we focus on the following four research questions:
RQ1: What is the quality of the test cases generated by existing testing techniques? (Section 5.1)
RQ2: How effective is AEON? (Section 5.2)
RQ3: How can AEON help in testing NLP software? (Section 5.3)
RQ4: How can AEON help in improving NLP models? (Section 5.4)

4.1 Testing NLP Software
To answer the RQs, the first step is generating test cases, i.e., testing NLP software. We choose to test the APIs provided by Hugging Face Inc.³, the largest NLP open-source community, on five widely-used datasets across three typical tasks: sentiment analysis, natural language inference, and semantic equivalence.

Datasets. Sentiment analysis aims at classifying the polarity (either positive or negative) of the sentiment of given texts. The inputs of natural language inference tasks are two pieces of text, namely the Premise and the Hypothesis, and the target is to predict whether the Hypothesis is a contradiction of, an entailment of, or neutral to the given Premise. If the Premise can infer the Hypothesis, the output is entailment; if the Premise can infer NOT Hypothesis, the output is contradiction; otherwise, the output is neutral. The inputs of semantic equivalence tasks are two pieces of text, namely question 1 and question 2, and the objective is to judge whether the meaning of the two given questions is equivalent. We select five datasets, namely MR, Yelp, SNLI, MNLI, and QQP, for our experiments, whose details are shown in Table 3.
MR and Yelp are crawled from the internet, so the data contain noise such as HTML tags, HTML encodings, HTML entity names, and hyperlinks, which makes the generated test cases hard to read. To eliminate the influence of noisy data in our human evaluation, we convert HTML texts to plain texts and remove hyperlinks using regular expressions.

Testing. To be more specific, we choose five BERT-based APIs⁴ for the five different datasets. According to the statistics given by Hugging Face Inc., these APIs are downloaded more than 30k times every month on average. Using the testing techniques described in Table 2, as implemented by TextAttack [52] with their default settings, we generate test cases for all datasets (APIs). We select 400 original texts for each dataset using each technique, resulting in 8,000 test cases. After testing the APIs with our test cases, 3,262 test cases (40.8%) are reported as software errors.

³ https://huggingface.co/
⁴ https://huggingface.co/textattack/bert-base-uncased-rotten-tomatoes
  https://huggingface.co/textattack/bert-base-uncased-yelp-polarity
  https://huggingface.co/textattack/bert-base-uncased-MNLI
  https://huggingface.co/textattack/bert-base-uncased-snli
  https://huggingface.co/textattack/bert-base-uncased-QQP

4.2 Human Evaluation
We aim to find out whether the reported cases really trigger erroneous behaviors of NLP software, in other words, whether they are false alarms. To this end, we design and launch a user study.

Design. Following [6], we propose a unified framework to measure the quality of generated test cases. The quality is defined from four perspectives, including Naturalness, Consistency, Human Label, and Difficulty:
• Consistency: From "1 strongly disagree" to "5 strongly agree", how much do you think the two sentences have the same meaning? Consistency quantifies the semantic similarity between the original text and the changed text.
• Naturalness: From "1 very bad" to "5 very good", how fluent and natural do you think this sentence is? Naturalness measures the fluency and grammar of the examples, including grammar errors, punctuation errors, and spelling errors (unrecognizable words).
• Human label: Ask humans to do the tasks of the given datasets. It is a task-specific question and records the human judgment of classification answers.
• Difficulty: From "1 very easy" to "5 very hard", how difficult is it for you to make the decision? Difficulty reflects how difficult the task is for humans.

Based on our definition in Sec. 1, high-quality test cases should have high naturalness and consistency scores. Human label and difficulty are used to classify the human evaluation results. We also ask annotators whether these test cases have other problems/issues that we have not identified. The responses show that the Inconsistent issue and the Unnatural issue cover all their concerns.

Table 4: Details of the selected baselines.
Baseline | Description
NC-based |
NC [57] | The ratio of activated neurons
NBC [44] | Activation outside upper/lower bounds
TKNC [44] | The ratio of top-$k$ activated neurons
BKNC [79] | The ratio of bottom-$k$ activated neurons
NLP-based |
BLEU [56] | The overlaps of $n$-grams
Meteor [12] | $n$-grams with synonyms in WordNet [50]
InferSent [11] | BiLSTM-based embedding model
SBERT [61] | BERT-based embedding model
SimCSE [18] | Embedding model with contrastive learning
BERTScore [86] | Token matching in BERT embedding space

Crowdsourcing. We distribute our questionnaire on Qualtrics⁵, a platform to design, share, and collect questionnaires. We recruit crowd workers on Prolific⁶, a platform to post tasks and hire workers.
Since our questions require a high level of reading comprehension and inference skills in English, we require Prolific workers to have a bachelor's degree or above and to have English as their first and most fluent language. Since we focus on false alarms, we randomly sample 100 test cases per dataset that are reported as software errors for human evaluation. In total, we choose 500 test cases and generate 2,000 questions. For each question, we ask three workers to give their judgment to reduce the variance. Therefore, we ask 150 workers to complete all questionnaires. It takes each worker 15-25 minutes to answer around 40 questions in a questionnaire, and each worker is paid about 5 pounds per hour. The total cost is 300 pounds.

⁵ https://www.qualtrics.com/
⁶ https://prolific.co/

4.3 Baselines
We select diverse test case evaluation metrics as baselines from two categories: Neuron Coverage (NC) metrics and NLP-based metrics, which are summarized in Table 4.

NC-based. NC and its variants are commonly used for evaluating test cases. Different from AEON, NC-based metrics mainly aim at the evaluation of a test set instead of a single test case. In our experiments, we consider NC-based metrics in two ways. (1) For basic Neuron Coverage (NC) [57] and Neuron Boundary Coverage (NBC) [44], we calculate the NC scores of each generated test case. (2) Top-$k$ Neuron Coverage (TKNC) [44] and Bottom-$k$ Neuron Coverage (BKNC) [79] cannot be adapted to a single test case (e.g., TKNC produces the same coverage for all single test cases), thus we compute the number of neurons covered by the generated test case but not by the original text. Intuitively, changes in texts may be reflected in neuron activation. Note that the comparison with NC-based metrics is not apples-to-apples because NC-based metrics mainly evaluate the quality of a test set. We include the comparison here for the completeness of our discussion.

NLP-based.
Since the main reason behind false alarms is that the generated test cases cannot keep equivalent or similar semantic meaning with the original text, we include multiple semantic similarity metrics as baselines for SemEval. Evaluating the semantic similarity of texts has long been a complex problem in NLP research. Previous metrics can be divided into corpus-based, knowledge-based, and DNN-based. DNN-based metrics outperform other methods and have served as a breakthrough in semantic similarity research [7]. We consider a corpus-based metric, BLEU; a knowledge-based metric, Meteor; and four DNN-based metrics, InferSent, SBERT, SimCSE, and BERTScore. For embedding models, namely InferSent, SBERT, and SimCSE, we report the semantic similarity based on cosine similarity because it is used by most researchers [18, 20, 86] and Euclidean distance yields similar results in all our experiments.

4.4 Evaluation Criteria
We compute three criteria, AP, AUC, and PCC, to measure the correlation between human judgment (Sec. 4.2) and the automatic evaluation metrics, including AEON. We treat the scoring systems as binary classification systems and the human judgment as ground truth, and draw their Precision-Recall curve (P-R curve) and Receiver Operating Characteristic curve (ROC curve) to calculate AP and AUC. The P-R curve shows the trade-off between recall (i.e., true positive rate) and precision, while the ROC curve depicts the trade-off between true positive rate and false positive rate. AP and AUC represent the area under the P-R curve and the ROC curve, respectively. An excellent binary classification system tends to have high AP and AUC scores. Then, we check whether our scores are correlated with human judgment using PCC, the covariance of two variables divided by the product of their standard deviations, which can be written in the form of:

$$PCC(X, Y) = \frac{Cov(X, Y)}{\sqrt{Var(X) \cdot Var(Y)}} \tag{2}$$

PCC shows how linearly correlated two variables are. Note that a negative PCC value indicates that the two variables are negatively correlated.

AEON: A Method for Automatic Evaluation of NLP Test Cases ISSTA 2022, 18-22 July, 2022, Daejeon, South Korea

Figure 1: Venn diagram of the proportion of each vulnerability category (better viewed in colored mode). Inconsistent: 71%; Unnatural: 57%; False Alarms: 44%; High-Quality: 15%.

5 EXPERIMENTAL RESULTS
5.1 RQ1: The Quality of Test Cases
We average the consistency, naturalness, and difficulty scores as the respective final scores. We use the label that most workers agree on as the final human label. For each test case, we decide whether it is an Inconsistent case or an Unnatural case based on the consistency and naturalness scores. Finally, if the human label differs from the given label, the test case is considered a false alarm.

To show how severe the problem in NLP test case quality is, we draw the Venn diagram of the generated test cases based on the human annotation results. As shown in Fig. 1, 44% of them change the label and thus are false alarms. In other words, there are only 1,435 cases triggering software errors in all 8,000 test cases. 57% of them are not natural enough, while 71% fail to preserve the semantic meaning. Only 15% of them have good language naturalness and preserve the semantic meaning, which are counted as high-quality test cases.

Besides the statistical information, we have two more observations. First, though the majority of Inconsistent cases are false alarms, a few Inconsistent cases do not change the label. These test cases only account for 11%, and the rest can be categorized into the two issues. Second, bad naturalness can sometimes hurt semantic meaning, resulting in test cases that are both Inconsistent and Unnatural. This is because the unnatural part can eliminate some key information in texts and further change the semantics.
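The aggregation of worker answers described at the start of this subsection can be sketched as follows. This is a minimal illustration, not the authors' script: the dictionary keys, the function name `aggregate`, and the cut-off of 3 on the 1-5 scale for flagging Inconsistent and Unnatural cases are assumptions made for the example.

```python
from collections import Counter
from statistics import mean

# Assumption: an averaged score below 3 on the 1-5 scale flags an issue.
ISSUE_THRESHOLD = 3.0


def aggregate(annotations, given_label):
    """Combine the answers of several workers for one test case.

    `annotations` is a list of dicts with the hypothetical keys
    'consistency', 'naturalness', 'difficulty', and 'label'.
    """
    consistency = mean(a["consistency"] for a in annotations)
    naturalness = mean(a["naturalness"] for a in annotations)
    difficulty = mean(a["difficulty"] for a in annotations)
    # Majority vote; on a tie, the label seen first wins.
    human_label = Counter(a["label"] for a in annotations).most_common(1)[0][0]
    return {
        "consistency": consistency,
        "naturalness": naturalness,
        "difficulty": difficulty,
        "human_label": human_label,
        "inconsistent": consistency < ISSUE_THRESHOLD,
        "unnatural": naturalness < ISSUE_THRESHOLD,
        # A false alarm is a test case whose human label differs from the given label.
        "false_alarm": human_label != given_label,
    }


workers = [
    {"consistency": 2, "naturalness": 4, "difficulty": 1, "label": "negative"},
    {"consistency": 1, "naturalness": 5, "difficulty": 2, "label": "negative"},
    {"consistency": 3, "naturalness": 4, "difficulty": 2, "label": "positive"},
]
result = aggregate(workers, given_label="positive")
print(result["human_label"], result["inconsistent"], result["false_alarm"])
```

With three workers and a binary task the majority vote is always well defined; for three-class tasks a tie-breaking rule would have to be chosen explicitly.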
Answer to RQ1: The quality of NLP test cases cannot be guaranteed by existing testing techniques. 71% and 57% of test cases generated by existing NLP testing techniques are Inconsistent and Unnatural, respectively. 44% of test cases are false alarms, significantly degrading the effectiveness and efficiency of existing testing techniques.

5.2 RQ2: The Effectiveness of AEON
Since AEON is designed to evaluate the semantic similarity and language naturalness of NLP software test cases, we assess the two modules, SemEval and SynEval, to validate the effectiveness of our approach. We use default settings for all baselines, and we select 𝑘 = 192 (one-fourth of neurons in each layer) for TKNC and BKNC. We set 𝜆1 = 0.1, 𝜆2 = 0.2 for SemEval, and 𝜙 = 0.6 for SynEval.

5.2.1 SemEval. We draw P-R curves and ROC curves for the semantic scores calculated by SemEval as well as the other baselines mentioned in Sec. 4.3, using consistency from human evaluation as ground truth. Then we compute AP and AUC scores, which are shown in Table 5. Our method achieves higher AP and AUC values averaged on all datasets than all baselines, showing the strong ability of SemEval to filter out Inconsistent cases. The results also validate the effectiveness of SemEval in capturing subtle semantic changes. We calculate PCC between the semantic score and human-annotated consistency for each method. As shown in Table 5, our approach achieves about 0.92 PCC on average, which significantly outperforms all the baselines. This shows that the score of SemEval aligns well with human evaluation.

NC-based metrics achieve decent performance and surpass many NLP-based metrics, especially on MNLI and QQP datasets, indicating that neuron activation patterns can reflect text semantic changes. As for NLP-based metrics, BLEU and Meteor perform the worst since they cannot handle highly overlapped texts.
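To make the overlap argument concrete, the sketch below computes clipped 𝑛-gram precision, the quantity that BLEU is built on, for a sentence pair that differs by a single inserted word. It is not the full BLEU metric (no brevity penalty, no geometric mean, no smoothing), and the lowercase whitespace tokenization and the function names are assumptions made for the example.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return the list of n-grams (as tuples) in a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def clipped_precision(reference, candidate, n):
    """Fraction of candidate n-grams that also occur in the reference.

    Each n-gram is credited at most as many times as it appears in the reference.
    """
    ref_counts = Counter(ngrams(reference, n))
    cand_counts = Counter(ngrams(candidate, n))
    matched = sum(min(count, ref_counts[gram]) for gram, count in cand_counts.items())
    total = sum(cand_counts.values())
    return matched / total if total else 0.0


# Assumption: punctuation removed and text lowercased before splitting.
original = "i do like the movie though i did not watch it at cinema".split()
generated = "i do not like the movie though i did not watch it at cinema".split()

p1 = clipped_precision(original, generated, 1)
p2 = clipped_precision(original, generated, 2)
print(f"unigram precision: {p1:.3f}, bigram precision: {p2:.3f}")
# prints "unigram precision: 0.929, bigram precision: 0.846"
```

Although the inserted word reverses the sentiment, 13 of 14 unigrams and 11 of 13 bigrams still match, so an overlap-based score stays high.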
The BLEU and Meteor scores for text pairs are always high since most of the words in the original texts and the generated test cases are the same. DNN-based metrics cannot perform well for three main reasons. (1) Word embeddings usually lack semantic information. For instance, the embeddings of [reject] and [accept] calculated by BERT [13] have a high cosine similarity of 0.846, while such a word substitution changes the correct label in sentiment analysis. Another example is that the cosine similarity between embeddings of [Tom] and [Jack] is 0.978, which hurts in natural language inference tasks. (2) Baselines that employ token matching [86] are prone to mistakenly matching multiple words to a single word. Consider the following example:

Illustrative Example 1
Original Text: I do like the movie, though I did not watch it at cinema.
Generated Text: I do not like the movie, though I did not watch it at cinema.

The inserted [not] in the generated text can be matched to the [not] that already appears in the original sentence, resulting in a high similarity score. (3) Models based on contrastive learning fail due to the lack of data with subtle differences in their training set. These models are mainly trained on natural language inference datasets [11, 18], which can hardly cover the cases where two sentences have few but vital differences.

Table 5: AP, AUC, and PCC results that show how well the automatic metrics align with human-annotated consistency scores.

| Category | Metric | MR AP | MR AUC | MR PCC | Yelp AP | Yelp AUC | Yelp PCC | SNLI AP | SNLI AUC | SNLI PCC | MNLI AP | MNLI AUC | MNLI PCC | QQP AP | QQP AUC | QQP PCC |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| NC-based | NC | 0.76 | 0.55 | 0.73 | 0.29 | 0.36 | -0.70 | 0.56 | 0.55 | 0.65 | 0.31 | 0.54 | 0.50 | 0.44 | 0.52 | 0.47 |
| NC-based | NBC | 0.78 | 0.66 | 0.85 | 0.51 | 0.62 | 0.85 | 0.44 | 0.47 | 0.23 | 0.35 | 0.56 | 0.90 | 0.38 | 0.56 | 0.78 |
| NC-based | TKNC | 0.88 | 0.73 | 0.88 | 0.56 | 0.64 | 0.88 | 0.58 | 0.57 | 0.65 | 0.47 | 0.66 | 0.87 | 0.67 | 0.76 | 0.91 |
| NC-based | BKNC | 0.88 | 0.73 | 0.86 | 0.55 | 0.63 | 0.85 | 0.58 | 0.58 | 0.67 | 0.48 | 0.67 | 0.87 | 0.64 | 0.75 | 0.88 |
| NLP-based | BLEU | 0.75 | 0.58 | 0.75 | 0.45 | 0.57 | 0.45 | 0.45 | 0.42 | -0.36 | 0.34 | 0.50 | 0.44 | 0.44 | 0.54 | 0.44 |
| NLP-based | Meteor | 0.74 | 0.54 | 0.73 | 0.43 | 0.57 | 0.41 | 0.51 | 0.50 | -0.18 | 0.46 | 0.59 | 0.80 | 0.53 | 0.62 | 0.81 |
| NLP-based | InferSent | 0.81 | 0.68 | 0.96 | 0.55 | 0.67 | 0.88 | 0.55 | 0.61 | 0.55 | 0.40 | 0.61 | 0.93 | 0.48 | 0.59 | 0.83 |
| NLP-based | SentBERT | 0.85 | 0.71 | 0.92 | 0.53 | 0.63 | 0.83 | 0.65 | 0.65 | 0.66 | 0.46 | 0.63 | 0.74 | 0.52 | 0.59 | 0.76 |
| NLP-based | SimCSE | 0.85 | 0.75 | 0.93 | 0.63 | 0.76 | 0.92 | 0.61 | 0.65 | 0.76 | 0.54 | 0.73 | 0.96 | 0.50 | 0.56 | 0.68 |
| NLP-based | BERTScore | 0.75 | 0.61 | 0.91 | 0.44 | 0.53 | 0.37 | 0.59 | 0.60 | 0.20 | 0.34 | 0.58 | 0.73 | 0.46 | 0.59 | 0.67 |
| | AEON (Ours) | 0.92 | 0.82 | 0.93 | 0.75 | 0.84 | 0.91 | 0.66 | 0.70 | 0.85 | 0.58 | 0.75 | 0.98 | 0.56 | 0.63 | 0.90 |

Table 6: AP, AUC, and PCC results that show how well the automatic metrics align with human-annotated naturalness scores.

| Metric | MR AP | MR AUC | MR PCC | Yelp AP | Yelp AUC | Yelp PCC | SNLI AP | SNLI AUC | SNLI PCC | MNLI AP | MNLI AUC | MNLI PCC | QQP AP | QQP AUC | QQP PCC |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| NC | 0.66 | 0.56 | 0.53 | 0.64 | 0.41 | -0.70 | 0.45 | 0.54 | -0.09 | 0.56 | 0.51 | 0.27 | 0.56 | 0.52 | 0.03 |
| NBC | 0.72 | 0.57 | 0.32 | 0.70 | 0.47 | 0.51 | 0.32 | 0.52 | 0.66 | 0.43 | 0.36 | -0.91 | 0.52 | 0.47 | -0.61 |
| TKNC | 0.68 | 0.56 | 0.57 | 0.67 | 0.53 | 0.49 | 0.36 | 0.51 | 0.09 | 0.55 | 0.47 | -0.33 | 0.65 | 0.57 | 0.51 |
| BKNC | 0.67 | 0.55 | 0.52 | 0.66 | 0.52 | 0.45 | 0.36 | 0.51 | 0.09 | 0.55 | 0.47 | -0.39 | 0.64 | 0.55 | 0.46 |
| AEON (Ours) | 0.87 | 0.59 | 0.66 | 0.77 | 0.58 | 0.84 | 0.52 | 0.63 | 0.75 | 0.54 | 0.65 | 0.98 | 0.73 | 0.68 | 0.87 |

Considering different datasets, in sentiment analysis tasks, AEON and all other baselines perform better on MR than on Yelp since the texts in MR are shorter and simpler. For natural language inference tasks, though MNLI is more complicated than SNLI, it is surprising to observe that SNLI has lower PCC scores than MNLI. There are negative PCC scores when using BLEU and Meteor, indicating a negative correlation between these baselines and human evaluation.
We think the reason behind this is that the complexity of MNLI lies in the diversity of contexts, and changes in contexts typically will not change the corresponding labels.

5.2.2 SynEval. To evaluate the performance of SynEval, we draw P-R curves and ROC curves and compute AP and AUC, treating SynEval as a binary classifier to recognize Unnatural cases. We also calculate PCC between SynEval and the naturalness score from human evaluation, which is included in Table 6. We can observe that though NC-based metrics perform well at detecting Inconsistent test cases, they fall short in measuring language naturalness. The performance varies significantly on different datasets. The PCC scores of NBC, TKNC, and BKNC show a negative correlation on MNLI, while positively correlated on other datasets. We infer the reason behind this may be that the models make the decision based on the appearance of certain words or phrases, ignoring whether the input texts have good language naturalness. In addition to BERT [13], whose results are presented in Table 6, we also try other language models in SynEval, including RoBERTa [43] and ALBERT [37]; among these, BERT achieves the highest AUC and AP values, averaged on all datasets. Note that traditional grammar checkers are not suitable for this task because they do not provide quantitative results, and they cannot reveal the error-free yet strange sentences that people rarely write.

The impact of the hyperparameters. If we vary the proportion of the minimal semantic score from 0 to 1, we can observe that the performance increases at first, then remains stable at the same level, and finally drops when it gets close to 1. We balance this trade-off using a grid search over the 𝜆 and 𝜙 values. These parameters can be generalized to other datasets and NLP tasks since we adopt the same parameters and consistently achieve good performance for all selected datasets in our experiments.
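The evaluation protocol of Sec. 4.4, which is applied to both SemEval and SynEval, can be sketched in plain Python as below. This is an illustrative sketch rather than the authors' evaluation code: the function names, the toy scores, and the choice of treating a human score of at least 3 as the positive class are assumptions. AP is computed in its usual non-interpolated form (mean precision at the rank of each positive item, assuming no tied scores), and AUC through its rank interpretation.

```python
from math import sqrt


def pcc(x, y):
    """Pearson correlation, Cov(X, Y) / sqrt(Var(X) * Var(Y)).

    The 1/n factors of covariance and variances cancel, so plain sums are used.
    """
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def roc_auc(scores, labels):
    """Area under the ROC curve: the probability that a random positive item
    is scored above a random negative item (ties count as one half)."""
    pos = [s for s, l in zip(scores, labels) if l == 1]
    neg = [s for s, l in zip(scores, labels) if l == 0]
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


def average_precision(scores, labels):
    """Area under the P-R curve, as the mean precision at each positive item."""
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    hits, total = 0, 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            total += hits / rank
    return total / hits


# Toy data (hypothetical): metric scores and averaged human scores per test case.
scores = [0.95, 0.80, 0.60, 0.85, 0.40]
human = [4.7, 4.0, 2.3, 2.0, 1.3]
labels = [1 if h >= 3 else 0 for h in human]  # assumption: >= 3 is positive

print(average_precision(scores, labels), roc_auc(scores, labels), pcc(scores, human))
```

In practice a library such as scikit-learn or SciPy would normally be used for these quantities; the pure-Python version only makes the definitions explicit.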
We also test different patch lengths 𝑙 for extracting [𝑖 − 𝑙 : 𝑖 + 𝑙]. In particular, 𝑙 = 1 does not work well because most NLP models use BPE (Byte Pair Encoding) [65] for tokenization, which may divide a word into smaller tokens, so that a patch may contain only part of a word. Long patches (e.g., 𝑙 ≥ 5) suffer from the same problem as average scores, i.e., the impact of mutation vanishes after averaging. In our experiments, 𝑙 = 3 and 𝑙 = 4 lead to similar results. To reduce computation cost, we select 𝑙 = 4 for this parameter.

Answer to RQ2: AEON, which consists of SemEval and SynEval, is effective in terms of detecting test cases that change the label (false alarms) and test cases that are unnatural. AEON outperforms all baselines on average over all datasets.

5.3 RQ3: Test Case Selection Using AEON
This paper aims to propose a metric that facilitates NLP software testing by evaluating the quality of test cases. In this section, we utilize AEON to filter out low-quality test cases. We conduct experiments to verify whether the test cases selected by AEON enjoy better semantic consistency and language naturalness. Specifically, AEON can be utilized to filter out Inconsistent and Unnatural test cases to improve the quality of test cases on average.

Table 7: The quality of test cases without and with AEON.

| Data Source | Consistency | Naturalness | False Alarms |
|---|---|---|---|
| w/o AEON | 2.627 | 2.916 | 0.440 |
| w/ AEON | 3.357 | 3.305 | 0.262 |
| Improvement | ↑ 27.8% | ↑ 13.3% | ↓ 40.6% |

For SemEval, we set different thresholds for different tasks. We choose multiple thresholds for the semantic similarity score because whether the label will change depends on the given task. Consider this pair of texts (original and generated), which is inconsistent in semantics:

Illustrative Example 2
Original Text: I watched the movie at home, it was nice.
Generated Text: I watched the movie at cinema, it was nice.
If they are in a sentiment analysis dataset, the label remains unchanged (i.e., positive). However, if they appear in a natural language inference dataset as premises, and the hypothesis is “I went out for the movie”, the label changes from contradiction to entailment. Therefore, to best filter out those false alarms, we set thresholds as 0.87, 0.90, and 0.91 for sentiment analysis, natural language inference, and semantic equivalence, respectively. The thresholds are computed with a balance between true positive rate and false positive rate. From the thresholds, we can see that the three tasks require increasingly high semantic similarity to ensure the preservation of labels, which aligns with the characteristics of the datasets. For language naturalness, we set the threshold as 0.21.

We generate 500 test cases that are reported to trigger some software errors, including various datasets and testing techniques mentioned in Table 3 and Table 2 respectively. Then we check whether human evaluation has improved before and after applying AEON to select high-quality test cases. The results are shown in Table 7. The average consistency and naturalness scores of the 500 test cases are 2.627 and 2.916, which are below 3 (considered as Inconsistent and Unnatural test cases) on average. The false alarm rate is 0.44. After selecting test cases whose SemEval and SynEval scores are above the thresholds with the help of AEON, the quality of test cases is significantly enhanced. The scores increase to 3.357 and 3.305, considered high-quality test cases on average. The consistency score improves by 27.8%, and the naturalness score improves by 13.3%. The false alarm rate is 26.2%, showing a significant improvement of 40.6%. The results demonstrate the effectiveness of AEON on high-quality test case selection.

5.3.1 Case Study. We choose one of the generated test cases as an example to illustrate the performance of our SemEval and SynEval compared to other baselines.
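Before turning to the example, the selection rule of Sec. 5.3 can be sketched as a simple two-threshold filter. The threshold values are the ones reported above; the task keys, the record fields, the function names, and the toy scores are hypothetical and only serve the illustration.

```python
# Thresholds reported in the text; the dictionary keys are hypothetical names.
SEMANTIC_THRESHOLDS = {
    "sentiment_analysis": 0.87,
    "natural_language_inference": 0.90,
    "semantic_equivalence": 0.91,
}
NATURALNESS_THRESHOLD = 0.21


def keep_test_case(task, semantic_score, syntactic_score):
    """Keep a test case only if both scores are above their thresholds."""
    return (semantic_score > SEMANTIC_THRESHOLDS[task]
            and syntactic_score > NATURALNESS_THRESHOLD)


def select(test_cases):
    """Return the test cases that pass both the semantic and the syntactic check."""
    return [case for case in test_cases
            if keep_test_case(case["task"], case["semantic"], case["syntactic"])]


# Toy records (hypothetical scores).
cases = [
    {"id": "a", "task": "sentiment_analysis", "semantic": 0.58, "syntactic": 0.22},
    {"id": "b", "task": "natural_language_inference", "semantic": 0.95, "syntactic": 0.30},
    {"id": "c", "task": "semantic_equivalence", "semantic": 0.93, "syntactic": 0.10},
]
print([case["id"] for case in select(cases)])  # prints ['b']
```

Case "a" is dropped for low semantic similarity and case "c" for low naturalness, so a test case must pass both checks to be selected.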
Case Study 2
Task/Dataset: SA/MR
Technique: PSO
Original Text: The result is a powerful, naturally dramatic piece of low-budget filmmaking.
Generated Text: The result is a terrible, naturally dramatic piece of low-budget filmmaking.

AEON achieves a semantic score of 0.58 and a syntactic score of 0.22. From the semantic side, the sentiment of the original example is positive. However, the sentiment of the generated test case is negative because [terrible] is a negative adjective. This test case is not only Inconsistent but also a false alarm since the label of this example is changed. Therefore, an excellent semantic metric should give this test case a low score to filter it out. Our method, SemEval, gives the text pair a score of 0.58, which is far below the threshold of 0.87, indicating that the test case cannot preserve the semantics and should be filtered out. AEON works effectively because our design to consider patch similarity identifies that the substitution ([powerful]→[terrible]) dramatically changes the semantic meaning. From the syntactic side, the generated test case reads smoothly and its meaning is easy to comprehend, suggesting that it has good language naturalness. The case obtains a score of 0.22 from SynEval, which is above the threshold of 0.21 and indicates that the test case is not an Unnatural case. All in all, our method outperforms other baselines from both the semantic and syntactic perspectives on this example.

Answer to RQ3: AEON can effectively filter out low-quality test cases. The remaining test cases enjoy better semantic consistency and language naturalness and, most importantly, a lower false alarm rate. Thus, AEON can facilitate NLP software testing by selecting high-quality test cases, saving developers’ time.

5.4 RQ4: Improving NLP Software with AEON
Although NLP software testing is a promising research direction, it raises an important yet unavoidable question: can the test cases be utilized to improve NLP software?
To further show how high-quality test cases selected by AEON can help in improving NLP software, we add test cases that the model misclassifies to the training set and conduct model re-training. Accuracy is verified on the test set of the given task, while robustness is evaluated using the success rate of adversarial attacks. In this section, we...