This article introduces MME-CRS, the champion method of the open-domain dialogue evaluation track of the international DSTC10 competition. The method designs a variety of evaluation metrics and uses a correlation re-scaling algorithm to integrate the scores of the different metrics, providing a reference for designing more effective evaluation metrics in the dialogue evaluation field. The related work has also been published at an AAAI 2022 Workshop. We hope it offers some inspiration or help to readers working in this area.
1 Background
The Dialog System Technology Challenge (DSTC) was launched in 2013 by scientists from Microsoft and Carnegie Mellon University and has become one of the most authoritative and influential competitions in the dialogue field. This year the challenge reached its tenth edition (DSTC10), attracting world-renowned companies, top universities and research institutes such as Microsoft, Amazon, Carnegie Mellon University, Facebook, Mitsubishi Electric Research Laboratories, Meituan and Baidu.
DSTC10 contains 5 tracks, each covering several subtasks in a particular dialogue area. Among them, Track 5 Task 1, Automatic Open-domain Dialogue Evaluation, systematically and comprehensively introduces the automatic evaluation of open-domain dialogue into the DSTC competition. Automatic evaluation is an important part of dialogue research: it aims to automatically produce dialogue quality scores that agree with human intuition. Compared with slow and costly manual annotation, automatic evaluation methods can score different dialogue systems efficiently and at low cost, which effectively promotes the development of dialogue systems.
Unlike task-oriented dialogues, which have a fixed optimization objective, open-domain dialogues are closer to real human conversation and are harder to evaluate, and have therefore attracted wide attention. The DSTC10 Track 5 Task 1 competition contains 14 validation datasets (37 dialogue evaluation dimensions in total) and 5 test datasets (11 evaluation dimensions in total). The Meituan Voice team won first place in the competition with an average correlation of 0.3104. This work has been written up as the paper "MME-CRS: Multi-Metric Evaluation Based on Correlation Re-Scaling for Evaluating Open-Domain Dialogue" and accepted at an AAAI 2022 Workshop.
2 Introduction to the competition
The open-domain dialogue evaluation competition collects classic datasets from dialogue research papers, including 14 validation datasets (12 Turn-Level datasets and 2 Dialog-Level datasets) and 5 test datasets.
Each conversation in the dataset mainly contains the following information:
- Context: The question in the conversation, or the context of the conversation.
- Response: The reply to the Context, that is, the specific object to be evaluated; the Response in the dialogue dataset is generally generated by different dialogue generation models, such as GPT-2 and T5.
- Reference: Manually given reference answers to the Context, generally around 5.
Each dialogue contains multiple evaluation dimensions, such as the correlation between Context and Response, the fluency of Response itself, etc. The evaluation dimensions of each dataset are different, and the 14 validation sets contain a total of 37 different evaluation dimensions, including Overall, Grammar, Relevance, Appropriateness, Interesting, etc. Each evaluation dimension has a manually annotated score, ranging from 1 to 5, with higher scores indicating higher quality of the current evaluation dimension.
The statistics of the validation set and test set are shown in Figure 2 and Figure 3:
Turns represents the number of dialogue turns in the corresponding dataset; Qualities lists the evaluation dimensions of each dialogue in the dataset, where each dimension has a corresponding human-annotated score; Annos represents the number of annotations in each dataset.
In this competition, every conversation in every dataset is annotated with a human score for each evaluation dimension, generally on a scale of 1 to 5; when multiple annotations exist, their average is used for correlation computation. Teams need to design evaluation metrics to predict the score of each dialogue on each evaluation dimension. For every evaluation dimension of every test dataset, the predicted scores are compared with the human annotations using the Spearman correlation, and the final competition result is the average over the evaluation dimensions of all test datasets.
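As a concrete illustration of the scoring protocol, the snippet below computes the Spearman correlation between a metric's predicted scores and human annotations with SciPy; the two arrays are made-up examples, not competition data.

```python
from scipy.stats import spearmanr

# Hypothetical predicted scores for one evaluation dimension of one test set
predicted = [0.62, 0.41, 0.88, 0.17, 0.73]
# Hypothetical averaged human annotations (1-5 scale) for the same dialogues
human = [3.4, 2.8, 4.6, 1.9, 4.1]

rho, p_value = spearmanr(predicted, human)
print(f"Spearman correlation: {rho:.4f} (p={p_value:.4f})")

# The final competition score averages such correlations over all
# evaluation dimensions of all test datasets.
```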
3 Existing methods and problems
3.1 Existing methods
There are three main categories of automatic evaluation methods for open-domain dialogue.
Overlap-based method
Early researchers likened the Reference and Response in a dialogue system to the source and translated sentences in machine translation, and reused machine translation metrics to evaluate dialogue quality. Overlap-based methods compute the word overlap between the Response and the Reference: the higher the overlap, the higher the score. Classical methods include BLEU [1] and ROUGE [2], where BLEU measures quality via precision and ROUGE via recall. Because such evaluation depends on the given Reference while the set of suitable Responses in the open domain is unbounded, Overlap-based methods are not well suited to open-domain dialogue evaluation.
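For illustration, the sketch below scores a Response against a set of References with sentence-level BLEU from NLTK; the sentences are invented, and real overlap-based pipelines usually add their own tokenization and smoothing choices.

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

references = [
    "i would love to grab a coffee this afternoon".split(),
    "sure , a coffee sounds great".split(),
]
response = "a coffee this afternoon sounds great".split()

# Higher n-gram overlap with any Reference -> higher BLEU score
score = sentence_bleu(references, response,
                      smoothing_function=SmoothingFunction().method1)
print(f"BLEU: {score:.4f}")
```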
Embedding-based method
With the rapid development of word vectors and pretrained language models, Embedding-based evaluation methods have achieved good performance. These methods encode the Response and the Reference with a deep model and compute a correlation score from the two encodings. Representative methods include Greedy Matching [3], Embedding Averaging [4] and BERTScore [5-6]. Compared with Overlap-based methods, Embedding-based methods are a large improvement, but they still depend on the Reference and leave considerable room for optimization.
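A minimal sketch of Embedding Averaging, assuming a hypothetical word-vector lookup table `word_vectors` (e.g., loaded from GloVe): the Response and Reference are each represented by the mean of their word vectors, and the score is the cosine similarity of the two means.

```python
import numpy as np

# Hypothetical word-vector table: token -> np.ndarray (e.g., loaded from GloVe)
word_vectors = {
    "coffee": np.array([0.2, 0.7, 0.1]),
    "tea":    np.array([0.3, 0.6, 0.2]),
    "sounds": np.array([0.5, 0.1, 0.4]),
    "great":  np.array([0.6, 0.2, 0.3]),
}

def sentence_embedding(tokens):
    # Average the vectors of in-vocabulary tokens
    vecs = [word_vectors[t] for t in tokens if t in word_vectors]
    return np.mean(vecs, axis=0)

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

response = sentence_embedding("coffee sounds great".split())
reference = sentence_embedding("tea sounds great".split())
print(f"Embedding-Averaging score: {cosine(response, reference):.4f}")
```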
Learning-based methods
Reference-based evaluation of open-domain dialogue faces a One-To-Many [7] dilemma: the set of appropriate Responses in open-domain dialogue is essentially unbounded, but the manually written References are limited (usually about 5). Evaluation methods based on the similarity (literal overlap or semantic similarity) between Reference and Response therefore have strong limitations. In contrast to Overlap-based and Embedding-based methods, the ADEM method [8] was the first to use a hierarchical encoder to encode the Context and Reference and score the input Response. ADEM optimizes its parameters with the mean squared error between model scores and human scores, aiming to approximate human judgment. ADEM achieved great success compared with Overlap-based and Embedding-based methods, and Learning-based methods have since become the mainstream for automatic open-domain evaluation.
As dialogue evaluation strives to become more accurate and comprehensive, more and more evaluation dimensions keep emerging. To cope with this challenge, USL-H [9] divides evaluation dimensions into three categories: Understandability, Sensibleness and Likeability, as shown in Figure 4. USL-H proposes three corresponding metrics, VUP (Valid Utterance Prediction), NUP (Next Utterance Prediction) and MLM (Mask Language Model), which respectively measure:
- Whether the Response is fluent.
- The degree of relevance between the Context and the Response.
- Whether the Response itself is specific, human-like, etc.
3.2 Problems
The existing evaluation methods mainly have the following problems:
Evaluation metrics are not comprehensive enough to holistically measure dialogue quality
Existing automatic evaluation methods mainly focus on a few evaluation dimensions of individual datasets. Taking the relatively comprehensive USL-H as an example, it considers the fluency and richness of the Response and the relevance of the Context-Response pair, but it ignores:
- More fine-grained Context-Response sentence pair topic coherence.
- The respondent's engagement with the current conversation.
Experiments show that omitting these metrics seriously hurts the performance of the evaluation method. To evaluate multiple dialogue datasets more comprehensively and stably, it is necessary to design metrics that cover more evaluation dimensions.
Lack of an effective method for integrating metrics
Most existing methods design one evaluation metric for each evaluation dimension, which does not scale as evaluation dimensions keep multiplying (the competition validation sets alone contain 37 different evaluation dimensions). Moreover, a single evaluation dimension may depend on several metrics; for example, the Logical dimension requires that 1) the Response be fluent, and 2) the Response be relevant to the Context. Designing basic evaluation sub-metrics and then integrating them with an appropriate ensemble method can represent the different dialogue evaluation dimensions more comprehensively and effectively.
4 Our approach
To address the lack of comprehensive evaluation metrics, this paper designs 7 evaluation metrics in 5 categories (Multi-Metric Evaluation, MME) to comprehensively measure dialogue quality. On top of these 5 categories and 7 basic metrics, we further propose a Correlation Re-Scaling (CRS) method to integrate the scores of the different metrics. We call the resulting model MME-CRS; its overall architecture is shown in Figure 5:
4.1 Basic Indicators
To solve the first problem of existing methods, namely that the designed dialogue metrics are not comprehensive enough, we designed 7 evaluation sub-metrics in 5 categories for the competition.
4.1.1 Fluency Metric (FM)
Purpose: Analyze whether the Response itself is fluent and understandable.
Content: First, build a Response fluency dataset based on the DailyDialog dataset [10]. The process is as follows:
- Randomly select a Response r from the DailyDialog dataset and decide with probability 0.5 whether r is a positive or a negative sample.
- If r is a positive sample, randomly choose one adjustment: a. no adjustment; b. delete each stop word with probability 0.5.
- If r is a negative sample, randomly choose one adjustment: a. randomly shuffle the word order; b. randomly delete a certain proportion of words; c. randomly select some words and repeat them.
After constructing the fluency dataset with the above rules, we fine-tune a SimCSE-based classifier [11] on it. The fine-tuned model can then compute a fluency score for the Response of any dialogue, recorded as the FM score.
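A sketch of the perturbation rules above, assuming a pool of Responses extracted from DailyDialog and a hypothetical `STOP_WORDS` set; it only illustrates the sampling logic, not the exact recipe used in the competition.

```python
import random

STOP_WORDS = {"the", "a", "an", "to", "of", "and", "is"}  # hypothetical stop-word set

def make_fluency_sample(response_tokens):
    """Return (tokens, label): label 1 = fluent (positive), 0 = disfluent (negative)."""
    tokens = list(response_tokens)
    if random.random() < 0.5:                      # positive sample
        if random.random() < 0.5:                  # adjustment b: drop each stop word w.p. 0.5
            tokens = [t for t in tokens
                      if t not in STOP_WORDS or random.random() >= 0.5]
        return tokens, 1
    # Negative sample: pick one corruption at random
    corruption = random.choice(["shuffle", "delete", "repeat"])
    if corruption == "shuffle":                    # a. randomly shuffle the word order
        random.shuffle(tokens)
    elif corruption == "delete":                   # b. randomly delete ~30% of the words
        tokens = [t for t in tokens if random.random() >= 0.3] or tokens[:1]
    else:                                          # c. randomly repeat some words
        tokens = [t for t in tokens for _ in range(2 if random.random() < 0.3 else 1)]
    return tokens, 0

print(make_fluency_sample("i would love to grab a coffee".split()))
```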
4.1.2 Relevance Metric (RM)
Purpose: Analyze the relevance between the Context and the Response.
Content: Build a relevance dataset of Context-Response sentence pairs based on the DailyDialog dataset, where related pairs are positive samples and unrelated pairs are negative samples. The usual way to construct negative samples is to randomly replace the Response with a Response from another dialogue. However, the PONE method [12] pointed out that a randomly selected Response is almost always irrelevant to the Context, so the training benefit is small. Our practice is therefore to randomly sample 10 Responses, compute their semantic similarity to the ground-truth Response, and select the middle-ranked one as the pseudo-negative sample. After constructing the dataset, we fine-tune the SimCSE model on it. The fine-tuned model can be used to compute the relevance score between the Context and the Response, recorded as the RM score.
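A hedged sketch of this negative-sampling idea: sample 10 candidate Responses from other dialogues, rank them by semantic similarity to the ground-truth Response, and keep the middle-ranked one as the pseudo-negative. `embed` stands in for any sentence encoder (e.g., SimCSE) and is a hypothetical placeholder here.

```python
import random
import numpy as np

def embed(sentence: str) -> np.ndarray:
    """Hypothetical sentence encoder (e.g., SimCSE); replaced here by a seeded random projection."""
    rng = np.random.default_rng(abs(hash(sentence)) % (2 ** 32))
    return rng.normal(size=128)

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def sample_pseudo_negative(true_response: str, response_pool: list, k: int = 10) -> str:
    candidates = random.sample(response_pool, k)
    # Rank candidates by similarity to the true Response and take the middle one:
    # not trivially irrelevant (like a random pick), but not a near-duplicate either.
    ranked = sorted(candidates, key=lambda r: cosine(embed(r), embed(true_response)))
    return ranked[len(ranked) // 2]
```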
4.1.3 Topic Coherence Metric (TCM)
Purpose: Analyze the topic consistency between the Context and the Response.
Content: The GRADE method [13] constructs a topic-level graph representation of the Context and Response and computes their topic-level coherence. Compared with the coarse-grained relevance metric, GRADE focuses on fine-grained topic coherence, which is an effective complement to the relevance metric. The TCM metric draws on the GRADE method.
The specific process is as follows: first, extract the keywords of the Context and the Response to build a graph, where each keyword is a node and edges exist only between Context keywords and Response keywords. The initial representation of each node is obtained from ConceptNet, and a graph attention network (GAT) is then used to aggregate information from each keyword's neighbors and iteratively update the node representations. Finally, all node representations are combined into a graph representation of the dialogue. A fully connected layer on top of this topic-level graph representation performs classification, and the fine-tuned model can be used to compute the TCM score of a dialogue.
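A simplified sketch of the graph-construction step, with a hypothetical `extract_keywords` helper standing in for the keyword extraction and ConceptNet lookups used by GRADE; edges connect every Context keyword to every Response keyword, and the node embeddings would then be refined with graph attention layers.

```python
def extract_keywords(text: str) -> set:
    """Hypothetical keyword extractor (GRADE uses rule-based extraction plus ConceptNet)."""
    stop = {"i", "you", "a", "the", "to", "is", "it", "do", "in", "my", "yes", ","}
    return {t for t in text.lower().split() if t not in stop}

def build_topic_graph(context: str, response: str):
    ctx_nodes = extract_keywords(context)
    rsp_nodes = extract_keywords(response)
    # Bipartite edges: only between Context keywords and Response keywords
    edges = [(c, r) for c in ctx_nodes for r in rsp_nodes]
    return ctx_nodes | rsp_nodes, edges

nodes, edges = build_topic_graph("do you like hiking in the mountains ?",
                                 "yes , hiking is my favorite weekend sport")
print(nodes)
print(edges[:5])
```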
4.1.4 Engagement Metric (EM)
Purpose: Analyze how willing the person or dialogue model that generated the Response is to engage in the current conversation.
Content: The metrics above all evaluate dialogue quality from the perspective of the Context and Response, whereas user engagement is measured from the user's perspective. Engagement scores generally range from 0 to 5: the larger the score, the more interested the user is in taking part in the current conversation. We rescale the engagement scores of the ConvAI dataset [10] from 1-5 to 0-1 and use them as the engagement score dataset. The pretrained model is again SimCSE, fine-tuned to predict the engagement score of a conversation; the predicted user engagement score is denoted EM.
4.1.5 Specificity Metric (SM)
Purpose: Analyze whether the Response itself is sufficiently specific.
Content: The SM metric is used to penalize vague Responses that carry little information.
The specific method is as follows: mask each token of the Response in turn and compute the Negative Log-Likelihood loss with the MLM head of the SimCSE model; the resulting score is called SM-NLL. Replacing the loss function with Negative Cross-Entropy or Perplexity yields the SM-NCE and SM-PPL scores respectively, for a total of three SM scores. Each of the three SM scores is normalized to the range 0 to 1.
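A hedged sketch of the SM-NLL computation with Hugging Face Transformers: each Response token is masked in turn and scored under a masked language model. `bert-base-uncased` is used as a stand-in checkpoint here; the competition system fine-tunes SimCSE instead.

```python
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")  # stand-in for SimCSE
model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased").eval()

def sm_nll(response: str) -> float:
    """Average negative log-likelihood of each Response token under the MLM."""
    ids = tokenizer(response, return_tensors="pt")["input_ids"][0]
    nlls = []
    for pos in range(1, len(ids) - 1):          # skip [CLS] and [SEP]
        masked = ids.clone()
        masked[pos] = tokenizer.mask_token_id   # mask the current token
        with torch.no_grad():
            logits = model(masked.unsqueeze(0)).logits[0, pos]
        log_probs = torch.log_softmax(logits, dim=-1)
        nlls.append(-log_probs[ids[pos]].item())
    return sum(nlls) / len(nlls)

print(sm_nll("I really enjoy hiking in the mountains on weekends."))
```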
4.2 Integrated approach CRS
Integrating the scores of different evaluation metrics is an effective way to improve automatic dialogue evaluation.
For each dialogue to be evaluated, the 5 categories and 7 basic metrics above yield 7 scores. For a given evaluation dimension of a dataset, these 7 metric scores must be combined into one composite score, which is then correlated with the human score. Our ensemble method consists of the following two steps.
4.2.1 Calculation of weight distribution of different evaluation dimensions
First, compute the correlation between each of the 7 evaluation metrics and the human scores for every evaluation dimension of every validation dataset. The larger the correlation, the more important the metric is considered to be for that evaluation dimension. More important metrics are assigned larger weights, and the weights are then re-normalized over the metric dimension, yielding the weight distribution of the different metrics for each evaluation dimension of each dataset:
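Written out explicitly (a hedged reconstruction from the description; $W_{ijk}$ is assumed notation for the weight of metric $k$ on dimension $j$ of dataset $i$):

$$
W_{ijk} = \frac{S_{ijk}^{\,d_{ij}}}{\sum_{k'} S_{ijk'}^{\,d_{ij}}}
$$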
where $S_{ijk}$ is the correlation score of the $k$-th evaluation metric on the $j$-th evaluation dimension of the $i$-th dataset, and $d_{ij}$ is the power applied to the correlation score: the larger $d_{ij}$, the larger the weight given to metrics with higher correlation scores. Empirically, the ensemble works best when max($S_{ijk}^{d_{ij}}$) lies between 1/3 and 1/2, which gives a simple and effective way to choose $d_{ij}$. In our experiments, setting $d_{ij}$ to a constant generalized better: we set $d_{ij}$ to 2, computed the weight distribution on the validation set, and transferred it to the test set, achieving the best performance in the competition.
Then, in the dataset dimension, the weights of the same evaluation dimension across different datasets are averaged, yielding the weight distribution of each evaluation dimension over the different evaluation metrics:
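In symbols (again a reconstruction; $W_{jk}$ is the dimension-level weight and $N_j$ the number of validation datasets that contain dimension $j$, both assumed notation):

$$
W_{jk} = \frac{1}{N_j} \sum_{i=1}^{N_j} W_{ijk}
$$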
Note that the weight distribution obtained here no longer depends on any specific dataset, so it can be transferred to the test set.
4.2.2 Calculate the weighted sum of indicator scores
For each evaluation dimension of each test set, compute the 7 metric scores and take their weighted sum with the weights from the first step to obtain the composite score:
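Writing $M_k$ for the score of the $k$-th basic metric on a dialogue (assumed notation), the composite score for evaluation dimension $j$ is, per our reconstruction:

$$
\text{Score}_{j} = \sum_{k=1}^{7} W_{jk}\, M_{k}
$$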
The correlation between the weighted composite score and the human score is then computed, giving the correlation between the model score and human judgment on each evaluation dimension.
Because the ensemble weights are obtained by re-scaling and re-normalizing the correlation scores of the metrics, we call this ensemble method Correlation Re-Scaling (CRS). Applying the CRS ensemble to the MME metrics yields the MME-CRS evaluation algorithm.
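Putting the two steps together, a minimal NumPy sketch of CRS is shown below; `corr` holds per-dataset, per-dimension, per-metric Spearman correlations from the validation set, `metric_scores` holds the 7 metric scores of one test dialogue, and d = 2 follows the constant used in the competition. The shapes, variable names, and the clipping of negative correlations are illustrative assumptions.

```python
import numpy as np

def crs_weights(corr: np.ndarray, d: float = 2.0) -> np.ndarray:
    """corr: (num_datasets, num_dims, num_metrics) validation-set correlations.
    Returns (num_dims, num_metrics) weights: raised to the power d,
    re-normalized over metrics, then averaged over datasets."""
    powered = np.clip(corr, 0.0, None) ** d                  # ignore negative correlations (assumption)
    per_dataset = powered / (powered.sum(axis=-1, keepdims=True) + 1e-8)
    return per_dataset.mean(axis=0)                          # average over datasets

def crs_score(metric_scores: np.ndarray, weights: np.ndarray, dim: int) -> float:
    """Weighted sum of the 7 metric scores for one dialogue on dimension `dim`."""
    return float(weights[dim] @ metric_scores)

# Toy usage: 2 validation datasets, 3 dimensions, 7 metrics
corr = np.random.rand(2, 3, 7)
weights = crs_weights(corr, d=2.0)
one_dialogue = np.random.rand(7)    # 7 basic metric scores (FM, RM, TCM, EM, 3x SM)
print(crs_score(one_dialogue, weights, dim=0))
```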
5 Experimental analysis
5.1 Experimental results
Our metrics are mainly pretrained and fine-tuned on the DailyDialog dataset (except for the EM sub-metric, which uses the ConvAI2 dataset); the ensemble weights are computed on the competition validation sets, and the method finally achieves an average Spearman correlation of 0.3104 on the test sets.
Figure 6 shows the performance of the competition baseline Deep AM-FM [14] and the top 5 teams on each evaluation dimension of each test dataset. Our method ranked first with an average Spearman correlation coefficient of 0.3104 and placed first on 6 of the 11 evaluation dimensions across the 5 test datasets, demonstrating its superior performance.
For ease of presentation, the figure uses a dataset-dimension notation, where J, E, N, DT and DP denote the JSALT, ESL, NCM, DSTC10-Topical and DSTC10-Persona datasets, and A, C, G and R denote the Appropriateness, Content, Grammar and Relevance evaluation dimensions. The best performance on each evaluation dimension is shown in bold.
5.2 Ablation experiment
In the ablation study, we take the full MME-CRS evaluation as the baseline and remove the FM, RM, TCM, EM, SM and RM+TCM metrics in the ensemble stage respectively, to compare the importance of the different metrics during integration. The results are shown in Figure 7:
Both the relevance metric RM and the topic coherence metric TCM use the Context and Response information of the dialogue, so the experiment also removes these two metrics together to observe the impact on performance. From the results in Figure 7 we can see that:
- TCM, RM, and EM contribute the most to model performance. After removing each of them in the score-integration stage, the average Spearman correlation on the test set decreases by 3.26%, 1.56% and 1.01%, respectively.
- The coarse-grained RM metric and the fine-grained TCM metric are complementary. Removing RM or TCM alone causes only a slight drop, but removing both deprives the evaluation method of Context-related information and performance drops sharply, by 11.07%.
- The contribution of the SM metric on the test set is basically negligible. Our analysis is that the generation models used to produce the Responses in the test set overfit their corpora and generate many very specific Responses that are nonetheless irrelevant to the Context, so the SM metric has little effect on evaluating the quality of the test set.
5.3 CRS effect
To analyze the effect of the CRS ensemble algorithm, we compare the MME-CRS method with MME-Avg (a simple average of the MME metric scores), as shown in Figure 8:
As the figure shows, MME-CRS outperforms MME-Avg by 3.49%, demonstrating the advantage of the CRS algorithm in integrating the sub-metric scores.
6 Summary
In this competition, we identified two main problems in the automatic evaluation of open-domain dialogue: evaluation metrics are not comprehensive enough, and effective metric-integration methods are lacking. To address the first problem, this paper designs 7 evaluation metrics in 5 categories to measure dialogue quality comprehensively; on top of these basic metrics, a correlation re-scaling method is proposed to compute the ensemble score for each dialogue evaluation dimension.
Although our method achieved good results in the DSTC10 competition, we will continue to explore more effective evaluation metrics and metric-integration methods. We are also applying the techniques from the competition to Meituan businesses, such as the intelligent outbound robot, intelligent marketing and intelligent customer service in the voice interaction center, evaluating conversations between machines, human agents and users along many dimensions to continuously optimize dialogue quality and improve user satisfaction.
References
[1] Papineni, K.; Roukos, S.; Ward, T.; and Zhu, W.-J. 2002. Bleu: A method for automatic evaluation of machine translation. In Proceedings of the 40th annual meeting of the Association for Computational Linguistics, 311–318.
[2] Lin, C.-Y. 2004. ROUGE: A package for automatic evaluation of summaries. In Text Summarization Branches Out, 74–81.
[3] Rus, V.; and Lintean, M. 2012. An optimal assessment of natural language student input using word-to-word similarity metrics. In International Conference on Intelligent Tutoring Systems, 675–676. Springer.
[4] Wieting, J.; Bansal, M.; Gimpel, K.; and Livescu, K. 2016. Towards universal paraphrastic sentence embeddings. In 4th International Conference on Learning Representations.
[5] Zhang, T.; Kishore, V.; Wu, F.; Weinberger, K. Q.; and Artzi, Y. 2019. BERTScore: Evaluating text generation with BERT. In International Conference on Learning Representations.
[6] Liu, C.-W.; Lowe, R.; Serban, I. V.; et al. 2016. How NOT to evaluate your dialogue system: An empirical study of unsupervised evaluation metrics for dialogue response generation. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, 2122–2132.
[7] Zhao, T.; Zhao, R.; and Eskenazi, M. 2017. Learning discourse-level diversity for neural dialog models using conditional variational autoencoders. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 654–664.
[8] Lowe, R.; Noseworthy, M.; Serban, I. V.; et al. 2017. Towards an automatic Turing test: Learning to evaluate dialogue responses. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 1116–1126.
[9] Phy, V.; Zhao, Y.; and Aizawa, A. 2020. Deconstruct to reconstruct a configurable evaluation metric for open-domain dialogue systems. In Proceedings of the 28th International Conference on Computational Linguistics, 4164–4178.
[10] Zhao, T.; Lala, D.; and Kawahara, T. 2020. Designing precise and robust dialogue response evaluators. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, 26–33.
[11] Gao, T.; Yao, X.; and Chen, D. 2021. SimCSE: Simple contrastive learning of sentence embeddings. arXiv preprint arXiv:2104.08821.
[12] Lan, T.; Mao, X.-L.; Wei, W.; Gao, X.; and Huang, H. 2020. Pone: A novel automatic evaluation metric for open-domain generative dialogue systems. ACM Transactions on Information Systems (TOIS), 39(1): 1–37.
[13] Huang, L.; Ye, Z.; Qin, J.; Lin, L.; and Liang, X. 2020. Grade: Automatic graph-enhanced coherence metric for evaluating open-domain dialogue systems. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), 9230–9240.
[14] Zhang, C.; D’Haro, L. F.; Banchs, R. E.; Friedrichs, T.; and Li, H. 2021. Deep AM-FM: Toolkit for automatic dialogue evaluation. In Conversational Dialogue Systems for the Next Decade, 53–69. Springer.
About the Author
Pengfei, Xiaohui, Kaidong, Wang Jian, Chunyang and others are engineers of the Meituan Platform / Voice Interaction Department.
This article is produced by the Meituan technical team, and the copyright belongs to Meituan. You are welcome to reprint or use the content of this article for non-commercial purposes such as sharing and communication, provided you indicate "Content reproduced from the Meituan technical team". This article may not be reproduced or used commercially without permission. For any commercial use, please send an email to tech@meituan.com to apply for authorization.