## 1 Introduction

Recent years have witnessed remarkable progress in retrieval-based open-domain conversation systems [6, 3], and various methods have been proposed for response selection [3, 16, 22, 1]. A key problem in response selection is how to measure the matching degree between a conversation context and a response candidate. Many efforts have been made to construct effective matching models with neural architectures [16, 22].

To construct the training data, a widely adopted approach is to pair a positive response with several randomly selected utterances serving as negative responses, since labeling true negative responses is very time-consuming. Although this method requires no labeled negative data, random sampling is likely to introduce noise. In real-world datasets, a randomly sampled utterance may be a “*false negative*”: it is treated as a negative response even though it can appropriately reply to the last utterance. For example, generic utterances such as “OK!” or “It’s great.” can safely respond to many conversations.
As shown in existing studies [15, 7, 1], the noise from random sampling severely affects the performance of the matching model.
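
As an illustration, the random-sampling construction described above can be sketched as follows (the data layout and function name here are our own, not from the paper):

```python
import random

def build_training_pairs(dialogues, num_negatives=1, seed=0):
    """Pair each positive (context, response) with utterances sampled
    at random from the whole response pool as 'negatives'."""
    rng = random.Random(seed)
    pool = [d["response"] for d in dialogues]
    instances = []
    for d in dialogues:
        instances.append((d["context"], d["response"], 1))  # positive
        for _ in range(num_negatives):
            # A generic sampled response ("OK!") may actually fit the
            # context -- the "false negative" noise discussed above.
            instances.append((d["context"], rng.choice(pool), 0))
    return instances
```

Nothing in this construction checks whether a sampled negative truly fails to answer the context, which is exactly the source of label noise the paper targets.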

However, we do not have any labeled data for true negative samples. To address this difficulty, we draw inspiration from recent progress in complementary learning [17, 14].
We design a main-complementary task pair. As shown in Figure 1, the left side is the main task (*i.e.,* our focus), which selects the correct response given the last utterance and context, while the right side is the complementary task, which selects the last utterance given the response and context.
To implement this connection, we derive a weighted margin-based optimization objective for the main task. This objective is general enough to work with various matching models, and it elegantly utilizes two different perspectives on utterance selection: last-utterance selection and response selection. The main task is assisted by the complementary task, and its performance is thereby improved.

To summarize, the major novelty of our approach lies in capturing supervision signals from different perspectives, which effectively reduces the influence of noisy data. The approach is general and flexible enough to be applied to various deep matching models. We conduct extensive experiments on two public datasets, and the results on both indicate that models learned with our approach significantly outperform their counterparts learned with other strategies.

## 2 Related Work

Recently, data-driven approaches for chatbots [9, 3] have achieved promising results. Existing work can be categorized into generation-based methods [9, 11, 6, 20] and retrieval-based methods [3, 18, 21]. The first group of approaches learns response generation from data. Based on the sequence-to-sequence structure with attention [11], multiple extensions have been made to tackle the “safe response” problem and generate informative responses [6, 20]. Retrieval-based methods try to find the most reasonable response from a large repository of conversational data [3, 16]. Recent work pays more attention to context-response matching for multi-turn response selection [18, 16, 22].

Instance weighting is a semi-supervised approach proposed by Grandvalet et al. [2]. The key idea is to train the model with a weighted margin-based optimization objective, where a weight function produces a reward for each instance. Researchers have since used this method to improve models trained on noisy data [8] and extended it to other tasks [4, 1]. A recent work showed that the instance weighting strategy can be extended to different machine learning models and validated the improvement on different tasks.

Our work is inspired by the work of using new learning strategies to distinguish the noise in training data [10, 15, 7]. Shang et al. [10] and Lison et al. [7] utilized instance weighting strategy in open domain dialog systems via simple methods. Wu et al. [15] altered the negative sampling strategy and utilized a sequence-to-sequence model to distinguish false negative samples. Feng et al. [1] proposed three co-teaching mechanisms to reduce noise.

Different from the aforementioned works, we utilize the last-utterance selection task as a complementary task that assists the response selection task by computing instance weights. The complementary task is similar to the main task: it simply exchanges the last utterance with the response. Our method resembles a dual-learning approach, with the difference that the complementary model is not optimized together with the main model but only provides instance weights to assist the main task. Besides, the two tasks share the same neural architecture but leverage different supervision signals from the data.

## 3 Preliminaries

We denote a conversation as $\{u_1, u_2, \ldots, u_n\}$, where each utterance $u_i$ is a conversation sentence.
A dialogue system is built to give the next utterance $u_{n+1}$ to reply to $\{u_1, \ldots, u_n\}$.
We refer to the last known utterance (*i.e.,* $u_n$) as the *last-utterance*, and the utterance to be predicted (*i.e.,* $u_{n+1}$) as the *response*.

We assume a training set represented by $\mathcal{D} = \{(c_i, l_i, r_i, y_i)\}_{i=1}^{N}$, where $c_i$ denotes the previous utterances $\{u_1, \ldots, u_{n-1}\}$, and $l_i$ and $r_i$ denote the last-utterance and the response respectively. $y_i \in \{0, 1\}$ is a label indicating whether $r_i$ is an appropriate response to the entire conversation context consisting of $c_i$ and $l_i$.

A retrieval-based dialogue system is designed to select the correct response from a candidate response pool based on the context (namely $c$ and $l$). This is commonly called the *multi-turn response selection task* [18, 16]. Formally, we solve this task by learning a matching model between the last-utterance and the response given the context, which computes the conditional probability $P(y = 1 \mid c, l, r)$, *i.e.,* the probability that $r$ can appropriately reply to $c$ and $l$. For simplification, we represent this probability by $g(c, l, r)$. A commonly adopted loss for the matching model is the cross-entropy:

$$\mathcal{L}_{CE} = -\sum_{i=1}^{N} \Big[\, y_i \log g(c_i, l_i, r_i) + (1 - y_i) \log\big(1 - g(c_i, l_i, r_i)\big) \Big] \tag{1}$$

This is indeed a binary classification task. The optimization loss drives the probability of the positive utterance to be one and the negative utterance to be zero.
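
As a concrete sketch, the cross-entropy objective over the matching model's predicted probabilities can be written as a short function (names are ours):

```python
import math

def cross_entropy_loss(probs, labels, eps=1e-12):
    """Binary cross-entropy over (probability, 0/1 label) pairs."""
    total = 0.0
    for p, y in zip(probs, labels):
        p = min(max(p, eps), 1.0 - eps)  # clamp for numerical safety
        total -= y * math.log(p) + (1 - y) * math.log(1 - p)
    return total
```

Minimizing this loss pushes the predicted probability toward one for positive instances and toward zero for negative ones.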

## 4 Approach

In this section, we present the proposed approach to learning matching models for multi-turn response selection. Our idea is to assign different weights to training instances so that the model is forced to focus on confident ones. An overall illustration of the proposed approach is shown in Figure 2. In our approach, we give a general weight-enhanced margin-based optimization objective, where the weights indicate the reliability of different instances, and we design a complementary task, last-utterance prediction, to automatically set the weights of training instances used in the main task.

### 4.1 A Pairwise Weight-enhanced Optimization Objective

Previous methods treat all sampled responses equally, which makes them easily influenced by noise in the training data. To address this problem, we propose a general weight-enhanced optimization objective. We consider a pairwise setting: each training instance consists of a positive response and a negative response for a last-utterance, denoted by $r_i^+$ and $r_i^-$. For convenience, we assume each positive response is paired with a single negative sample.

The basic idea is to minimize the *Weighted Margin-based Loss* in a pairwise way, which is defined as:

$$\mathcal{L}_{WM} = \sum_{i=1}^{N} w_i \cdot \max\Big(0,\ \gamma - \big(g(c_i, l_i, r_i^+) - g(c_i, l_i, r_i^-)\big)\Big) \tag{2}$$

where $w_i$ is the weight for the $i$-th instance consisting of $r_i^+$ and $r_i^-$, and $\gamma$ is a parameter controlling the threshold of the difference. $g(c_i, l_i, r_i^+)$ and $g(c_i, l_i, r_i^-)$ denote the conditional probabilities of an utterance being an appropriate and an inappropriate response for the context. When the probability of a negative response is larger than that of a positive one, we penalize it by adding the difference to the loss. This objective is general and can work with various matching methods.
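
A minimal sketch of this weighted margin objective (function and argument names are ours; `margin` plays the role of the threshold parameter):

```python
def weighted_margin_loss(pos_scores, neg_scores, weights, margin=0.5):
    """Per-instance hinge on the score gap between the positive and
    negative response, scaled by the instance weight."""
    return sum(
        w * max(0.0, margin - (gp - gn))
        for gp, gn, w in zip(pos_scores, neg_scores, weights)
    )
```

With all weights fixed to one this reduces to a plain pairwise hinge loss; down-weighting suspected false negatives shrinks their contribution to the gradient.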

### 4.2 Instance Weighting with Last-Utterance Selection Model

A major difficulty in setting the weights in Equation 2 is that there is no external supervision information. Inspired by recent progress in self-supervised learning and co-teaching [1, 7], we leverage supervision signals from the data itself. Since response selection aims to select a suitable response from a candidate response pool, we devise a complementary task (*i.e.,* last-utterance selection) that is trained with an assistant signal for setting the weights.

#### 4.2.1 Last-Utterance Selection

Similar to response selection, negative last-utterances can be randomly sampled here to train the selection model. The complementary task captures data characteristics from a different perspective, so the learned complementary model can be used to set the weights by providing evidence on instance importance.

#### 4.2.2 Instance Weighting

After learning the last-utterance selection model, we now utilize it to set the weights of training instances. The basic idea is that if an utterance is a proper response, it should match the real last-utterance $l$ well. On the contrary, a true negative response should be uninformative for predicting the last-utterance. Therefore, we introduce a new measure of the degree to which an utterance is a true positive response:

$$m(r) = g'(c, r, l) - g'(c, r, l^-) \tag{3}$$

where $g'(c, r, l)$ and $g'(c, r, l^-)$ are the conditional probabilities of the real last-utterance $l$ and a sampled negative last-utterance $l^-$, as learned by the last-utterance selection model. In this way, a false negative response tends to yield a large value, since it can reply to $l$ and contains useful information to discriminate between $l$ and $l^-$. With this measure, we introduce our solution for setting the weights defined in Eq. 2. Recall that a training instance is a pair of positive and “negative” utterances, and we want to assign it a weight indicating how much attention the response selection model should pay to it. Intuitively, a good training instance should provide useful information to discriminate between positive and negative responses. We define the instance weighting formula as:

$$w_i = \min\big(m(r_i^+) - m(r_i^-) + \eta,\ 1\big) \tag{4}$$

where $\eta$ is a parameter that adjusts the mean value of the weights, and we constrain each weight to be at most 1. From this formula, we can see that a large weight tends to correspond to a large $m(r_i^+)$ (a more informative positive response) and a small $m(r_i^-)$ (a less discriminative negative utterance).
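
Under our reading of Equations 3 and 4, the weighting step can be sketched as below (the exact functional form, the clipping at zero, and all names are assumptions on our part):

```python
def true_positive_degree(p_true_last, p_neg_last):
    """Eq. (3) sketch: how much better an utterance predicts the real
    last-utterance than a sampled negative last-utterance."""
    return p_true_last - p_neg_last

def instance_weight(m_pos, m_neg, eta=0.25):
    """Eq. (4) sketch: larger when the positive response is informative
    (large m_pos) and the negative is uninformative (small m_neg);
    eta shifts the mean weight, and the result is capped at 1."""
    return min(1.0, max(0.0, m_pos - m_neg + eta))
```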

### 4.3 Complete Learning Approach and Optimization

In this part, we present the complete learning approach.

#### 4.3.1 Instantiation of the Deep Matching Models

We instantiate matching models for response selection. Our learning algorithm can work with any deep matching model; here, we consider two recently proposed attention-based models, namely SMN [16] and DAM [22]. The SMN model is an RNN-based model. It first constructs semantic representations of the context and the response with GRUs. Matching features are then captured by word-level and sequence-level similarity matrices. Finally, a convolutional neural network distills the important matching information into a matching vector, and an utterance-level GRU computes the matching score. The DAM model is a deep attention-based model that constructs semantic representations of the context and the response with a multi-layer Transformer. Word-level matching features are then captured by cross-attention and self-attention layers. Finally, a 3D convolution computes the matching score. These two models are selected for their state-of-the-art performance on multi-turn response selection. Besides, previous studies have also adapted them with techniques such as weakly supervised learning [16] and co-teaching [1].

#### 4.3.2 Learning and Optimization

Given a matching model, we first pre-train it with the cross-entropy loss in Equation 1. This step yields a basic model that is further fine-tuned by our approach. For each instance, consisting of a positive and a negative response, the last-utterance selection model computes the value in Equation 3 for each response. The weights are then derived by Equation 4 and used in the fine-tuning process via Equation 2. The gradient is back-propagated to optimize the parameters of the response selection model (the gradient to the last-utterance selection model is blocked). This training approach encourages the model to focus on more confident instances, guided by the supervision signal from the complementary task.
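
The fine-tuning stage can be sketched with hypothetical interfaces (`match_score`, `last_utt_degree`, and `step` are stand-ins for the trained models and the optimizer, not names from the paper); note that the last-utterance model is only queried, never updated:

```python
def fine_tune(instances, match_score, last_utt_degree, step,
              margin=0.5, eta=0.25):
    """One pass of weighted fine-tuning (illustrative sketch).

    instances       -- (context, last_utterance, r_pos, r_neg) tuples
    match_score     -- response selection model being trained
    last_utt_degree -- frozen last-utterance model's Eq. (3) value
    step            -- applies one optimizer update for a loss value
    """
    for c, l, r_pos, r_neg in instances:
        # Weight from the complementary model (no gradient flows here).
        w = min(1.0, max(0.0, last_utt_degree(c, r_pos)
                              - last_utt_degree(c, r_neg) + eta))
        # Weighted margin loss for the main response selection model.
        loss = w * max(0.0, margin - (match_score(c, l, r_pos)
                                      - match_score(c, l, r_neg)))
        step(loss)
```

In a real implementation the two models would be neural networks and `step` an optimizer update with the complementary model's outputs detached from the computation graph.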

#### 4.3.3 Discussions

In addition to the measure defined in Equation 4, we have considered other alternatives, such as the Jaccard similarity and the embedding cosine similarity between positive and negative responses. It is also possible to replace our multi-turn last-utterance selection model with a single-turn one to reduce the influence of context information. Currently, we do not fine-tune the last-utterance selection model, since this strategy brought no significant improvement in our early experiments. More details are discussed in Section 5.3.

## 5 Experiment

In this section, we first set up the experiments, and then report the results and analysis.

### 5.1 Experimental Setup

#### 5.1.1 Construction of the Datasets

To evaluate the performance of our approach, we use two public open-domain multi-turn conversation datasets. The first is the Douban Conversation Corpus (Douban), a multi-turn Chinese conversation dataset crawled from the Douban group (https://www.douban.com/group/explore). It consists of one million context-response pairs for training, 50,000 pairs for validation, and 6,670 pairs for testing. The second is the E-commerce Dialogue Corpus (ECD) [19], which consists of real-world conversations between customers and customer service staff on Taobao (https://www.taobao.com/). It contains one million context-response pairs in the training set and 10,000 pairs in each of the validation and test sets. For both datasets, the negative responses in the training and validation sets are randomly sampled, with a positive-to-negative ratio of 1:1. (In the released ECD training data, negative responses are automatically collected by ranking the response corpus with Apache Lucene, using conversation-history-augmented queries. Because retrieving negative samples from an index introduces more noisy data, we reconstruct the negative responses by random sampling from the training data; experiments on the original training data showed smaller improvements than on our rebuilt data.) In the test sets, each context has 10 response candidates retrieved from an index, whose appropriateness with regard to the context is judged by human annotators.

#### 5.1.2 Task Setting

We implement our method as described in Section 4.3, and select DAM [22] and SMN [16] as the response selection models. We use only DAM [22] as the last-utterance selection model, both for its strong feature extraction ability and to guarantee that any gain comes solely from the response selection side. The pre-training process follows the settings in [22, 16]. During instance weighting, we use mini-batches of size 50 and the Adam optimizer [5] with a learning rate of 1e-4. All gradients are clipped to 1.0 to stabilize training. We tune the weight-adjustment parameter in Equation 4 over {0, 1/8, 2/8, 3/8, 4/8}, finally choosing 2/8 for the Douban dataset and 4/8 for the ECD dataset, and we test the margin threshold in Equation 2 over {0, 1/4, 2/4, 3/4}, finding 2/4 to be the best choice for both datasets. Following previous work [16, 22], we use *Mean Average Precision* (MAP), *Mean Reciprocal Rank* (MRR), and *Precision at position 1* (P@1) as evaluation metrics.
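
The three metrics can be computed from per-context ranked candidate lists; a small reference sketch (our own implementation, assuming binary relevance labels):

```python
def rank_metrics(candidate_lists):
    """Compute MAP, MRR and P@1 over a list of per-context candidate
    lists; each candidate is a (score, label) pair, label in {0, 1}."""
    ap_sum = rr_sum = p1_sum = 0.0
    for candidates in candidate_lists:
        ranked = sorted(candidates, key=lambda x: x[0], reverse=True)
        hits, precisions, first_hit = 0, [], None
        for rank, (_, label) in enumerate(ranked, start=1):
            if label == 1:
                hits += 1
                precisions.append(hits / rank)
                if first_hit is None:
                    first_hit = rank
        ap_sum += sum(precisions) / max(hits, 1)          # average precision
        rr_sum += 1.0 / first_hit if first_hit else 0.0   # reciprocal rank
        p1_sum += 1.0 if ranked[0][1] == 1 else 0.0       # precision@1
    n = len(candidate_lists)
    return ap_sum / n, rr_sum / n, p1_sum / n
```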

#### 5.1.3 Baseline Models

We combine our approach with SMN and DAM to validate the effect. Besides, we compare our models with a number of baseline models:

SMN [16] and DAM [22]: we use the pre-trained versions of these two models as baselines to measure the improvement brought by our proposed method.

Single-turn models: MV-LSTM [12] and match-LSTM [13] are typical single-turn matching models. They concatenate all utterances in a context into a long document for matching.

Multi-view [21]: It measures the matching degree between a context and a response candidate in both a word view and an utterance view.

DL2R [18]: It represents each utterance in contexts by RNNs and CNNs, and the matching score is computed based on the concatenation of the representations.

In addition to these baseline models, we denote the model with our proposed weighting method as Model-WM.

### 5.2 Results and Analysis

| Dataset | Douban | | | ECD | | |
|---|---|---|---|---|---|---|
| Models | MAP | MRR | P@1 | MAP | MRR | P@1 |
| MV-LSTM | 0.498 | 0.538 | 0.348 | 0.613 | 0.684 | 0.525 |
| Match-LSTM | 0.500 | 0.537 | 0.345 | - | - | - |
| Multi-view | 0.505 | 0.543 | 0.342 | - | - | - |
| DL2R | 0.488 | 0.527 | 0.330 | 0.604 | 0.661 | 0.489 |
| SMN | 0.530 | 0.569 | 0.378 | 0.666 | 0.739 | 0.591 |
| SMN-WM | 0.550* | 0.589* | 0.397* | 0.670 | 0.749* | 0.612* |
| DAM | 0.551 | 0.598 | 0.423 | 0.683 | 0.756 | 0.621 |
| DAM-WM | 0.584* | 0.636* | 0.459* | 0.686 | 0.771* | 0.647* |

Table 1: Results on two datasets. Numbers marked with * indicate that the improvement over the pre-trained baseline is statistically significant (t-test with p-value < 0.05). We copy the numbers from [16] for the baseline models. Because the first four baselines obtain similar results on the Douban dataset, we only implement two of them on the ECD dataset.

We present the results of all comparison methods in Table 1. First, these methods show a consistent trend on both datasets over all metrics, *i.e.,* DAM-WM > DAM > SMN-WM > SMN > other models. We conclude that DAM and SMN are the best baselines for this task because they capture more semantic features from word-level and sentence-level matching information. Second, our method improves SMN and DAM on both datasets, and most of these improvements are statistically significant (t-test with p-value < 0.05). This proves the effectiveness of our instance weighting method.

Third, the improvement on the Douban dataset is larger than that on the ECD dataset. The difference may stem from the distributions of the two test sets: the Douban test set is built by random sampling, while the ECD test set is constructed by a response retrieval system, so its negative samples are more semantically similar to the positive ones, which makes it harder for our approach to improve SMN and DAM. Fourth, our method yields less improvement with SMN than with DAM. A possible reason is that DAM fits our method better: it is a deep attention-based network with stronger learning capacity. Another possible reason is that DAM is less sensitive to noisy training data; we observed that the convergence of SMN is not as stable as that of DAM.

### 5.3 Variations of Our Method

| Method | Models | MAP | MRR | P@1 |
|---|---|---|---|---|
| Original | DAM | 0.551 | 0.598 | 0.423 |
| Heuristic | DAM-uniform | 0.577 | 0.623 | 0.433 |
| | DAM-random | 0.549 | 0.594 | 0.399 |
| | DAM-jaccard | 0.572 | 0.622 | 0.438 |
| | DAM-embedding | 0.573 | 0.615 | 0.426 |
| Model-based | DAM-DAM | 0.580 | 0.627 | 0.438 |
| | DAM-last-WM | 0.578 | 0.625 | 0.439 |
| | DAM-dual | 0.579 | 0.621 | 0.430 |
| Ours | DAM-WM | 0.584 | 0.636 | 0.459 |

Table 2: Results of different variations of our method on the Douban dataset.

In this section, we explore a series of variations of our method, replacing the multi-turn last-utterance selection model with other models or replacing the weight produced by Equation 4 with heuristic alternatives. The experiments in this part are conducted on the Douban dataset with DAM [22] as the base model.

#### 5.3.1 Heuristic Method

We consider the following methods, which replace the weight produced by Equation 4 with heuristic alternatives.

DAM-uniform: we fix every weight to one and follow the same procedure as our learning approach, to validate the effectiveness of our dynamic weighting strategy.

DAM-random: we replace the weighting model with a random function producing values in [0, 1].

DAM-jaccard: we use the Jaccard similarity between the positive and negative responses as the weight.

DAM-embedding [7]: we use the cosine similarity between the representations of the positive and negative responses as the weight. For the DAM model, we take the average hidden state over all words in a response as its representation.
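
The DAM-jaccard weight, for instance, can be computed as below (whitespace tokenization is an assumption on our part):

```python
def jaccard_similarity(pos_response, neg_response):
    """Token-set Jaccard similarity between the two responses,
    used directly as the instance weight in the DAM-jaccard variant."""
    a = set(pos_response.split())
    b = set(neg_response.split())
    union = a | b
    return len(a & b) / len(union) if union else 0.0
```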

#### 5.3.2 Model-based Method

We consider the following methods, which change the computation of the measure in Equation 3 by substituting other similar models for our complementary model.

DAM-last-WM replaces the multi-turn last-utterance selection model with a single-turn last-utterance selection model. This method is used to validate the contribution of context information in the last-utterance selection model.

DAM-DAM replaces the last-utterance selection model with a response selection model; we utilize a DAM model to produce the two probabilities in Equation 3.

DAM-dual is a primal-dual approach: the response selection model is the primal model and the last-utterance selection model is the dual model, and the two learn instance weights for each other as in Equation 2.

#### 5.3.3 Result Analysis

Table 2 reports the results of these variations on the Douban dataset. First, most of the variants outperform the DAM model, demonstrating that instance weighting strategies are effective for training on noisy data. Among them, DAM-WM achieves the best results on all three evaluation metrics, indicating that our proposed method is the most effective. Second, the heuristic methods yield smaller improvements than the model-based methods. A possible reason is that neural networks have stronger semantic capacity, so the weights they produce better distinguish noise in the training data. Third, the heuristic methods perform worse than DAM-uniform, indicating that Jaccard similarity and representation cosine similarity are not proper instance weighting functions and even harm the response selection model.

Moreover, all the model-based methods achieve similar results on all three metrics and outperform the DAM model, indicating that they are effective but not as powerful as our proposed method. For DAM-DAM, a possible reason is that it cannot provide a more useful signal for this task than our method. For DAM-last-WM, its last-utterance selection model only utilizes the last utterance and therefore cannot select the positive last-utterance confidently. (The last-utterance selection model of DAM-WM obtains 0.846 in P@1, while that of DAM-last-WM obtains only 0.526; since the ratio of positives to negatives in the test data is 1:9, its discrimination is noisy and of low confidence.) For DAM-dual, we observe that the dual-learning approach does not improve the last-utterance selection task; the reason may be that response selection and last-utterance selection are not an appropriate dual task pair, or that our dual-learning setup is not proper. We will conduct further investigation to find an appropriate dual-learning approach for this task.

### 5.4 Case Study

| Weight | 0.0 | 1.0 |
|---|---|---|
| 1st Utterance | Girls shouldn’t be too thin, so I gain weight successfully. | You can make a Urban Poster. |
| 2nd Utterance | I am 1.63 meters tall and about 94 kilos, is it too thin? | Nice idea. |
| Last Utterance | It is just in the right places. | Hello, online celebrity. |
| Pos Response | I am small boned and look thinner, so the people around me always laugh at me. ( ) | I’m not online celebrity. ( ) |
| Neg Response | Haha, I think so. ( ) | If you carry too many things, please think over again. ( ) |

Table 3: Two training instances with the minimum (0.0) and maximum (1.0) weights.

Previously, we have shown the effectiveness of our method. In this section, we qualitatively analyze why our method can yield good performance.

We calculate the weights of all instances in the training data of the Douban dataset, and select instances with the maximum and minimum weights (1.0 and 0.0 respectively). We present some of them in Table 3 and annotate them manually. The first case receives a weight of 0.0, which indicates that our last-utterance selection model identifies it as an inappropriate negative instance. The second case receives a weight of 1.0, and its positive and negative responses can be clearly distinguished. This case study shows that our instance weighting method can identify false negative samples and punish them with smaller weights.

## 6 Conclusion and Future Work

Previous studies mainly focus on neural architectures for multi-turn retrieval-based dialogue systems but neglect the fundamental problem of noisy training data. In this paper, we proposed a novel learning approach that effectively reduces the influence of noisy data. We utilize a complementary task to learn weights for the training instances used by the main task, and the main task is then fine-tuned with a weight-enhanced margin-based loss. This approach forces the model to focus on more confident training instances. Experimental results on two public datasets demonstrate the effectiveness of our proposed method. As future work, we will design other instance weighting methods to detect noise in the open-domain multi-turn response selection task, and we will consider combining our approach with more learning paradigms such as dual learning and adversarial learning.

## Acknowledgement

This work was partially supported by the National Natural Science Foundation of China under Grant No. 61872369 and 61832017, the Fundamental Research Funds for the Central Universities, and Beijing Outstanding Young Scientist Program under Grant No. BJJWZYJH012019100020098, and Beijing Academy of Artificial Intelligence (BAAI).

## References

- [1] (2019) Learning a matching model with co-teaching for multi-turn response selection in retrieval-based dialogue systems. ArXiv abs/1906.04413. Cited by: §1, §1, §2, §2, §4.2, §4.3.1.
- [2] (2004) Semi-supervised learning by entropy minimization. In NIPS, Cited by: §2.
- [3] (2014) An information retrieval approach to short text conversation. ArXiv abs/1408.6988. Cited by: §1, §2.
- [4] (2007) Instance weighting for domain adaptation in nlp. In ACL, Cited by: §2.
- [5] (2014) Adam: a method for stochastic optimization. CoRR abs/1412.6980. Cited by: §5.1.2.
- [6] (2015) A diversity-promoting objective function for neural conversation models. In HLT-NAACL, Cited by: §1, §2.
- [7] (2017) Not all dialogues are created equal: instance weighting for neural conversational models. In SIGDIAL, Cited by: §1, §2, §4.2, §5.3.1.
- [8] (2007) Class noise mitigation through instance weighting. In ECML, Cited by: §2.
- [9] (2015) Building end-to-end dialogue systems using generative hierarchical neural network models. In AAAI, Cited by: §2.
- [10] (2018) Learning to converse with noisy data: generation with calibration. In IJCAI, Cited by: §2.
- [11] (2015) A neural conversational model. CoRR abs/1506.05869. Cited by: §2.
- [12] (2016) Match-srnn: modeling the recursive matching structure with spatial rnn. ArXiv abs/1604.04378. Cited by: §5.1.3.
- [13] (2015) Learning natural language inference with lstm. ArXiv abs/1512.08849. Cited by: §5.1.3.
- [14] (2018) Iterative learning with open-set noisy labels. CVPR. Cited by: §1.
- [15] (2018) Learning matching models with weak supervision for response selection in retrieval-based chatbots. In ACL, Cited by: §1, §2.
- [16] (2016) Sequential matching network: a new architecture for multi-turn response selection in retrieval-based chatbots. In ACL, Cited by: §1, §2, §3, §4.3.1, §5.1.2, §5.1.2, §5.1.3, Table 1.
- [17] (2016) Dual learning for machine translation. In NIPS, Cited by: §1.
- [18] (2016) Learning to respond with deep neural networks for retrieval-based human-computer conversation system. In SIGIR, Cited by: §2, §3, §5.1.3.
- [19] (2018) Modeling multi-turn conversation with deep utterance aggregation. In COLING, Cited by: §5.1.1.
- [20] (2019) Unsupervised context rewriting for open domain conversation. In EMNLP-IJCNLP 2019, Hong Kong, China, November 3-7, 2019, pp. 1834–1844. Cited by: §2.
- [21] (2016) Multi-view response selection for human-computer conversation. In EMNLP, Cited by: §2, §5.1.3.
- [22] (2018) Multi-turn response selection for chatbots with deep attention matching network. In ACL, Cited by: §1, §2, §4.3.1, §5.1.2, §5.1.2, §5.1.3, §5.3.
