ACL 2021: Do Context-Aware Translation Models Pay the Right Attention?

Kayo Yin

Hello, I'm Kayo, and today I will present our work that tries to answer the question "Do Context-Aware Translation Models Pay the Right Attention?"

To begin, why do we need context during translation? Let's take a look at this example. What does the word "mole" in this sentence refer to? If the previous sentence was "Things could start to get dangerous if the ministers find out", then the mole probably refers to a spy. But if the previous sentence was "Could it be anything serious, Doctor?", the mole refers to a birthmark instead. So the meaning of the word changes depending on context, and therefore its translation depends on the context as well.

Current Neural Machine Translation models are reasonably good at sentence-level translation for high-resource language pairs, such as English and French. So, for example, a popular provider can correctly translate the sentence here. However, when the context changes, the correct translation should now be "ce grain de beauté" for the mole, but the model does not pick up on the change in meaning. So current models often fail to produce adequate translations at the document level when there are ambiguous words that require context to resolve. Another example is the translation of the neutral English pronoun "they": here, "they" refers to "implications", which is a feminine noun in French, so the pronoun should be the feminine "elle" instead of the masculine "il".

To address the difficulties in document-level translation and the importance of context, several methods have been proposed over the last four or five years to incorporate context into Neural Machine Translation. But even with the necessary context, these models perform poorly on translating relatively simple discourse phenomena, such as anaphoric pronouns. In this example, "it" refers to "report", which is a masculine noun in French, so the pronoun should be the masculine "il". To find out why the model made this error, let's take a look at which tokens the model paid the most attention to, highlighted in yellow here. We can see that the model pays high attention to the word "infirmary", which is a feminine noun when translated into French, but it does not pay attention to "report" or "rapport", which would have helped it translate the pronoun accurately. This may explain why the model made this error.

In general, context-aware machine translation models have been found to often attend to uninformative tokens in context, or not to use the information contained in the context at all. We therefore ask ourselves the following research questions. First, in context-aware translation, what context is useful to disambiguate hard translations such as ambiguous pronouns or word senses? Second, are context-aware machine translation models paying attention to the relevant context or not? Third, if not, can we encourage them to do so?

First, we conducted a user study to collect the supporting context words that human translators use for disambiguation. We asked 20 professional English-French translators to select the correct translation and then highlight all the supporting context words that they used to answer. We performed this study for two tasks: first, pronoun anaphora resolution, where the translator chooses the correct gendered French pronoun associated with a neutral English pronoun; then word-sense disambiguation, where the translator chooses the French translation of a polysemous English word.

We gave translators varying amounts of the previous sentences on the English source side and/or the French target side as context, and we analyzed when translators are able to answer accurately and with high confidence depending on how much and what context was given. We also analyzed the supporting context words selected by translators, looking at where these words are: are they in the current sentence or three sentences before? Are they English source words or French target words? And what are their features, such as part of speech and syntactic dependencies? You can look at our paper for the full analysis and results.

Our main findings are that for pronouns, the previous context sentences are the most useful, especially on the target side, and we find that humans especially rely on the pronoun's antecedent, or in other cases on other references to the pronoun on the target side. The same coreference chain on the English side is not as useful, because the chain in French can carry information about gender whereas in English it does not.

For word-sense disambiguation, the current sentence in either language is often sufficient. For example, "charme" in French means the quality of being charming, while "porte-bonheur" is a good-luck charm. We find that humans often use words that indicate the role or meaning of the polysemous word. Moreover, the source and target side often carry an equal amount of the semantic load used for word-sense disambiguation, which is why either side seems to be as useful.

After our user study, we also annotated the supporting context for 14,000 examples of pronoun anaphora resolution in English-French, and we release the SCAT dataset.

Next, to evaluate whether models pay attention to the relevant context, we quantify how much model attention is aligned with SCAT. For our experiments, we use the standard Transformer translation model, but instead of taking the data sentence by sentence as in sentence-level translation, we incorporate the five previous source and target sentences as context by concatenating them to the current sentence, which is then fed into the model. We use 14 million parallel English-French sentences from the OpenSubtitles dataset for training.

To quantify the alignment between human and model attention, we construct vectors that represent the SCAT annotations and the model attention while translating the ambiguous pronoun. Taking this SCAT vector and the model attention vector, we first sort the tokens by decreasing model attention weight, then look for the rank of the first supporting context token from SCAT in the sorted vector. In this example, the alignment score is 2; the more attention the model assigns to the supporting context, the lower the alignment score. We also used two additional alignment metrics that you can find in our paper :)

Using this metric, we compare the alignment score of a uniform distribution with the alignment score of our model attention. For the model attention, we measure alignment with SCAT for the encoder self-attention, the decoder cross-attention, and the decoder self-attention. We find that the alignment between the encoder self-attention and SCAT is slightly better than that of a uniform distribution, but attention in the decoder layers especially has very low alignment. In general, context-aware translation models do not seem to pay attention to the relevant context.

We therefore use SCAT to try to increase the model-human alignment. We train a context-aware model on OpenSubtitles with the standard negative log-likelihood loss.
We additionally sample from SCAT during training and introduce an attention regularization loss to supervise the model attention.
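The rank-based alignment score described above can be sketched as follows. Here `attention` and `scat_mask` are hypothetical stand-ins for the model's attention weights over the context tokens and the binary SCAT annotation; they are not names from the paper's released code.

```python
def alignment_score(attention, scat_mask):
    """Rank (1-indexed) of the most-attended token that a human
    marked as supporting context; lower means better alignment."""
    # Sort token indices by decreasing model attention weight.
    ranked = sorted(range(len(attention)), key=lambda i: -attention[i])
    for rank, idx in enumerate(ranked, start=1):
        if scat_mask[idx]:  # first SCAT-annotated token in the sorted order
            return rank
    return None  # no supporting context annotated for this example

# Toy example: the second most-attended token is supporting context.
attn = [0.1, 0.5, 0.3, 0.1]
mask = [0, 0, 1, 0]
print(alignment_score(attn, mask))  # → 2
```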
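One plausible form of such attention supervision is a cross-entropy term between the (normalized) human annotation and the model's attention distribution; this is a minimal sketch of that idea, and the exact loss used in the paper may differ. The weighting factor `lam` is a hypothetical hyperparameter.

```python
import math

def attention_regularization(attention, scat_mask, eps=1e-9):
    """Cross-entropy between a uniform distribution over the
    human-annotated supporting tokens and the model's attention."""
    total = sum(scat_mask)
    if total == 0:
        return 0.0  # no annotation to supervise with
    target = [m / total for m in scat_mask]  # uniform over supporting tokens
    return -sum(t * math.log(a + eps)
                for t, a in zip(target, attention) if t > 0)

# The regularizer is added to the usual translation loss, e.g.:
#   loss = nll_loss + lam * attention_regularization(attn, mask)
```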

We measure model performance using corpus-level BLEU and COMET. However, words such as ambiguous pronouns represent only a small portion of all words in the data, so corpus-level metrics such as BLEU and COMET may not clearly capture improvements in translating discourse phenomena, which are nonetheless very important for document-level translation. We therefore also compute the mean word F-measure of the translations of ambiguous pronouns with respect to the reference pronouns, and we perform contrastive evaluation, where we measure how often the model assigns a higher probability to the correct translation than to a translation where the ambiguous pronoun is incorrect.
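The contrastive evaluation above can be sketched like this; `scored_pairs` is a hypothetical interface holding the model's log-probabilities for each (correct, incorrect-pronoun) translation pair, not an artifact from the paper's code.

```python
def contrastive_accuracy(scored_pairs):
    """Fraction of examples where the model scores the correct
    translation above its incorrect-pronoun contrastive variant.
    Each element is a (logprob_correct, logprob_contrastive) tuple."""
    wins = sum(1 for good, bad in scored_pairs if good > bad)
    return wins / len(scored_pairs)

# Toy example: the model prefers the correct translation in 2 of 3 cases.
pairs = [(-1.2, -3.4), (-2.0, -1.5), (-0.7, -2.2)]
print(contrastive_accuracy(pairs))
```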

We find that attention regularization improves translation across all metrics, especially the metrics targeted at pronouns. We can conclude that regularizing attention with SCAT effectively improves ambiguous pronoun translation. We also find that models with attention regularization obtain better attention alignment with SCAT. We can also see that the model with attention regularization assigns higher attention to the words "report" and "rapport" in this example while translating the ambiguous pronoun, and is then able to translate the pronoun correctly. This suggests that attention regularization with SCAT can encourage models to pay the right attention and thus allow them to translate ambiguous words correctly.

Our paper contains more experiments that demonstrate that models with attention regularization with SCAT rely more on the supporting context selected by humans, and that regularizing the encoder self-attention gives the largest improvement in translation performance compared to regularizing other types of model attention. Performance on word-sense disambiguation does not improve much when we supervise the model attention using human rationales for pronoun anaphora resolution.

To summarize, we asked humans to tell us what context is useful to translate ambiguous words, and we collected a corpus of 14,000 supporting context annotations. We then used the SCAT dataset to measure alignment between human and model attention, and we found that previous context-aware models have very low alignment. We therefore used SCAT to regularize attention in context-aware translation models, obtaining better model-human alignment, better context usage, and better translation quality. You can find more information on our work in our paper, along with the code and data, which are publicly available, and we thank you for your attention :)
