Testing Machine Translation via Referential Transparency

ICSE2021 Conference
Hi everyone, I am Pinjia He. In this talk I will introduce our work, "Testing Machine Translation via Referential Transparency."

Machine translation is widely used in our daily lives. We have many machine translation systems, such as Google Translate and Bing Microsoft Translator, and machine translation is also used in industry. For example, eBay sells to more than 20 countries that use different languages. eBay has more than 800 million listings, each with around 300 words, and it needs to provide translations for all of them. This is not an easy job: eBay estimates that 1,000 translators working for five years could finish only 60 million listings for a single language setting, not to mention that many of those listings are updated frequently. Thus eBay relies heavily on machine translation.

Machine translation can already return high-quality results, but sometimes it returns incorrect translations. Here is an example. A few years ago, at the Winter Olympic Games in Korea, the Norwegian team intended to order 1,500 eggs for their athletes, so they asked Google Translate for help. In the end they received a full truck of eggs, and they found out this was caused by a translation error: their text had been translated as 15,000 eggs. So incorrect translations can lead to misunderstanding, unpleasant experiences, or even financial loss.

How do we get rid of this kind of translation error and make translation software more robust? What eBay does is rely heavily on regular expressions. For example, consider a translation of the plural form of an acronym from English to French. As background knowledge, the plural form of an acronym should not have an "x" at the end of the acronym in French; however, machine translators sometimes make this error. To detect this kind of grammar error, eBay uses this
regular expression to automatically extract potential errors that have an "x" at the end of the acronym. We can see that this approach is highly limited by the developers' domain knowledge, and because they need to write many such rules, it is also very labor-intensive.

So we asked: is it possible to design a fully automated approach that reports general translation errors? That is what we have done in this work. Using only 200 source sentences, without any labels, we found 265 erroneous translations in Google Translate and Bing Microsoft Translator.

Our proposed technique is called RTI, short for "testing via referentially transparent inputs." Its core idea is that referentially transparent text should have similar translations when used in different contexts. It was inspired by a concept from functional programming: there, a function is called referentially transparent if its return value is the same for a given input, even when it is used in different contexts. For example, assume we have a function square that returns the square of its input, and we give it 2 as the input. We get 4; after a few minutes we get 4; and after another few minutes we still get 4.
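The functional-programming notion the speaker describes can be sketched in a few lines of Python (an illustrative example of mine, not code from the paper): `square` is referentially transparent, while a function whose result depends on the clock is not.

```python
import time

def square(x):
    # Referentially transparent: the result depends only on the input,
    # so square(2) can be replaced by 4 anywhere without changing behavior.
    return x * x

def stamped(x):
    # NOT referentially transparent: the result also depends on the
    # current clock, so two calls with the same input can differ.
    return (x, time.time())

# square is stable across calls ("different contexts"):
assert square(2) == 4
assert square(2) == 4

# stamped is not: repeated calls with the same argument disagree.
a = stamped(2)
time.sleep(0.01)
b = stamped(2)
print(a == b)  # False: the timestamps differ
```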
So the return value is the same for a given input when the function is called in different contexts. By contrast, consider a time function that returns the current timestamp. If we call it in different contexts, it returns different results, so time is not referentially transparent.

We then asked whether we could find something similar in natural language. In our work we define a referentially transparent input (RTI) as a noun phrase within a specific length range, and we extract RTIs from arbitrary text. For example, assume we have the sentence "Holmes in the movie based on Bad Blood" and we want to extract an RTI. We first use a constituency parser to get the tree structure, then find the noun phrases in the tree and select those within a specific length range; here we select the noun phrase "the movie based on Bad Blood." This RTI is paired with the original containing text to form an RTI pair, and we collect the translations of both from the software under test.

Based on our idea, we then compare the translations of the RTI in these two pieces of text and check whether they are similar, or the same. In this example, the translations of the RTI differ by three characters, so the distance is three. If the distance is larger than a threshold, the pair is reported as a suspicious issue. From this issue, a developer can find out that the first text was translated incorrectly: it came out as "Holmes' blood becomes bad based on a movie."

Our approach can report diverse kinds of translation errors, including under-translation, over-translation, word/phrase mistranslation, incorrect modification, and unclear logic, and I will give you two examples. First, under-translation means that some part of the source sentence is missing from the target sentence. Here, "the almost anxiety-provoking magnitude of data" was translated by Google as "the most anxiety-
provoking data," so "the magnitude of" was not translated, leading to an under-translation error. Unclear logic means that all phrases and words are correctly translated but the logic of the whole sentence is incorrect. For example, "approval on two separate occasions" was translated as "approve two separate occasions," leading to an unclear-logic error.

Next, let's talk about the precision of our approach, which is the percentage of reported issues that contain errors. If we have five reported issues and three of them contain errors, the precision is 60%. Compared with existing approaches, ours has the highest precision, around 80%, in all cases.

The last question is whether the erroneous translations found by RTI are different. The answer is yes. In a Venn diagram of the erroneous translations reported by the different approaches, there is only a very small overlap in the middle, and RTI finds and reports more erroneous translations than the others.

Last but not least, we have released all the source code of this paper in a repository on GitHub; feel free to take a look. Thank you.

Host: Hello everyone, and welcome to the last talk in this session. Thank you, Pinjia, for your talk. If anybody has questions, as usual, please type them in the chat; I would like to start with a question. Language translation is a very difficult problem, as you know very well, and one of the main problems, as you said, is context. The Venn diagram you showed on slide 19 shows that RTI detects a large number of errors the other approaches miss, but there are also many that it does not detect. So I'm wondering: in real practice, how do you expect your approach to be used? Do you expect it to be used in conjunction with other methods? What do you
recommend?

Pinjia: Thanks for your question. I would expect the approach to be used in conjunction with others. RTI is very good at detecting erroneous translations of phrases, while the other approaches are better at detecting errors in whole sentences, so in practice we can use them together.

Host: There's a question from the audience, from Leima: when a mistranslation is detected and found, how can your technique be used for model enhancement, or how could it be used by the developer?

Pinjia: This is a great question. It could be used in multiple ways. First, our approach reports a list of suspicious issues, so developers get this list of reported cases; if they find that some of them are urgent or critical, they can simply hard-code fixes in their translator. Second, the remaining cases can be used to fine-tune the model, which will also help improve it.

Host: While other questions come in, I can ask another question. You said earlier that you would expect your technique to be used in conjunction with other methods. Would you expect performance to be a concern, since it is yet another diagnostic that would have to be run? How fast is RTI?

Pinjia: Thanks for your question. Our approach runs pretty fast. We have four sets of experiments, around 100 sentences each, for Google and the other systems, and for each of them the processing finishes within one minute.

Host: Okay, let's see if more questions come in; let me remind people to ask their questions. In the meantime, from a research perspective, what was the most difficult aspect of your research in this particular project?

Pinjia: Thanks for the question. I think the most difficult step in this project was to find the RTIs, that is, pieces of text that
should have similar, or invariant, translations in different contexts. That's the most difficult part, because usually there are multiple valid translations for the same source phrase or sentence. We use RTIs, noun phrases within a specific length range, to tackle this challenge.

Host: I see, that makes sense. There is a question, but I'm not sure whether it is about these types of systems in general or specifically related to this presentation; let me ask for clarification. If you could clarify your question a little: are you talking about testing translation systems in particular? In the meantime, a question from Leima: is there any relationship between the deep-learning model architecture and the detected issues?

Pinjia: Thanks. For the first one: the work that has been done in the NLP community on testing translation systems is mainly adversarial example generation. The main difference between their approaches and ours is that most of theirs generate sentences that are syntactically or semantically incorrect; for example, from "I like basketball" they might generate a garbled variant, which may cause a mistranslation. What we focus on instead is generating syntactically and semantically correct sentences, such as "I like football" or "I like volleyball."

Host: Leima added some clarification to the question: are there specific insights into why a model architecture would fail in a particular case? I imagine this is in the context of language
translation.

Pinjia: Actually, we haven't gone into depth exploring the internal details of the models, but there are some papers in the NLP field that try to build a connection between the internal parameters and specific kinds of translation errors. If you're interested, we can discuss this later and I can share some links with you.

Host: So I guess the question in general, since there is some discussion around this, is: do you think RTI could be used for other NLP tasks?

Pinjia: The answer is yes. For example, you could use it in speech recognition. Our core idea is that a piece of text should keep the same meaning when used in different sentences, and you can find similar invariants in speech recognition and related tasks for generating tests.

Host: Okay, let's see if there are any other questions... The other question is related to the same thing. Let's wait a couple more minutes... Okay, Pinjia, I think we've run out of questions, so thank you very much for your time. People are welcome to stay and talk to Pinjia or to move to the discussion room; if not, see you later during the conference. Hope to see you all, take care.

Pinjia: Thank you. Thank you all.

Pinjia (in the discussion room): Your question, maybe, is about the difference between the work in the NLP community and in the SE community, right? What people in the NLP field mostly do is try to generate adversarial examples, so most of the sentences or examples they report are actually syntactically or semantically incorrect, and what we have done instead is generate syntactically and semantically correct test inputs.
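As a closing note, the comparison step at the heart of the RTI pipeline, flagging an RTI pair when the phrase's translation diverges too much from how it appears inside the translated sentence, can be sketched in Python. This is a minimal sketch under my own assumptions: the translation calls and the constituency parser are omitted (in practice you would query Google/Bing and extract noun phrases with a parser), and `levenshtein` and `is_suspicious` are hypothetical helper names, not the paper's API.

```python
def levenshtein(a: str, b: str) -> int:
    """Character-level edit distance between two strings (standard DP)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def is_suspicious(t_phrase: str, t_sentence: str, threshold: int = 2) -> bool:
    """Flag an RTI pair when the phrase translation does not appear
    (approximately) inside the sentence translation.  A real implementation
    would align substrings properly; here we take the minimum distance over
    all same-length windows of the sentence."""
    n = len(t_phrase)
    if n == 0:
        return False
    best = min(levenshtein(t_phrase, t_sentence[i:i + n])
               for i in range(max(1, len(t_sentence) - n + 1)))
    return best > threshold

# Toy illustration with made-up "translations":
print(is_suspicious("un film", "j'ai vu un film hier"))   # False: phrase intact
print(is_suspicious("un film", "j'ai vu quelque chose"))  # True: phrase missing
```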
