When I use Google Translate (English to German or vice versa) on the Prodigy web interface while annotating data, the tool does not pick up the entity. When I switch the page back to its original language, it works as expected.
Below are the screenshots:
- Translating from English -> German (entity is not picked up):
- Translating back to the original version, i.e. from German to English (entity is picked up fine):
Is there a way around this, or could support for it be added in a future version?
It would be very helpful for annotating data when the user is not fluent in the language: they could translate the page to annotate it, then switch back to the original language so that the model can be trained on the text as is.
Hi! It looks like this might be related to how the browser substitutes the text when you run Google Translate. It's all plain text in the interface, so I'm surprised the browser messes this up. (From your screenshot, it also looks like it's translating every word individually, instead of the whole text in context, so the German text is pretty useless.)
That said, we'd strongly recommend against using a workflow like this and translating texts during annotation. It's going to be difficult to collect high-quality data this way, because the entity boundaries aren't always going to map cleanly between languages, and the specifics of the language are going to matter a lot for the annotation decisions. If annotators are using Google Translate, you'll also have no idea what people saw during annotation: none of this is going to be recorded in the underlying data. So annotators might be basing their decisions on something you can't control. (If you really want to use translated text, you should stream in the actual translated text, so you have a record of it, instead of relying on the browser to do it live. But again, I really wouldn't recommend that.)
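To make the "stream in the actual translated text" idea a bit more concrete, here's a minimal sketch. It assumes your tasks are dicts with a "text" key; the `original_text` key, the `with_translations` helper and the `fake_translate` lookup are all made up for this example and stand in for whatever translation step you'd run ahead of time.

```python
from typing import Callable, Dict, Iterable, Iterator


def with_translations(
    tasks: Iterable[Dict], translate: Callable[[str], str]
) -> Iterator[Dict]:
    """Yield copies of the tasks whose "text" is the translated text.

    The untranslated text is kept under "original_text" (a hypothetical key
    name), so there's a record of both versions in the data.
    """
    for task in tasks:
        new_task = dict(task)  # shallow copy, leave the incoming task untouched
        new_task["original_text"] = task["text"]
        new_task["text"] = translate(task["text"])
        yield new_task


# Hypothetical stand-in for a real translation step (an API call, a model, ...).
FAKE_TRANSLATIONS = {"My father lives in Berlin.": "Mein Vater lebt in Berlin."}


def fake_translate(text: str) -> str:
    # Fall back to the untranslated text if there is no entry for it.
    return FAKE_TRANSLATIONS.get(text, text)


source = [{"text": "My father lives in Berlin."}]
tasks = list(with_translations(source, fake_translate))
```

You could then write these tasks out to a JSONL file and load that as your source, so the text annotators see is exactly what's stored with the annotations.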
Thanks for the reply!
I understand that this isn't ideal and isn't really recommended, but I have a unique use case. The issue is that in the Prodigy web app (as in the screenshot), if I mark 'father', it is annotated as 'Person'. But when I translate the page to German and then highlight 'Vater', it is not annotated or tagged as 'Person'.
So I wanted to ask whether there is a way around this, or whether a future version could support annotation after browser translation.
NOTE: The translation is only there so that a person who does not have a good command of the language can annotate the words. The annotations will later be used in the original language and format to train an NER pipeline in spaCy.
The underlying problem here isn't really related to how Prodigy works, and it's not something we can easily work around: the tagging of the entities happens on the server, based on the original text. So the data you see in the app was generated by the server, using spaCy.
When you translate it in your browser using Google Translate, the translation browser extension will scrape the page and extract all texts, send that to Google Translate and replace it visually. All of this happens on the client. In the process of this, it seems to mess up the HTML markup used to display the entities. In order to display the tokens and make them selectable, they're wrapped in HTML elements. The translation feature also doesn't seem to handle that well, so it translates every word individually, which is why the translation is so useless.
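Here's a toy sketch of why per-token markup can lead to word-by-word translation. This is not Prodigy's actual markup or Google Translate's actual behavior: it just assumes a translator that handles each text node on its own, and the two lookup tables are hypothetical stand-ins for a translation service.

```python
# Hypothetical lookup tables standing in for a machine translation service.
SENTENCES = {"I miss my father": "Ich vermisse meinen Vater"}
WORDS = {"I": "Ich", "miss": "vermisse", "my": "mein", "father": "Vater"}


def translate_node(node: str) -> str:
    """Translate one text node in isolation, without seeing its neighbours."""
    if node in SENTENCES:
        return SENTENCES[node]
    return WORDS.get(node, node)


def translate_page(text_nodes: list) -> str:
    """Translate every text node separately and join the results."""
    return " ".join(translate_node(node) for node in text_nodes)


text = "I miss my father"

# Plain text: the whole sentence is a single text node.
whole = translate_page([text])

# Token markup: every token sits in its own element, so it is its own node.
per_token = translate_page(text.split())

print(whole)      # Ich vermisse meinen Vater
print(per_token)  # Ich vermisse mein Vater
```

With one node per token, the translator never sees "my" next to "father", so it can't pick the right inflection.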
Like I said, this is really not something we'd recommend. If you do not have a good command of the language, you probably shouldn't be creating training data for it. There are many details of a language that can get lost during machine translation, and a good translation doesn't necessarily map 1:1 to the original text. So if your goal is to annotate boundary-sensitive spans (like NER), you can very easily end up with very low-quality data.
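As a small illustration of the boundary problem, here's a sketch using a made-up sentence pair. It assumes spans are stored as character offsets into the text, and shows that offsets selected on the translation don't point at the same thing in the original.

```python
# Made-up example pair: the original text and a translation of it.
original = "My father lives in Berlin."
translated = "Mein Vater lebt in Berlin."

# Character offsets of the span an annotator would highlight in the translation.
start = translated.index("Vater")
end = start + len("Vater")

span_in_translation = translated[start:end]
span_in_original = original[start:end]

print(repr(span_in_translation))  # 'Vater'
print(repr(span_in_original))     # 'ther '
```

So even in a simple case like this, the offsets would need to be re-aligned to the original text before they could be used for training, and that alignment is exactly where things tend to go wrong.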