Error annotation MT with source text

BramVanroy · November 10, 2021, 2:17pm

I've been going through some of the pre-existing recipes and I absolutely love the tool's ease of use, the visuals and the customization options. In my research, I am interested in error/suggestion annotation of (machine) translation. You typically annotate words/spans of the translation and label them with a category from an existing hierarchy (similar to the span categorization recipe). A potential addition is letting the annotator add a comment on top of the category label to explain their choice. Those two things I can figure out myself, I think.

The potentially harder issue that I am faced with is incorporating the source text in the annotation scheme. Some errors are not mistakes in the target language (like grammatical errors) but are wrong because they are not a correct translation of the source. It would therefore be incredibly useful to be able to link the labeled translation span to a span in the source text. This would require a couple of things, and this topic is to ask whether it is at all feasible to build this myself within Prodigy (or whether there are plans to have such functionality in the future). I can think of the following:

access to two sentences in the interface, the source sentence and the target sentence;
ideally the option to have (read) access to other sentences in the same document;
the ability to mark spans in both source/target (already possible with span categorization);
the ability to link two spans to each other across source/target sentences;
a useful way to export this information.

Linking spans to each other also seems useful for entity linking and other coreference use-cases. I guess that the hardest part would be to incorporate both a source and target sentence in a single annotation instance.

If this is not feasible within Prodigy because that's simply outside its scope, I completely understand!

ines · November 14, 2021, 10:32am

Hi! One quick solution if you're working with one sentence at a time could be to just combine both into a single "text" for annotation purposes only and then store the original input and output texts, as well as the character offsets in the underlying JSON. This way you're able to map the annotations on the "text" back to the original sentences. If you don't expect the spans to overlap, you could then use the relations UI to perform the linking: Annotation interfaces · Prodigy · An annotation tool for AI, Machine Learning & NLP
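To make the idea a bit more concrete, here is a minimal sketch in plain Python. It assumes the usual task shape with a "text" and a list of "spans" carrying "start", "end" and "label". The "source" and "target" keys, the newline separator, the label names and both helper functions are hypothetical names chosen for illustration, not part of Prodigy.

```python
# Assumption: a separator that never occurs inside a single sentence.
SEPARATOR = "\n"


def make_task(source: str, target: str) -> dict:
    """Combine source and target into one "text" and record where each part sits."""
    text = source + SEPARATOR + target
    target_start = len(source) + len(SEPARATOR)
    return {
        "text": text,
        # Hypothetical custom keys holding the original texts and their
        # character offsets inside the combined "text".
        "source": {"text": source, "start": 0, "end": len(source)},
        "target": {"text": target, "start": target_start, "end": len(text)},
    }


def split_spans(task: dict) -> dict:
    """Map spans annotated on the combined text back to the sentence they fall in."""
    result = {"source": [], "target": []}
    for span in task.get("spans", []):
        for side in ("source", "target"):
            part = task[side]
            # A span belongs to a side if it lies entirely within that part.
            if part["start"] <= span["start"] and span["end"] <= part["end"]:
                local = dict(span)
                local["start"] = span["start"] - part["start"]
                local["end"] = span["end"] - part["start"]
                result[side].append(local)
    return result


if __name__ == "__main__":
    task = make_task("Ik heb honger.", "I have hunger.")
    # Pretend an annotator marked one span in each sentence (made-up labels).
    s = task["text"].index("honger")
    t = task["text"].index("hunger")
    task["spans"] = [
        {"start": s, "end": s + 6, "label": "SOURCE_SPAN"},
        {"start": t, "end": t + 6, "label": "MISTRANSLATION"},
    ]
    print(split_spans(task))
```

Because the offsets travel with each task, the split can be done after export without re-tokenizing or guessing where the target sentence begins.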

BramVanroy · November 15, 2021, 8:01am

Combining the two sentences is indeed how I have done it up to now with another tool. With that tool, post-processing the annotations is tedious and splitting them again is not straightforward. I foresee that this will be easier with Prodigy's JSON, indeed!

The relations UI cannot be used, unfortunately, because we expect the spans to overlap: the error categories are hierarchical, so annotations may overlap. Thanks for the reply, though! I'll have to continue my search.
