Error annotation of MT with source text

I've been going through some of the pre-existing recipes and I absolutely love the tool's ease of use, the visuals and the customization options. In my research, I am interested in error/suggestion annotation of (machine) translation. You typically annotate words/spans of the translation and label them with a category from an existing hierarchy (similar to the span categorization recipe). A potential addition is allowing a comment on top of the category label, so the annotator can provide some more information about their choice. I think I can figure those two things out myself.

The potentially harder issue I'm facing is incorporating the source text in the annotation scheme. Some errors are not mistakes in the target language itself (like grammatical errors) but are wrong because they do not correctly translate the source. It would therefore be incredibly useful to be able to link the labeled translation span to a span in the source text. This would require a couple of things, and I'm opening this topic to ask whether it is feasible at all to build this myself within Prodigy (or whether there are plans to have such functionality in the future). I can think of the following:

  • access to two sentences in the interface, the source sentence and the target sentence;
  • ideally the option to have (read) access to other sentences in the same document;
  • the ability to mark spans in both source/target (already possible with span categorization);
  • the ability to link two spans to each other across source/target sentences;
  • a useful way to export this information.
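
To make the last point more concrete, here is a rough sketch of what an exported record could look like, written as plain Python dataclasses serialized to one JSON line. Everything in it is hypothetical: the class names (`Span`, `Link`, `Record`), the field names, the example category label and the example sentence pair are made up for illustration and are not an existing Prodigy format.

```python
import json
from dataclasses import dataclass, asdict, field
from typing import List, Optional


@dataclass
class Span:
    side: str                      # "source" or "target" (hypothetical field)
    start: int                     # character offset within that sentence
    end: int                       # exclusive end offset
    label: Optional[str] = None    # hierarchical error category, if any
    comment: Optional[str] = None  # free-text annotator comment


@dataclass
class Link:
    source_span: int  # index into Record.spans
    target_span: int  # index into Record.spans


@dataclass
class Record:
    source: str
    target: str
    spans: List[Span] = field(default_factory=list)
    links: List[Link] = field(default_factory=list)

    def to_jsonl(self) -> str:
        # asdict() converts nested dataclasses (also inside lists) to dicts
        return json.dumps(asdict(self), ensure_ascii=False)


# Made-up example: a literal translation flagged in the target and
# linked to the source span it came from.
record = Record(
    source="Ik heb honger.",
    target="I have hunger.",
    spans=[
        Span("source", 7, 13),
        Span("target", 7, 13, label="accuracy/mistranslation",
             comment="too literal"),
    ],
    links=[Link(source_span=0, target_span=1)],
)
line = record.to_jsonl()
```

Because links refer to spans by index rather than by position, overlapping spans on either side would not be a problem in a format like this.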

Linking spans to each other also seems useful for entity linking and other coreference use-cases. I guess that the hardest part would be to incorporate both a source and target sentence in a single annotation instance.

If this is not feasible within Prodigy because that's simply outside its scope, I completely understand!

Hi! One quick solution, if you're working with one sentence pair at a time, could be to combine the source and target into a single "text" for annotation purposes only, and to store the original source and target texts, as well as their character offsets within the combined text, in the underlying JSON. This way you're able to map the annotations on the "text" back to the original sentences. If you don't expect the spans to overlap, you could then use the relations UI to perform the linking (see the annotation interfaces page in the Prodigy docs).
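
A minimal sketch of that combine-and-map-back idea, in plain Python. The `"text"` key and the `"start"`/`"end"`/`"label"` span keys follow the usual task format; the extra keys (`"source"`, `"target"`, `"offsets"`), the newline separator and the helper names `combine` and `map_span` are assumptions made for this example, not built-in Prodigy features.

```python
SEPARATOR = "\n"  # assumption: any separator works if its length is known


def combine(source, target, sep=SEPARATOR):
    """Build one annotation task from a source/target sentence pair."""
    target_start = len(source) + len(sep)
    return {
        "text": source + sep + target,
        # Custom keys, kept only so annotations can be mapped back later
        "source": source,
        "target": target,
        "offsets": {
            "source": [0, len(source)],
            "target": [target_start, target_start + len(target)],
        },
    }


def map_span(task, span):
    """Translate a span on the combined text to sentence-local offsets."""
    for side, (start, end) in task["offsets"].items():
        if span["start"] >= start and span["end"] <= end:
            return {
                "side": side,
                "start": span["start"] - start,
                "end": span["end"] - start,
                "label": span.get("label"),
            }
    raise ValueError("span crosses the source/target boundary")


task = combine("Das ist gut.", "That is good.")
# A span as it might come back from annotation on the combined text
annotated = {"start": 21, "end": 25, "label": "EXAMPLE"}
local = map_span(task, annotated)
```

Here `local` points at the target sentence with sentence-local offsets, so `task["target"][local["start"]:local["end"]]` recovers the annotated substring.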

Combining the two sentences is indeed how I have done it up to now with another tool. With that tool, post-processing the annotations is tedious, and splitting them back into source and target is not straightforward. I expect this to be easier with Prodigy's JSON, indeed!

Unfortunately, the relations UI cannot be used, because we expect the spans to overlap: the error categories are hierarchical, so one annotation may sit inside or partly cover another. Thanks for the reply, though! I'll have to continue my search.