Labelling / Annotating data affected by pre-processing

Hey folks,

we wrote some code to extract text from specific documents (e.g. PDF) and now we want to label the data with Prodigy. Easy task. The problem I see is that the extraction step is not perfect and will change many times in the future. For example, there are letter-spacing problems that result in strings like "H eadline", with a space between "H" and "eadline". This is specific to the tool used to extract the text, and there might be more issues like it.

If I label the data based on the current state of the pre-processing, the labels become worthless as soon as I change the processing pipeline. The same goes for changes within spaCy's tokenizer.

Is there a clever way to get around that? Maybe tell Prodigy to store the line number and plain text of the entity in the JSONL file?

Thanks in advance,


It's sort of a hassle, but you could make sure that whenever something edits the text, you also store an offset mapping between the character positions before and after the edit. That way, you'll always be able to fix the alignment.
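
To make that concrete, here is a minimal sketch of an edit step that returns the offset mapping along with the new text. The function name `apply_edits` and the `(start, end, replacement)` edit format are made up for illustration; your extraction code will look different, so treat this as one possible shape rather than a recipe.

```python
def apply_edits(text, edits):
    """Apply edits and record where every old offset ends up.

    `edits` is a hypothetical format: sorted, non-overlapping
    (start, end, replacement) tuples referring to offsets in `text`.
    Returns (new_text, old_to_new), where old_to_new has len(text) + 1
    entries so that span end offsets can be looked up too.
    """
    out = []
    old_to_new = [0] * (len(text) + 1)
    pos = 0       # cursor in the old text
    new_len = 0   # length of the new text built so far
    for start, end, replacement in edits:
        # Unchanged characters keep their relative position.
        for i in range(pos, start):
            old_to_new[i] = new_len
            new_len += 1
        out.append(text[pos:start])
        # Every character in the replaced span maps to the start
        # of its replacement.
        for i in range(start, end):
            old_to_new[i] = new_len
        out.append(replacement)
        new_len += len(replacement)
        pos = end
    # Copy whatever follows the last edit.
    for i in range(pos, len(text)):
        old_to_new[i] = new_len
        new_len += 1
    out.append(text[pos:])
    old_to_new[len(text)] = new_len
    return "".join(out), old_to_new


old_text = "H eadline here"
new_text, old_to_new = apply_edits(old_text, [(1, 2, "")])  # drop the stray space
# A span over "H eadline" (old offsets 0 to 9) now covers "Headline".
print(new_text[old_to_new[0]:old_to_new[9]])
```

If you save `old_to_new` next to each version of the text, annotations made on the old version can be moved over later without guessing.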

Another approach: if the changes can only affect whitespace, you can always recalculate the alignment from the two versions of the text and use it to remap the character offsets of the stand-off annotations.
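
A rough sketch of the whitespace-only case, assuming both versions contain exactly the same non-whitespace characters in the same order. The helper names `whitespace_alignment` and `remap_span` are invented for this example.

```python
def whitespace_alignment(old, new):
    """Map the offset of each non-whitespace char in `old` to its offset in `new`."""
    old_idx = [i for i, c in enumerate(old) if not c.isspace()]
    new_idx = [j for j, c in enumerate(new) if not c.isspace()]
    if [old[i] for i in old_idx] != [new[j] for j in new_idx]:
        raise ValueError("texts differ in more than whitespace")
    return dict(zip(old_idx, new_idx))


def remap_span(start, end, mapping):
    """Move an old (start, end) span to new offsets; None if it held only whitespace."""
    inside = [i for i in range(start, end) if i in mapping]
    if not inside:
        return None
    # The span runs from its first to its last non-whitespace character.
    return mapping[inside[0]], mapping[inside[-1]] + 1


old = "H eadline of the re port"
new = "Headline of the report"
mapping = whitespace_alignment(old, new)
start = old.index("re port")
new_start, new_end = remap_span(start, start + len("re port"), mapping)
print(new[new_start:new_end])
```

The `ValueError` is deliberate: if the texts differ in anything other than whitespace, this approach no longer applies and it is better to fail loudly than to shift labels silently.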

In the worst case, you can use an approximation like a Levenshtein alignment to calculate a best guess. This would probably work fine in most cases.
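
As a sketch of the best-guess route, the example below uses `difflib.SequenceMatcher` from the Python standard library. Note that this is not a true Levenshtein alignment (it is based on longest matching blocks), but it produces the same kind of edit operations and needs no extra dependency. The function name `approximate_offset_map` is made up here.

```python
from difflib import SequenceMatcher


def approximate_offset_map(old, new):
    """Best-guess table: entry i is the offset in `new` for offset i in `old`."""
    old_to_new = [0] * (len(old) + 1)
    # autojunk=False turns off the "popular character" heuristic, which
    # would otherwise kick in on texts of 200+ characters.
    matcher = SequenceMatcher(None, old, new, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        # "insert" opcodes have i1 == i2, so this loop skips them.
        for i in range(i1, i2):
            if tag == "equal":
                old_to_new[i] = j1 + (i - i1)
            else:
                # "replace" or "delete": stay inside the matching new range.
                old_to_new[i] = min(j1 + (i - i1), j2)
    old_to_new[len(old)] = len(new)
    return old_to_new


old = "H eadline: Annual re port"
new = "Headline: Annual report"
table = approximate_offset_map(old, new)
start = old.index("re port")
end = start + len("re port")
print(new[table[start]:table[end]])
```

Since this is only a guess, it is worth comparing the projected span text against the original span text (ignoring whitespace, say) and flagging mismatches for manual review.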

In all of these situations, what you're trying to do is create a mapping table keyed by the old character offsets, where the value is the new character offset. This lets you project the annotations onto the new string.
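
Once you have such a table, projecting the annotations is a few lines. The sketch below assumes tasks shaped like Prodigy's JSONL records, with a `"text"` field and `"spans"` carrying `"start"`, `"end"` and `"label"`. The function name `project_spans` and the tiny hand-written table are only for illustration.

```python
def project_spans(task, new_text, old_to_new):
    """Return a copy of `task` with its spans moved onto `new_text`.

    `old_to_new` is a list with len(task["text"]) + 1 entries that maps
    each old character offset to its new offset.
    """
    projected = []
    for span in task.get("spans", []):
        start = old_to_new[span["start"]]
        end = old_to_new[span["end"]]
        if start >= end:
            continue  # the annotated text no longer exists in new_text
        projected.append({**span, "start": start, "end": end})
    return {**task, "text": new_text, "spans": projected}


old_task = {
    "text": "H ead",
    "spans": [{"start": 0, "end": 5, "label": "TITLE"}],
}
# Hand-written table for "H ead" -> "Head" (the space at offset 1 was removed).
table = [0, 1, 1, 2, 3, 4]
new_task = project_spans(old_task, "Head", table)
print(new_task["spans"])
```

If your spans also carry token indices (for example `token_start` and `token_end`), those are not touched here and would have to be recomputed after re-tokenizing the new text.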