Labelling / Annotating data affected by pre-processing

hjjg · September 3, 2019, 2:17pm

Hey folks,

we wrote some code to extract text from specific documents (e.g. PDF) and now we want to label the data with Prodigy. Easy task. The problem I see is that the extraction step is not perfect and will change many times in the future. Letter-spacing problems, for example, produce strings like "H eadline", with a space between "H" and "eadline". This kind of error depends on the tool used to extract the text, and there may be others.

If I label the data based on the current state of pre-processing, the labels become worthless as soon as I change the processing pipeline. The same goes for changes in spaCy's tokenizer.

Is there a clever way to get around that? Maybe tell Prodigy to store the line number and plain text of each entity in the JSONL file?

Thanks in advance,

hjjg

honnibal · September 4, 2019, 2:45pm

It's sort of a hassle, but you could make sure that whenever something edits the text, you store an offset mapping between the characters before and after the edit. That way, you'll always be able to fix the alignment.
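As a rough sketch of what recording offsets at edit time could look like: the hypothetical helper below (`collapse_whitespace` is an illustrative name, not part of Prodigy or spaCy) collapses whitespace runs and remembers, for every character it keeps, where that character came from in the raw text.

```python
import re

def collapse_whitespace(raw):
    """Collapse each whitespace run to one space and record offsets.

    Returns (clean, origin), where origin[k] is the index in `raw`
    of the character that ended up at clean[k].
    """
    clean_chars = []
    origin = []
    # Each match is either a whole whitespace run or one other character.
    for match in re.finditer(r"\s+|\S", raw):
        origin.append(match.start())
        chunk = match.group()
        clean_chars.append(" " if chunk.isspace() else chunk)
    return "".join(clean_chars), origin

raw = "Annual \n report"
clean, origin = collapse_whitespace(raw)
print(clean)      # Annual report
print(origin[7])  # 9, the position of "r" in the raw text
```

If `origin` is stored next to the cleaned text, an annotation made on `clean` can be traced back to the raw extraction and re-projected after the cleaning step changes.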

Another approach: if the changes can only affect whitespace, you can always recalculate the alignment and use it to remap the character offsets of your stand-off annotations.
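A minimal sketch of the whitespace-only case, assuming spans are (start, end) character offsets with an exclusive end. The function names are made up for illustration. The idea is that the non-whitespace characters are identical in both versions, so they can be paired up in order.

```python
def whitespace_offset_map(old, new):
    """Map the index of each non-whitespace char in `old` to its index in `new`."""
    old_idx = [i for i, ch in enumerate(old) if not ch.isspace()]
    new_idx = [j for j, ch in enumerate(new) if not ch.isspace()]
    if [old[i] for i in old_idx] != [new[j] for j in new_idx]:
        raise ValueError("texts differ in more than whitespace")
    return dict(zip(old_idx, new_idx))

def remap_span(start, end, mapping):
    """Project an old (start, end) span onto the new text, or return None."""
    # Only non-whitespace characters have an entry in the mapping.
    inside = [i for i in range(start, end) if i in mapping]
    if not inside:
        return None  # the span covered whitespace only
    return mapping[inside[0]], mapping[inside[-1]] + 1

old = "H eadline of the report"
new = "Headline of the report"
mapping = whitespace_offset_map(old, new)
new_start, new_end = remap_span(0, 9, mapping)  # "H eadline" in the old text
print(new[new_start:new_end])  # Headline
```

The `ValueError` is deliberate: if the texts differ in anything other than whitespace, this exact approach no longer applies and an approximate alignment is needed instead.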

In the worst case, you can use an approximation like a Levenshtein alignment to compute a best guess. This would probably work fine in most cases.
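For a best-guess alignment without extra dependencies, here is a sketch that uses `difflib.SequenceMatcher` from the Python standard library. Note that this is a stand-in and not a true Levenshtein alignment: it finds matching blocks rather than a minimum edit script, which is usually close enough for this purpose. The function name is hypothetical.

```python
from difflib import SequenceMatcher

def approximate_offset_map(old, new):
    """Return a list where entry i is the best-guess position in `new`
    of character offset i in `old` (with one extra entry for the end)."""
    mapping = [0] * (len(old) + 1)
    matcher = SequenceMatcher(None, old, new, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        for i in range(i1, i2):
            if tag == "equal":
                mapping[i] = j1 + (i - i1)
            else:
                # Replaced or deleted text: a guess, clamped to the new block.
                mapping[i] = min(j1 + (i - i1), j2)
    mapping[len(old)] = len(new)
    return mapping

old = "H eadline: Q3 results"
new = "Headline: Q3 results"
mapping = approximate_offset_map(old, new)
start, end = mapping[0], mapping[9]  # old span (0, 9) is "H eadline"
print(new[start:end])  # Headline
```

Spans that start or end inside a replaced region only get an approximate position, so it is probably worth flagging those for manual review.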

In all of these situations, what you're trying to do is create a mapping table keyed by the old character offsets, where the value is the new character offset. This lets you project the annotations onto the new string.
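To illustrate the projection step, here is a sketch that applies such a mapping table to a task with "text" and "spans" (each span having "start", "end" and "label"), which is the layout Prodigy uses for NER annotations. The function name and the hand-written `offset_map` are assumptions for the example. Any other span keys, such as token indices, are dropped here because they would no longer be valid after re-extraction.

```python
def project_spans(example, new_text, offset_map):
    """Project spans onto `new_text` using {old_offset: new_offset}.

    Returns (new_example, dropped), where `dropped` holds the spans
    whose boundaries could not be mapped.
    """
    projected, dropped = [], []
    for span in example.get("spans", []):
        start = offset_map.get(span["start"])
        end = offset_map.get(span["end"])
        if start is None or end is None or start >= end:
            dropped.append(span)  # keep these for manual review
            continue
        projected.append({"start": start, "end": end, "label": span["label"]})
    new_example = {**example, "text": new_text, "spans": projected}
    return new_example, dropped

old_example = {
    "text": "H eadline today",
    "spans": [{"start": 0, "end": 9, "label": "TITLE"}],
}
# Hand-written mapping for illustration: old offset -> new offset.
offset_map = {0: 0, 9: 8, 10: 9, 15: 14}
new_example, dropped = project_spans(old_example, "Headline today", offset_map)
span = new_example["spans"][0]
print(new_example["text"][span["start"]:span["end"]])  # Headline
```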
