Hey folks,
we wrote some code to extract text from specific documents (e.g. PDFs), and now we want to label the data with Prodigy. Easy task. The problem I see is that the extraction step is not perfect and will change many times in the future. For example, there are problems with letter spacing that result in strings like "H eadline", with a space between "H" and "eadline". This is specific to the tool used to extract the text, and there might be more issues like it.
If I label the data based on the current state of pre-processing, the labels become worthless as soon as I change the processing pipeline. The same goes for changes in spaCy's tokenizer.
Is there a clever way to get around that? Maybe tell Prodigy to store the line number and the plain text of each entity in the JSONL file?
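To make the idea concrete, here is a rough sketch of what I have in mind. It assumes the annotations are available as character-offset spans (a dict with "text" and a list of "spans", each with "start", "end" and "label"); the function names spans_to_portable and realign are made up for illustration and are not part of Prodigy or spaCy.

```python
def spans_to_portable(task):
    """Turn character-offset spans into (line, entity text, label) records."""
    text = task["text"]
    portable = []
    for span in task.get("spans", []):
        portable.append({
            # 0-based index of the line the entity starts on
            "line": text.count("\n", 0, span["start"]),
            "text": text[span["start"]:span["end"]],
            "label": span["label"],
        })
    return portable


def realign(portable, new_text):
    """Re-anchor the records on re-extracted text by searching the same line."""
    lines = new_text.split("\n")
    # Character offset at which each line starts in new_text
    line_starts, pos = [], 0
    for line in lines:
        line_starts.append(pos)
        pos += len(line) + 1  # +1 for the newline
    spans = []
    for rec in portable:
        if rec["line"] >= len(lines):
            continue  # line no longer exists; needs manual review
        idx = lines[rec["line"]].find(rec["text"])
        if idx == -1:
            continue  # entity text itself changed; needs manual review
        start = line_starts[rec["line"]] + idx
        spans.append({"start": start,
                      "end": start + len(rec["text"]),
                      "label": rec["label"]})
    return spans


# Hypothetical example: the extraction bug "H eadline" is fixed later on
old_text = "Intro\nH eadline about Berlin"
start = old_text.index("Berlin")
old_task = {"text": old_text,
            "spans": [{"start": start, "end": start + 6, "label": "GPE"}]}
records = spans_to_portable(old_task)
new_spans = realign(records, "Intro\nHeadline about Berlin")
```

This only works while the line numbering stays stable and the entity text itself is not affected by the extraction fix, and it picks the first match if the same string occurs twice on a line, so it would probably need some fuzzy matching on top.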
Thanks in advance,
hjjg