Welcome to the forum, @jess.b.lee!
Glad to hear you've enjoyed working with Prodigy so far.
As for your question 1) (the missing labels): could you share the full annotated example? Surely it's not just `{"text": "1234567"}`, is it? The NER labels should be stored under the `spans` key, and they should be linked back to the text via token offsets. If you can share the whole example, I can help you find the right information.
Also, it's not entirely clear to me what your current input to `ner.manual` is. Are these the original PDFs processed via OCR, or otherwise converted into a text format?
As to the general strategy, it very much depends on the PDFs you are working with. If the placement of the information on the document is a strong cue for the category, you might have an image classifier in your pipeline that outputs the relevant regions. This is what Prodigy-PDF could help with. Since you want to output text eventually, the next component in your pipeline should convert these relevant regions to text: this can be as easy as scraping with something like PyPDF2 (you can see how to integrate such scraping as a Prodigy loader in this post), or you might need to resort to OCR (one option is available via the `pdf.ocr.correct` recipe in the Prodigy-PDF plugin). Finally, given that your categories are generally regex-friendly, you could add a component that matches the text against patterns to boost precision.
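To illustrate that last component, here's a minimal sketch of a rule-based matcher that turns regex hits into Prodigy-style spans. The pattern strings are only examples of the kind of thing you'd write; you'd tune them to how these fields actually appear in your documents:

```python
import re

# Hypothetical per-label patterns -- adjust to your documents.
PATTERNS = {
    "INVOICE_NUMBER": re.compile(r"\bINV-\d{6}\b"),
    "DATE": re.compile(r"\b\d{4}-\d{2}-\d{2}\b"),
}

def rule_spans(text):
    """Return Prodigy-style span dicts for every pattern match."""
    spans = []
    for label, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            spans.append(
                {"start": match.start(), "end": match.end(), "label": label}
            )
    # Sort by position so the spans read left to right.
    return sorted(spans, key=lambda s: s["start"])

print(rule_spans("Invoice INV-004217, due 2023-05-01."))
```

Because the output uses the same `start`/`end`/`label` shape as Prodigy's `spans`, you can feed these matches back in as pre-annotations to correct, rather than labelling from scratch.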
Alternatively, you could try converting the PDFs to text, evaluating the quality of the conversion, and using that as input to `ner.manual` to train a NER model. Training a NER model only makes sense if you work with entities in context, which is why you'd need entire PDFs or bigger relevant regions. Still, given the nature of the categories you're after, I'd definitely combine the NER model with rules. The rules should help with precision (after all, these numbers will all look similar and may be ambiguous to the model), and NER should be useful when context matters, e.g. distinguishing between INVOICE_DATE and DUE_DATE. You could have some categories covered by the model and some by the patterns, or you could use the patterns to correct the model output. It's really a matter of experimenting to see what works best for your kind of data. In any case, even if you opt for rules only, make sure you have a nice development set so you can measure the effects of the rules as you develop them.
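On that last point, the measurement itself can be very simple: compare the spans your rules predict against gold spans on the development set. A minimal sketch (the span tuples below are hypothetical):

```python
def score_rules(predicted, gold):
    """Micro precision/recall of predicted spans against gold spans.

    Both arguments are lists of (start, end, label) tuples; a span only
    counts as correct if offsets and label all match exactly.
    """
    pred_set, gold_set = set(predicted), set(gold)
    true_pos = len(pred_set & gold_set)
    precision = true_pos / len(pred_set) if pred_set else 0.0
    recall = true_pos / len(gold_set) if gold_set else 0.0
    return precision, recall

# Example: the rule got the invoice number right but mislabelled the date.
gold = [(8, 18, "INVOICE_NUMBER"), (24, 34, "DUE_DATE")]
pred = [(8, 18, "INVOICE_NUMBER"), (24, 34, "DATE")]
print(score_rules(pred, gold))  # -> (0.5, 0.5)
```

Re-running a check like this after every change to the rules tells you immediately whether a new pattern helped or just traded precision for recall.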