Welcome to the forum, @jess.b.lee!
Glad to hear you've enjoyed working with Prodigy so far.
As for your question 1) (the missing labels): could you share the full annotated example? Surely it's not just `{"text": "1234567"}`, is it? The NER labels should be stored under the `spans` key, and they should be linked back to the text via token offsets. If you can share the whole example, I can help you find the right information.
Also, it's not entirely clear to me what your current input to `ner.manual` is. Are these the original PDFs processed via OCR, or otherwise converted into a text format?
As to the general strategy, it very much depends on the PDFs you are working with. If the placement of the information on the document is a strong cue for the category, you might have an image classifier in your pipeline that outputs the relevant regions. This is what Prodigy-PDF could help with. Since you want to output text eventually, the next component in your pipeline should convert these relevant regions to text: this can be as easy as scraping with something like PyPDF2 (you can see how to integrate such scraping as a Prodigy loader in this post), or you might need to resort to OCR (one option is available via the `pdf.ocr.correct` recipe in the Prodigy-PDF plugin). Finally, given that your categories are generally regex-friendly, you could add a component that matches the text against patterns to boost precision.
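To illustrate that last component, here's a minimal sketch of a rule-based matcher that turns regex hits into Prodigy-style spans. The pattern strings are only examples of the kind of thing you'd write; you'd tune them to how these fields actually appear in your documents:

```python
import re

# Hypothetical per-label patterns -- adjust to your documents.
PATTERNS = {
    "INVOICE_NUMBER": re.compile(r"\bINV-\d{6}\b"),
    "DATE": re.compile(r"\b\d{4}-\d{2}-\d{2}\b"),
}

def rule_spans(text):
    """Return Prodigy-style span dicts for every pattern match."""
    spans = []
    for label, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            spans.append(
                {"start": match.start(), "end": match.end(), "label": label}
            )
    # Sort by position so the spans read left to right.
    return sorted(spans, key=lambda s: s["start"])

print(rule_spans("Invoice INV-004217, due 2023-05-01."))
```

Because the output uses the same `start`/`end`/`label` shape as Prodigy's `spans`, you can feed these matches back in as pre-annotations to correct, rather than labelling from scratch.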
Alternatively, you could try converting the PDFs to text, evaluating the quality of the conversion, and using that as input to `ner.manual` to train a NER model. Training a NER model only makes sense if you work with entities in context, which is why you'd need entire PDFs or bigger relevant regions. Still, given the nature of the categories you're after, I'd definitely combine the NER model with rules. The rules should help with precision (after all, these numbers will all look similar and may be ambiguous to the model), and NER should be useful when context matters, e.g. distinguishing between INVOICE_DATE and DUE_DATE. You could have some categories covered by the model and some by the patterns, or you could use the patterns to correct the model output. It's really a matter of experimenting to see what works best for your kind of data. In any case, even if you opt for rules only, make sure you have a nice development set so you can measure the effects of the rules as you develop them.
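On that last point, the measurement itself can be very simple: compare the spans your rules predict against gold spans on the development set. A minimal sketch (the span tuples below are hypothetical):

```python
def score_rules(predicted, gold):
    """Micro precision/recall of predicted spans against gold spans.

    Both arguments are lists of (start, end, label) tuples; a span only
    counts as correct if offsets and label all match exactly.
    """
    pred_set, gold_set = set(predicted), set(gold)
    true_pos = len(pred_set & gold_set)
    precision = true_pos / len(pred_set) if pred_set else 0.0
    recall = true_pos / len(gold_set) if gold_set else 0.0
    return precision, recall

# Example: the rule got the invoice number right but mislabelled the date.
gold = [(8, 18, "INVOICE_NUMBER"), (24, 34, "DUE_DATE")]
pred = [(8, 18, "INVOICE_NUMBER"), (24, 34, "DATE")]
print(score_rules(pred, gold))  # -> (0.5, 0.5)
```

Re-running a check like this after every change to the rules tells you immediately whether a new pattern helped or just traded precision for recall.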