LABELS showing as TXT in DB-Output JSONL && PDF-Prodigy Approach

New to Prodigy, but loving it so far... thank you for this tool!

The company I work for has a business problem that involves extracting data from PDFs. Because of the volume and variance of the PDFs, the automation of the extraction process has not been fully fleshed out. Whether I was crazy enough to take on this enormous task is yet to be determined. :stuck_out_tongue_winking_eye:

I was really thankful when I came across the Prodigy annotation solution and was hoping this would help solve my use case.

(This is likely going to be a two-part question; I will try to add as much detail as possible.)

My first attempt at annotating used Prodigy-PDF. Here is a recipe similar to the one I used:

prodigy pdf.image.manual pdf_text path/to/dir/pdfs --label ACCOUNT_NAME,ACCOUNT_NUMBER,INVOICE_DATE,INVOICE_NUMBER,SERVICE_TYPE,TOTAL_AMOUNT_DUE,DUE_DATE,ADDRESS,SERVICE_ID,PHONE_NUM,VENDOR --remove-base64

I still had a ton of issues with image data continuing to be stored in the JSONL file. My output was not usable, and I tried a bunch of different .py scripts to cleanse and format the file, with no luck removing all the 'garbage-y characters'.

So the method I am currently using is ner.manual (I do eventually want to train a model to extract the data itself from a new invoice directory). Here is the recipe that I am currently using:

prodigy ner.manual extract_text en_core_web_sm path/to/jsonl --label ACCOUNT_NAME,ACCOUNT_NUMBER,INVOICE_DATE,INVOICE_NUMBER,SERVICE_TYPE,TOTAL_AMOUNT_DUE,DUE_DATE,ADDRESS,SERVICE_ID,PHONE_NUM

My output JSONL file is looking better and the format is more json-y, but the labels are still not attached to the values. Here is a sample of my output:

{"text": "1234567"}
{"text": "7654321"}
{"text": "March 21, 2024"}
{"text": "February 20, 2024"}

I would love to see this as the output, matching the way that I am annotating the data in the GUI:

{"ACCOUNT NUMBER": "1234567"}
{"INVOICE NUMBER": "7654321"}
{"DUE DATE": "March 21, 2024"}
{"INVOICE DATE": "February 20, 2024"}

As promised, the 2 questions:
1. Does this have to do with an issue with my .prodigy.json file? Here is how the file looks:

{
"split_sents": false,
"custom_theme": {
"labels": ["ACCOUNT_NAME",
"ACCOUNT_NUMBER",
"INVOICE_DATE",
"INVOICE_NUMBER",
"SERVICE_TYPE",
"TOTAL_AMOUNT_DUE",
"DUE_DATE",
"ADDRESS",
"SERVICE_ID",
"PHONE_NUM",
"VENDOR"],

"db": "postgres",
"db_settings": {
    "mysql": {
        "user": "user name",
        "password": "user password",
        "host": "host name",
        "port": 3306,
        "database": "prodigydb",
        "ssl": {
            "ssl": {
                "ssl-ca": "certificate.crt.pem"
                }
            }
        }
    }
}

}

2. Would you recommend a different process for my use case or do you believe that I am on the correct path?

Thank you so much!
jess.b.lee

Welcome to the forum @jess.b.lee :wave:

Glad to hear you've enjoyed working with Prodigy so far :slight_smile:

As for your question 1 (the missing labels): could you share the full annotated example? Surely it's not just {"text": "1234567"}, is it? :thinking: The NER labels should be stored under the spans key, and they should be linked back to the text via character and token offsets. If you can share the whole example, I can help find the right information.
Also, it was not entirely clear to me what your current input to ner.manual is. Are these the original PDFs processed via OCR or otherwise converted into a text format?
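
To illustrate how the labels can be recovered from the spans, here is a minimal sketch. It assumes each exported line carries a "text" field plus a "spans" list whose entries have "start", "end" and "label" keys; the sample line and the helper name spans_to_fields are made up for illustration and are not taken from your data.

```python
import json


def spans_to_fields(example):
    """Turn each annotated span into a {label: text} dict using character offsets."""
    text = example["text"]
    return [
        {span["label"]: text[span["start"]:span["end"]]}
        for span in example.get("spans", [])
    ]


# Hypothetical annotated example, shaped like one line of an exported JSONL file
line = json.dumps({
    "text": "Account 1234567 is due March 21, 2024",
    "spans": [
        {"start": 8, "end": 15, "label": "ACCOUNT_NUMBER"},
        {"start": 23, "end": 37, "label": "DUE_DATE"},
    ],
    "answer": "accept",
})

fields = spans_to_fields(json.loads(line))
print(fields)
```

If your exported lines really contain nothing but "text", the annotations were probably not saved with the examples you are looking at, which is why the full example would help.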

As to the general strategy, it very much depends on the PDFs you are working with. If the placement of the information on the document is a strong cue for the category, then you might have an image classifier in your pipeline that outputs the relevant regions. This is what Prodigy-PDF could help with. Since you eventually want to output text, the next component in your pipeline should convert these relevant regions to text: this can be as easy as scraping with something like PyPDF2 (you can see how to integrate such scraping as a Prodigy loader in this post), or you might need to resort to OCR (one option is available via the pdf.ocr.correct recipe in the Prodigy-PDF plugin). Finally, given that your categories are generally regex-friendly, you could add a component that matches the text against patterns to boost precision.
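
A pattern-matching component of that kind could look roughly like the sketch below. The patterns, the sample sentence and the function name match_patterns are all hypothetical and would need tuning to your real invoices. Note that a date pattern on its own cannot tell the two date categories apart, so the sketch uses a generic DATE label.

```python
import re

# Hypothetical patterns; real invoices will need formats tuned to the actual data.
PATTERNS = {
    "DATE": re.compile(
        r"\b(?:January|February|March|April|May|June|July|August|"
        r"September|October|November|December) \d{1,2}, \d{4}\b"
    ),
    "PHONE_NUM": re.compile(r"\b\d{3}[-. ]\d{3}[-. ]\d{4}\b"),
    "TOTAL_AMOUNT_DUE": re.compile(r"\$\d{1,3}(?:,\d{3})*\.\d{2}"),
}


def match_patterns(text):
    """Return (label, start, end, matched_text) for every pattern hit, in text order."""
    hits = []
    for label, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            hits.append((label, match.start(), match.end(), match.group()))
    return sorted(hits, key=lambda hit: hit[1])


sample = "Invoice date: February 20, 2024. Total due: $1,234.56. Call 555-123-4567."
hits = match_patterns(sample)
for label, start, end, value in hits:
    print(label, start, end, value)
```

The character offsets returned here line up with the "start" and "end" fields of annotated spans, so rule output and annotations can be compared directly.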
Alternatively, you could try converting the PDFs to text, evaluate the quality of that text, and use it as input to ner.manual to train a NER model. Training a NER model only makes sense if you work with entities in some context, which is why you'd need entire PDFs or bigger relevant regions. Still, given the nature of the categories you're after, I'd definitely combine the NER model with rules. The rules should help with precision (after all, these numbers will all look similar and may be ambiguous to the model), and NER should be useful when context matters, e.g. when distinguishing between INVOICE_DATE and DUE_DATE. You could have some categories covered by the model and some by the patterns, or you could use the patterns to correct the model output. It's really a matter of experimenting to see what works best for your kind of data. In any case, even if you opt for rules only, make sure you have a nice development set so you can measure the effects of the rules as you develop them.
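
Measuring rules against a development set can be as simple as comparing predicted spans with gold spans. The sketch below assumes spans are represented as (start, end, label) tuples and that only exact matches count as correct; the data and the function name precision_recall are invented for illustration.

```python
def precision_recall(gold, predicted):
    """Compute exact-match precision and recall over (start, end, label) spans."""
    gold, predicted = set(gold), set(predicted)
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


# Hypothetical development example: gold annotations vs. the output of the rules
gold = [(8, 15, "ACCOUNT_NUMBER"), (23, 37, "DUE_DATE")]
predicted = [(8, 15, "ACCOUNT_NUMBER"), (23, 37, "INVOICE_DATE")]

p, r = precision_recall(gold, predicted)
print(f"precision={p:.2f} recall={r:.2f}")
```

In this made-up case the rule finds the right date span but assigns the wrong date label, which is the kind of error where context from a NER model may help.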