New to Prodigy, but loving it so far... thank you for this tool!
The company I work for has a business problem involving data extraction from PDFs. Because of the volume and variance of the PDFs, the automation of the extraction process has not been fully fleshed out. Whether I was crazy to take on this enormous task is yet to be determined.
I was really thankful when I came across the Prodigy annotation solution and was hoping this would help solve my use case.
(This is likely going to be a two-part question; I will try to add as much detail as possible.)
My first attempt at annotating used the Prodigy-PDF plugin. Here is a recipe similar to the one I used:
prodigy pdf.image.manual pdf_text path/to/dir/pdfs --label ACCOUNT_NAME,ACCOUNT_NUMBER,INVOICE_DATE,INVOICE_NUMBER,SERVICE_TYPE,TOTAL_AMOUNT_DUE,DUE_DATE,ADDRESS,SERVICE_ID,PHONE_NUM,VENDOR --remove-base64
Even with that flag, the image data kept being stored in the JSONL file. My output was not usable, and I tried a number of different .py scripts to cleanse and format the file, with no luck removing all the garbage characters.
So my current method is ner.manual (I do eventually want to train a model to extract the data itself from a new invoice directory). Here is the recipe that I am currently using:
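A minimal sketch of the kind of cleanup meant here, assuming each exported record keeps the page image as a base64 data URI under an `image` key. The function names and file paths are hypothetical, and this is not part of Prodigy itself:

```python
import json


def strip_base64_images(records):
    """Yield copies of the records with inline base64 image data removed.

    Assumption: the image is stored as a data URI string under "image".
    Records whose "image" is a plain path or URL are left untouched.
    """
    for record in records:
        cleaned = dict(record)
        image = cleaned.get("image")
        if isinstance(image, str) and image.startswith("data:"):
            del cleaned["image"]
        yield cleaned


def clean_jsonl(in_path, out_path):
    """Read a JSONL file, strip inline images, and write a new JSONL file."""
    with open(in_path, encoding="utf-8") as src:
        records = [json.loads(line) for line in src if line.strip()]
    with open(out_path, "w", encoding="utf-8") as dst:
        for record in strip_base64_images(records):
            dst.write(json.dumps(record, ensure_ascii=False) + "\n")
```

Calling `clean_jsonl("annotations.jsonl", "annotations_clean.jsonl")` (both file names are placeholders) would write a copy of the export without the embedded image strings.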
prodigy ner.manual extract_text en_core_web_sm path/to/jsonl --label ACCOUNT_NAME,ACCOUNT_NUMBER,INVOICE_DATE,INVOICE_NUMBER,SERVICE_TYPE,TOTAL_AMOUNT_DUE,DUE_DATE,ADDRESS,SERVICE_ID,PHONE_NUM
My output JSONL file is getting better: the format looks more json-y, but the labels are still not attached to the values. Here is a sample of my output:
{"text": "1234567"}
{"text": "7654321"}
{"text": "March 21, 2024"}
{"text": "February 20, 2024"}
I would love to see this as the output, matching the way that I am annotating the data in the GUI:
{"ACCOUNT NUMBER": "1234567"}
{"INVOICE NUMBER": "7654321"}
{"DUE DATE": "March 21, 2024"}
{"INVOICE DATE": "February 20, 2024"}
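A rough sketch of the conversion I have in mind, assuming the annotated records exported from the dataset carry a `text` field plus a `spans` list with `start`, `end`, and `label` character offsets. The helper name and the sample record are made up for illustration:

```python
def spans_to_fields(example):
    """Map each annotated label to the text values highlighted for it.

    Values are collected in lists because one label can be annotated
    more than once in the same text.
    """
    text = example["text"]
    fields = {}
    for span in example.get("spans", []):
        value = text[span["start"]:span["end"]]
        fields.setdefault(span["label"], []).append(value)
    return fields


# Hypothetical annotated record, shaped like an NER annotation with spans.
example = {
    "text": "Account 1234567 Invoice 7654321 due March 21, 2024",
    "spans": [
        {"start": 8, "end": 15, "label": "ACCOUNT_NUMBER"},
        {"start": 24, "end": 31, "label": "INVOICE_NUMBER"},
        {"start": 36, "end": 50, "label": "DUE_DATE"},
    ],
    "answer": "accept",
}

fields = spans_to_fields(example)
```

If the records I am looking at contain no `spans` key at all, that would suggest I am reading the input file rather than the saved annotations.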
As promised, the 2 questions:
1. Does this have to do with an issue with my .prodigy.json file? Here is how the file looks:
{
  "split_sents": false,
  "custom_theme": {
    "labels": [
      "ACCOUNT_NAME",
      "ACCOUNT_NUMBER",
      "INVOICE_DATE",
      "INVOICE_NUMBER",
      "SERVICE_TYPE",
      "TOTAL_AMOUNT_DUE",
      "DUE_DATE",
      "ADDRESS",
      "SERVICE_ID",
      "PHONE_NUM",
      "VENDOR"
    ],
    "db": "postgres",
    "db_settings": {
      "mysql": {
        "user": "user name",
        "password": "user password",
        "host": "host name",
        "port": 3306,
        "database": "prodigydb",
        "ssl": {
          "ssl": {
            "ssl-ca": "certificate.crt.pem"
          }
        }
      }
    }
  }
}
2. Would you recommend a different process for my use case or do you believe that I am on the correct path?
Thank you so much!
jess.b.lee