Medical PDF model

victorsandu · October 27, 2023, 6:44pm

Hello,

First-time user here. I'm wondering whether anyone has much experience with PDF NER models.
I am attempting to create a model to extract these custom labels: Patient First Name, Patient Last Name, Date of Birth, Date of Document Creation, and Author of the Document. The dataset would be unstructured PDFs with clinical information sent to a doctor's office. The purpose is essentially to take incoming scanned PDFs and classify them into specific patient charts.

Is the best way to do this to annotate a PDF project from scratch, or is there a specific model that anyone has worked with that can be fine-tuned? Obviously it's best for it to be small enough to run locally due to privacy concerns.
I had some reasonable success with impira/layoutlm-document-qa on Hugging Face, but it was particularly bad at splitting first and last names apart, which is very important for tagging the document into the patient EMR.

Secondly, if I have an Excel spreadsheet containing the last names, first names, and dates of birth of all the patients at a clinic (nearly 20,000 entries), is it possible to train the model to recognize these, and is this a better approach?

Thanks in advance,
Victor

ryanwesslen · October 27, 2023, 6:53pm

Hi @victorsandu!

Thanks for your message and welcome to the Prodigy community!

Have you seen our newly released Prodigy-PDF plugin? As a first step, you can use the image bounding boxes to choose what you want to annotate. As a second step, run the OCR correct recipe, which uses pytesseract out of the box, to correct the OCR'd text for each bounding box. Tesseract works fairly well with black-and-white documents and English text. At that point the text is "digitized", so as a third step you could load it and try ner as normal.
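The third step above (loading the digitized text for NER annotation) could be sketched roughly as follows. This is only an illustration: the `ocr_snippets` records, their `page` field, and the output filename are made-up assumptions, not the plugin's actual output format. The one firm point is that Prodigy's text recipes read JSONL with a `"text"` key per line.

```python
import json
from pathlib import Path

# Hypothetical OCR-corrected snippets, one per annotated bounding box.
# The "page" field and these values are invented for illustration only.
ocr_snippets = [
    {"page": "referral_001.pdf#1", "text": "Patient:  Jane Doe   DOB: 1980-02-14"},
    {"page": "referral_001.pdf#1", "text": "   "},
    {"page": "referral_001.pdf#2", "text": "Dictated by Dr. A. Smith on 2023-10-20"},
]


def to_ner_tasks(snippets):
    """Turn OCR snippets into annotation tasks that carry a "text" key."""
    tasks = []
    for snippet in snippets:
        # Collapse runs of whitespace that OCR tends to introduce.
        text = " ".join(snippet["text"].split())
        if not text:
            continue  # skip boxes whose OCR result is empty
        tasks.append({"text": text, "meta": {"source": snippet["page"]}})
    return tasks


def write_jsonl(tasks, path):
    """Write one JSON object per line (JSONL)."""
    with Path(path).open("w", encoding="utf-8") as f:
        for task in tasks:
            f.write(json.dumps(task, ensure_ascii=False) + "\n")


tasks = to_ner_tasks(ocr_snippets)

if __name__ == "__main__":
    write_jsonl(tasks, "ner_input.jsonl")  # hypothetical output filename
```

A file in this shape can then be used as the source for an NER annotation recipe such as `ner.manual`.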

Does this help?

@koaning also created a great Prodigy short on the plugin.

Hm... so are you trying to train an NER model off of mostly tabular data? I don't think that would work out. NER needs more context to develop a predictive model. I've seen character-based LSTM models, but not with Prodigy/spaCy in practice. I guess I'm not sure: with this spreadsheet as the input, what are you trying to predict?

victorsandu · October 27, 2023, 8:03pm

Thanks!

Currently I'm using the PDF plugin and following it with the OCR step afterwards, as you've mentioned, but I didn't want to go through all of it only to find out this wasn't the best way to do it.

I will continue to annotate in this way and then train the model.

The list was just a thought on possible ways to increase accuracy, as I was having trouble getting accurate results previously. The vast majority of the documents clinics receive are for their own patients, so for this use case there will always be a master list that the model output for first name, last name, and date of birth can be compared to. I'm not sure if there's a way this can increase the model's accuracy.

Thanks for the reply,
Victor
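One way to use such a master list, sketched here as a post-processing step rather than as model training, is to fuzzy-match the extracted first name, last name, and date of birth against the clinic's records and keep only confident matches. The records, the `best_patient_match` helper, and the 0.85 threshold below are all illustrative assumptions. `difflib` from the standard library stands in for whatever record-linkage approach would actually be used.

```python
from difflib import SequenceMatcher

# Hypothetical master list rows: (last name, first name, date of birth).
MASTER_LIST = [
    ("Doe", "Jane", "1980-02-14"),
    ("Doe", "John", "1975-07-01"),
    ("Smith", "Anna", "1980-02-14"),
]


def similarity(a, b):
    """Case-insensitive string similarity in the range [0, 1]."""
    return SequenceMatcher(None, a.lower().strip(), b.lower().strip()).ratio()


def best_patient_match(first, last, dob, master=MASTER_LIST, threshold=0.85):
    """Return (record, score) for the closest record, or (None, score)
    when the best score falls below the (assumed) threshold."""
    best, best_score = None, 0.0
    for rec_last, rec_first, rec_dob in master:
        # Average the three field similarities; equal weights are an assumption.
        score = (
            similarity(first, rec_first)
            + similarity(last, rec_last)
            + similarity(dob, rec_dob)
        ) / 3
        if score > best_score:
            best, best_score = (rec_last, rec_first, rec_dob), score
    if best_score < threshold:
        return None, best_score
    return best, best_score


# An OCR/model output with a transposed first name still resolves to Jane Doe.
record, score = best_patient_match("Jnae", "Doe", "1980-02-14")
```

In practice the date of birth would probably be better compared exactly after normalizing its format, and documents without a confident match could be routed to manual review.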
