First time user. Wondering if there is much experience with a PDF NER model.
I am attempting to create a model to extract these custom labels: Patient First Name, Patient Last Name, Date of Birth, Date of Document Creation, and Author of the Document. The data set would be unstructured PDFs with clinical information sent to a doctor's office. The goal is simply to take incoming scanned PDFs and file them into the correct patient charts.
Is the best way to do this to annotate a PDF project from scratch, or is there a specific model that anyone has worked with that can be fine-tuned? It should be small enough to run locally because of privacy concerns.
I had some reasonable success with the impira/layoutlm-document-qa model on Hugging Face, but it was particularly bad at splitting the first and last name apart, which is very important for tagging the document to the right patient in the EMR.
Secondly, if I have an Excel spreadsheet containing the last names, first names, and dates of birth of all the patients at a clinic (nearly 20,000 entries), is it possible to train the model to recognize these, and is this a better approach?
Thanks for your message and welcome to the Prodigy community!
Have you seen our newly released Prodigy-PDF plugin? As a first step, you can draw bounding boxes on the page images to mark the regions you want to annotate. As a second step, run the OCR correction recipe, which uses pytesseract out of the box, to correct the OCR'd text for each bounding box. Tesseract works fairly well with black-and-white English text. Once the text is "digitized", you can load it and, as a third step, try NER as normal.
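To make the second step more concrete, here is a rough sketch of what OCR on a single bounding box looks like outside of Prodigy. It is not the plugin's actual code. It assumes each rectangular annotation carries `x`, `y`, `width` and `height` in pixels, and the helper names `span_to_box` and `ocr_span` are made up for illustration. The OCR part needs Pillow, pytesseract and a local Tesseract install.

```python
from typing import Dict, Tuple


def span_to_box(span: Dict[str, float]) -> Tuple[int, int, int, int]:
    """Convert a rectangle given as x/y/width/height (assumed format)
    into the (left, upper, right, lower) box that Pillow's crop() expects."""
    left = int(round(span["x"]))
    upper = int(round(span["y"]))
    right = int(round(span["x"] + span["width"]))
    lower = int(round(span["y"] + span["height"]))
    return left, upper, right, lower


def ocr_span(image_path: str, span: Dict[str, float], lang: str = "eng") -> str:
    """Crop one annotated region out of a page image and run Tesseract on it."""
    # Imported lazily so the geometry helper works without the OCR stack installed.
    from PIL import Image
    import pytesseract

    with Image.open(image_path) as page:
        region = page.crop(span_to_box(span))
        return pytesseract.image_to_string(region, lang=lang).strip()
```

Everything stays on your machine in this setup, which fits the privacy constraint you mentioned.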
Does this help?
@koaning also created a great Prodigy short on the plugin too:
Hm... so are you trying to train an NER model off of mostly tabular data? I don't think that would work out. NER needs more surrounding context to learn a predictive model. I've seen LSTMs used for character-based models, but not with Prodigy/spaCy in practice. I guess I'm not sure: with this spreadsheet as the input, what are you trying to predict?
Currently I'm using the PDF plugin and following it with the OCR step, as you've mentioned, but I didn't want to go through all of it only to find out that this wasn't the best way to do it.
I will continue to annotate in this way then train the model.
The list was just a thought on a possible way to increase accuracy, as I was having trouble getting accurate results previously. The vast majority of the documents a clinic receives are for its own patients, so for this use case there will always be a master list that the model's output for first name, last name and date of birth can be compared to. I'm not sure if there's a way this can increase the model's accuracy.
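To illustrate what I mean by comparing against the master list, here is a minimal sketch that uses only the Python standard library. It treats the list as a post-processing lookup rather than as training data. The `Patient` class, the function names and the 0.85 threshold are all assumptions for illustration, not anything from Prodigy or spaCy. Because it scores both "First Last" and "Last First" orderings, the matched record also supplies the first/last split, so the model does not have to get that right.

```python
from dataclasses import dataclass
from datetime import date
from difflib import SequenceMatcher
from typing import Optional, Sequence


@dataclass(frozen=True)
class Patient:
    first: str
    last: str
    dob: date


def name_similarity(extracted: str, patient: Patient) -> float:
    """Best similarity between the extracted name and either name ordering."""
    # Normalise case, commas and repeated whitespace before comparing.
    text = " ".join(extracted.lower().replace(",", " ").split())
    candidates = (
        f"{patient.first} {patient.last}".lower(),
        f"{patient.last} {patient.first}".lower(),
    )
    return max(SequenceMatcher(None, text, c).ratio() for c in candidates)


def best_match(
    extracted_name: str,
    extracted_dob: Optional[date],
    roster: Sequence[Patient],
    threshold: float = 0.85,  # arbitrary cut-off, tune on real data
) -> Optional[Patient]:
    """Return the closest roster entry, or None if nothing is close enough."""
    # Prefer patients whose date of birth matches exactly; otherwise fall
    # back to the whole roster (e.g. when the date was missed or misread).
    pool = [p for p in roster if p.dob == extracted_dob] or list(roster)
    if not pool:
        return None
    top = max(pool, key=lambda p: name_similarity(extracted_name, p))
    return top if name_similarity(extracted_name, top) >= threshold else None


roster = [
    Patient("Jane", "Doe", date(1980, 5, 17)),
    Patient("John", "Doe", date(1975, 1, 2)),
    Patient("Janet", "Dole", date(1990, 3, 3)),
]
match = best_match("Doe, Jane", date(1980, 5, 17), roster)
```

Anything that comes back as `None` could be routed to manual review instead of being filed automatically.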