Prodigy-PDF + OCR

I'm extracting information from text-only images in medical and hospital records in order to build a chat-style question-answering tool for questions such as "Does this patient need complex wound care?" or "Does this patient need an IV?"
When using pdf.image.manual to read the documents, would you recommend specifying two simple, generic labels such as 'title' and 'text' and then relying on the NLP model to sort out specific categories such as 'diseases', 'wounds', 'procedures', 'prescription drugs', 'dose' (not sure whether those last two should be separate labels or one), 'equipment', 'therapy', etc.?
OR
would you recommend initial, specific labels for each category within the pdf.image.manual recipe?
Any help would be greatly appreciated,
Thanks, Rafi

Hi @Rafi,

The choice between generic and specific labels depends largely on your use case and on how complex the information in your documents is.

If your documents have a consistent structure and you're confident that NLP models can accurately extract the information you need, then using generic labels like 'title' and 'text' could work. The NLP model would then be responsible for further categorizing the extracted text.
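
To make the two-stage idea concrete, here is a minimal sketch in plain Python. It assumes you already have OCR'd text for each annotated region. The `regions` structure, the `TERMS` lists and the `categorize` helper are all hypothetical, and the keyword lookup is only a stand-in for whatever NLP model you end up training. It is not part of Prodigy or Prodigy-PDF.

```python
import re

# Hypothetical term lists; in practice a trained model would replace this lookup.
TERMS = {
    "wounds": ["pressure ulcer", "laceration"],
    "equipment": ["wound vac", "iv pump"],
    "therapy": ["physical therapy"],
}

def categorize(region_text):
    """Return {category: [matched terms]} for one region's OCR'd text."""
    lowered = region_text.lower()
    found = {}
    for category, terms in TERMS.items():
        # Whole-word match so that short terms do not fire inside longer words.
        hits = [t for t in terms if re.search(r"\b" + re.escape(t) + r"\b", lowered)]
        if hits:
            found[category] = hits
    return found

# Hypothetical output of the image annotation step: generic labels plus OCR'd text.
regions = [
    {"label": "title", "text": "Discharge Summary"},
    {"label": "text", "text": "Stage 3 pressure ulcer on sacrum; wound vac in place."},
]

# Only regions labelled 'text' are passed on to the categorization stage.
results = [categorize(r["text"]) for r in regions if r["label"] == "text"]
```

The point of the split is that the image labels only decide which regions are worth reading, while the fine-grained categories are decided on the text itself.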

However, if the documents are complex and varied, or if you need more control over the categorization, then using specific labels might be a better approach. This would allow you to directly annotate for 'diseases', 'wounds', 'procedures', 'prescription drugs', 'dose', 'equipment', 'therapy', etc.

As for 'prescription drugs' and 'dose', whether they should be separate labels or not depends on how closely related they are in your context. If the dose is always mentioned alongside the prescription drug and you want to capture that relationship, it might be useful to have a single label for them. If they can appear independently, separate labels might be more appropriate.
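
If you do keep them separate, one option is to link each dose back to its drug afterwards. The sketch below is only an illustration under assumptions: the `Span` class, the `DRUG`/`DOSE` label names and the 20-character `max_gap` are hypothetical choices, and pairing by character distance is a simple heuristic rather than anything built into Prodigy.

```python
from dataclasses import dataclass

@dataclass
class Span:
    start: int   # character offset where the span begins
    end: int     # character offset just past the span
    label: str
    text: str

def make_span(doc_text, phrase, label):
    """Build a Span by locating the first occurrence of phrase in doc_text."""
    start = doc_text.index(phrase)
    return Span(start, start + len(phrase), label, phrase)

def pair_drugs_with_doses(spans, max_gap=20):
    """Pair each DRUG with the closest DOSE that starts within max_gap characters after it."""
    drugs = [s for s in spans if s.label == "DRUG"]
    doses = [s for s in spans if s.label == "DOSE"]
    pairs = []
    for drug in drugs:
        candidates = [d for d in doses if 0 <= d.start - drug.end <= max_gap]
        dose = min(candidates, key=lambda d: d.start - drug.end, default=None)
        pairs.append((drug.text, dose.text if dose else None))
    return pairs

text = "Start vancomycin 1 g IV q12h. Continue heparin."
spans = [
    make_span(text, "vancomycin", "DRUG"),
    make_span(text, "1 g", "DOSE"),
    make_span(text, "heparin", "DRUG"),
]
pairs = pair_drugs_with_doses(spans)
```

Here "vancomycin" is paired with "1 g", while "heparin" has no dose nearby and is paired with `None`. With a single combined label you would lose the ability to represent that second case cleanly.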

Keep in mind that the more specific your labels are, the finer the distinctions your model can learn to make. However, it also means more work during annotation, since every label needs enough examples of its own.

You might also want to consider a hybrid approach, where you start with a few broad categories and then refine them into more specific labels as you understand your data better.
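
One way to keep a hybrid scheme manageable is to record how each specific label rolls up into a broad one, so annotations made at different stages stay comparable. This is a rough sketch; the label names, the `LABEL_PARENTS` mapping and the `coarsen` helper are hypothetical and would need adjusting to your own categories.

```python
from collections import Counter

# Hypothetical two-level scheme: specific label -> broad label.
LABEL_PARENTS = {
    "DISEASE": "CLINICAL_FINDING",
    "WOUND": "CLINICAL_FINDING",
    "PROCEDURE": "TREATMENT",
    "THERAPY": "TREATMENT",
    "DRUG": "MEDICATION",
    "DOSE": "MEDICATION",
}

def coarsen(annotations):
    """Map specific labels to their broad parent; labels not in the mapping pass through unchanged."""
    return [{**a, "label": LABEL_PARENTS.get(a["label"], a["label"])} for a in annotations]

annotations = [
    {"text": "pressure ulcer", "label": "WOUND"},
    {"text": "cellulitis", "label": "DISEASE"},
    {"text": "wound vac", "label": "EQUIPMENT"},
]

broad = coarsen(annotations)
# Counting per broad label hints at which categories have enough data to be worth splitting.
broad_counts = Counter(a["label"] for a in broad)
```

Broad categories that collect many examples are the natural candidates for splitting into more specific labels later on.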

Remember, the goal is to create labels that will help your model learn the distinctions that are important for your task.

Hope this helps!
