Document annotation (from .pdf, .doc, .docx resumes)

Hi there,

I'm currently annotating raw text extracted from documents with the "Textract" library, but I'm running into problems because of the formats my documents are in. I end up with a lot of unstructured raw text, for example:

In the doc :

2010 - 2020 : Senior seller at company
Blabla Description text...

Raw text extracted :

2010 -
2020 : Senior
at company
Description text...

This problem seems to come mainly from table rendering and/or Textract's capabilities.

Because of this, I'm getting low scores on entity recognition, and it will take me a LONG, LONG time to get a usable model.

So, my idea is to convert the docs to HTML to preserve the structure of my documents, and then annotate them in Prodigy. But I don't know if it's the best solution. Has anyone else run into these problems?

I've read that I can use a dataset in this format :

{"text": "my raw text", "html": "<b>My html rendered text</b>"}

I haven't tried it yet. Does it make sense for my case?

I think the problem here goes a bit deeper: ultimately, what you'll be updating your model with is raw text. No matter how you present or annotate your data, at the end of it, you need to feed the model raw text and something you want it to predict (labels, character offsets into the text etc.).

If your data is HTML and you're rendering that, there's no clear answer for how to resolve annotations you create back to the original text, or how to deal with more complex markup. And at the end of it, you still have the same raw fragmented text, with or without added markup. I've explained some of the considerations and reasoning behind this in the following threads:

If you're working with tabular data with very little natural language text, it's possible that approaching this as a basic sequence tagging / NER problem just isn't a good fit. NER works well for tasks where you need to predict exact boundaries based on the surrounding tokens – like mentions of names and concepts in text. But if you have no context and no real text, it's not surprising that you're seeing poor results.

For the specific use case here, I think some more preprocessing and extraction rules can make a big difference. You don't need deep learning to figure out what time periods like "2010 - 2020" are. And if you know that you have a table, you can use a PDF extraction tool that extracts the tables as a CSV (or similar), so you know what text belongs together and is part of the same column and work from there. [job title] at [company] is probably a super common construction, so you can easily cover those and focus on the more difficult cases, and maybe that's where you actually want to start predicting custom things.
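To illustrate the kind of extraction rules described above, here's a minimal sketch. The pattern names and regexes are my own assumptions, deliberately loose, and would need tightening for real resume data:

```python
import re

# Year ranges like "2010 - 2020", tolerating variable whitespace
DATE_RANGE = re.compile(r"(19|20)\d{2}\s*-\s*(19|20)\d{2}")

# "[job title] at [company]" as a common construction; very rough
TITLE_AT_COMPANY = re.compile(r"(?P<title>[A-Z][\w ]+?)\s+at\s+(?P<company>[A-Z][\w&. ]+)")

def extract_rules(text):
    """Return (label, start, end) spans found by the rules."""
    spans = []
    for m in DATE_RANGE.finditer(text):
        spans.append(("DATE", m.start(), m.end()))
    for m in TITLE_AT_COMPANY.finditer(text):
        spans.append(("TITLE", m.start("title"), m.end("title")))
        spans.append(("ORG", m.start("company"), m.end("company")))
    return spans

text = "2010 - 2020 : Senior seller at Company"
print(extract_rules(text))
# → [('DATE', 0, 11), ('TITLE', 14, 27), ('ORG', 31, 38)]
```

Spans in this `(label, start, end)` shape can be resolved against the original text however your pipeline needs them, and anything the rules don't catch is left for the model.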

An alternative approach that I've been seeing more often is framing the whole problem differently, as a computer vision task. This seems to be especially effective if the visual structure of the documents holds a lot of important clues, like in an invoice. So the model would then predict where the recipient or total amount is, and in the next step, you'd use OCR to convert the contents of the bounding box to text. This approach is more involved, though, and potentially overkill for this specific use case.

Yeah, I think my main problem comes from the data, as you said. Maybe for this task I'll get better results by processing the documents and parsing them with other tools, then formatting nice contextual raw text for the NLP step.

I'm going to dig into this. I've got 30K resumes in multiple formats, and I'm going to look at creating a little « formatter » to reformat those documents and add more context.

Thanks for your help !

Just another question: when using ner.correct/ner.manual, the binary (accept/reject) answers aren't used for training unless I specify the binary argument, right?

Yes, that sounds like a good plan. Also, if you haven't done it yet, build a simple rule-based baseline for comparison: see how far you get with more sophisticated PDF parsing and some clever regexes. Then you know what score you need to beat with any ML approach you try.
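A minimal way to score such a baseline against gold annotations might look like this. The `(label, start, end)` span format is an assumption; adapt it to however your annotations are stored:

```python
def prf(gold_spans, pred_spans):
    """Precision, recall and F1 over exact (label, start, end) matches."""
    gold, pred = set(gold_spans), set(pred_spans)
    tp = len(gold & pred)  # true positives: spans the rules got exactly right
    p = tp / len(pred) if pred else 0.0
    r = tp / len(gold) if gold else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f

gold = [("DATE", 0, 11), ("ORG", 31, 38)]
pred = [("DATE", 0, 11), ("ORG", 20, 27)]  # one hit, one boundary miss
print(prf(gold, pred))  # → (0.5, 0.5, 0.5)
```

Running this over a held-out set of annotated resumes gives you the number any trained model has to beat.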

Yes, that's correct. Setting --binary really makes the most sense if your annotations were collected with ner.teach, because that's where you actually get a lot of meaningful "reject" answers that can make a difference. If you're annotating manually, you might as well complete the annotations.


Yeah, as you said, I have to pick the right tool for every entity.
Email/URL can be extracted by regex; it's a common pattern.
I'm going to try to use the model only for the entities that would be too hard/complex to get with regex, like:
ORG, GPE, DEGREE, SCHOOL, PERSON, LANGUAGE, which need particular context to be found,
and get DATE, URL, EMAIL, PHONE with regex.
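For the regex-based entities, a rough sketch of what that split could look like. These patterns are simplified assumptions; real-world email/URL/phone matching needs considerably more care:

```python
import re

# Deliberately simplified patterns for the "easy" entity types
PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "URL": re.compile(r"https?://[^\s,]+"),
    "PHONE": re.compile(r"\+?\d[\d .-]{7,}\d"),
}

def regex_entities(text):
    """Return (label, start, end) spans for all regex-matched entities."""
    return [(label, m.start(), m.end())
            for label, pat in PATTERNS.items()
            for m in pat.finditer(text)]

text = "Contact: jane@example.com, https://example.com, +33 6 12 34 56 78"
print(regex_entities(text))
```

The model then only has to handle the context-dependent labels (ORG, GPE, DEGREE, SCHOOL, PERSON, LANGUAGE), while these spans come for free.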

Thanks for all that information, you helped me a lot.