Documents annotations (from .pdf,.doc,.docx resumes)

ines · March 26, 2020, 1:48pm

I think the problem here goes a bit deeper: ultimately, what you'll be updating your model with is raw text. No matter how you present or annotate your data, at the end of it, you need to feed the model raw text and something you want it to predict (labels, character offsets into the text etc.).

If your data is HTML and you're rendering that, there's no clear answer for how to resolve annotations you create back to the original text, or how to deal with more complex markup. And at the end of it, you still have the same raw fragmented text, with or without added markup. I've explained some of the considerations and reasoning behind this in the following threads:

If you're working with tabular data with very little natural language texts, it's possible that approaching this as a basic sequence tagging / NER problem just isn't a good fit. NER works well for tasks where you need to predict exact boundaries based on the surrounding tokens – like mentions of names and concepts in text. But if you have no context and no real text, it's not surprising that you're seeing poor results.

For the specific use case here, I think some more preprocessing and extraction rules can make a big difference. You don't need deep learning to figure out what time periods like "2010 - 2020" are. And if you know that you have a table, you can use a PDF extraction tool that extracts the tables as a CSV (or similar), so you know what text belongs together and is part of the same column and work from there. [job title] at [company] is probably a super common construction, so you can easily cover those and focus on the more difficult cases, and maybe that's where you actually want to start predicting custom things.

An alternative approach that I've been seeing more often is framing the whole problem differently and as a computer vision task. This seems to be especially effective if the visual strucutre of the documents holds a lot of important clues, like in an invoice. So the model would then predict where the recipient or total amount is, and in the next step, you'd use OCR to convert the contents of the bounding box to text. This approach is more involved, though, and potentially overkill for this specificl use case.

Topic		Replies	Views
Using ner.manual on HTML Input usage , ner , custom	3	2812	October 12, 2018
NER document Labeling ner , solved	25	3695	August 1, 2019
Annotate Raw HTML usage , front-end , solved	2	1063	January 23, 2020
annotating entities in text documents usage , ner , solved	15	9946	November 28, 2017
Ambiguous NER annotation decisions usage , ner , solved , best-practices	12	4682	February 12, 2018

Documents annotations (from .pdf,.doc,.docx resumes)

Related topics