I am currently working on data that is structured differently from regular text, more specifically CVs / resumes, and I am trying to extract entities like the person's name, address, job titles, time spans and so on. However, the best precision I have gotten so far is a mere 20%, using my model based on the standard German model (de_core_news_sm). I think the problem is that CVs are not written in full sentences and are structured differently, so Prodigy / spaCy cannot identify the entities correctly because of the missing context.
On the other hand, I am not entirely sure whether my input data was preprocessed correctly. I downloaded about 40 German example CVs in various formats (PDF, DOC, scans), extracted the plain text and processed them line by line. At first I thought this would be the best approach, because CVs don’t have sentences or a uniform layout, and sometimes lines are labeled like “Name:” or “Street:”. In the end it didn’t quite work out as expected.
That's why I came up with a few ideas. There may not be sentences in CVs, but they do have certain structural properties. For example, a lot of people separate their CVs into paragraphs, sometimes even with horizontal lines (although it’s hard to transfer graphical elements to plain text). I am wondering if I can use Prodigy/spaCy to annotate and detect these paragraphs and then use different models for personal details, work history, skills and so on. The name and address, for instance, are most likely in the same paragraph. Perhaps I could then alter my POS tagging in order to recognize these structures, but before I jump to conclusions I would like to hear your input, because maybe I am just missing some detail that would solve my problem.
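To make the paragraph idea a bit more concrete, here is a minimal sketch of what I have in mind. It is plain Python without any spaCy or Prodigy calls, and it assumes that section headings can be recognized by keyword. The keyword lists and the names `SECTION_KEYWORDS`, `classify_heading` and `split_into_sections` are made up for illustration, not taken from any library, and a real heading detector would probably need to be trained rather than hard-coded.

```python
# Hypothetical heading keywords per section (assumption: German CVs,
# headings on their own line). A real list would need tuning.
SECTION_KEYWORDS = {
    "personal": ("persönliche daten", "persönliches", "kontakt"),
    "work": ("berufserfahrung", "beruflicher werdegang"),
    "education": ("ausbildung", "studium", "schulbildung"),
    "skills": ("kenntnisse", "fähigkeiten"),
}


def classify_heading(line):
    """Return a section name if the line looks like a heading, else None."""
    text = line.strip().rstrip(":").lower()
    # Assumption: headings are short, so skip empty or long lines.
    if not text or len(text.split()) > 4:
        return None
    for section, keywords in SECTION_KEYWORDS.items():
        if any(text.startswith(keyword) for keyword in keywords):
            return section
    return None


def split_into_sections(cv_text, default="personal"):
    """Group the non-empty lines of a plain-text CV by the last heading seen.

    Lines before the first heading go into `default`, on the assumption
    that a CV usually starts with the personal details.
    """
    sections = {}
    current = default
    for line in cv_text.splitlines():
        if not line.strip():
            continue
        heading = classify_heading(line)
        if heading is not None:
            current = heading  # switch section, drop the heading line itself
            continue
        sections.setdefault(current, []).append(line.strip())
    return sections
```

Each group of lines could then be passed to its own model, e.g. one trained only on personal details and another only on work history, instead of feeding every line to the same NER model.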
Thanks in advance for any type of input!