I am currently working on differently structured data, specifically CVs / resumes, and I am trying to extract entities like the person's name, address, job titles, time spans and so on. However, the best precision I have achieved so far is a mere 20%, using my model based on the standard German model (de_core_news_sm). I think the problem is that CVs are not written in whole sentences and are structured differently, so Prodigy / spaCy is not able to identify the entities correctly because of the missing context.
On the other hand, I am not entirely sure whether my input data was preprocessed correctly. I downloaded about 40 German example CVs in various formats (PDF, DOC, scans), extracted the plain text and processed them line by line. At first I thought this would be the best approach, because CVs don't consist of sentences or have a uniform layout, and lines are sometimes labelled like "Name:" or "Street:". In the end it didn't quite work out as expected.
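For the explicitly labelled lines like "Name:" or "Street:", a rule-based pass before any statistical model can already recover some fields. A minimal sketch of what I mean (the label-to-entity mapping here is just an illustrative assumption):

```python
import re

# Hypothetical mapping of German CV field labels to entity types.
LABELS = {
    "Name": "PERSON",
    "Straße": "STREET",
    "Adresse": "ADDRESS",
}

# Matches lines like "Name: Max Mustermann".
LABEL_RE = re.compile(r"^\s*(?P<label>[^:]+):\s*(?P<value>.+)$")

def extract_labelled_fields(lines):
    """Return (entity_type, value) pairs for explicitly labelled lines."""
    fields = []
    for line in lines:
        m = LABEL_RE.match(line)
        if m and m.group("label").strip() in LABELS:
            fields.append((LABELS[m.group("label").strip()], m.group("value").strip()))
    return fields
```

Lines without a known label would then be left for the statistical model.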
That's why I came up with a few ideas. There may not be sentences in CVs, but they do have certain structural properties. For example, many people separate their CVs into paragraphs, sometimes even with horizontal lines (although it's hard to transfer graphical elements to plain text). I am also wondering whether I could use Prodigy/spaCy to annotate and detect these paragraphs and then use different models for personal details, work history, skills and so on. Furthermore, the name and address are most likely in the same paragraph. Perhaps I could then alter my POS tagging to recognise these structures, but before I jump to conclusions I would like to hear your input, because maybe I am just missing some detail that would solve my problem.
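To experiment with the paragraph idea, my first step would be splitting the extracted plain text on runs of blank lines, assuming the text extraction preserves empty lines between sections (which it may not for scans):

```python
def split_paragraphs(text):
    """Split plain text into paragraphs on runs of blank lines."""
    paragraphs = []
    current = []
    for line in text.splitlines():
        if line.strip():
            current.append(line)
        elif current:
            paragraphs.append("\n".join(current))
            current = []
    if current:
        paragraphs.append("\n".join(current))
    return paragraphs
```

Each paragraph could then be annotated and modelled separately.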
I think you’re right that much of the information in the documents you’re dealing with will be in the formatting metadata. There’s no generic way to handle this, as each document will have different ad hoc conventions. I’ve often wondered whether computer vision approaches would actually be best for this.
You might be able to use spaCy to make features that help you segment the text. On the other hand, maybe a different approach will be better. It’s hard to say.
spaCy doesn’t have a statistical model that’s particularly well suited to inserting segment boundaries into documents. For instance, the named entity model isn’t a good fit for tagging whole sections — the features are really designed for smaller phrases that have consistencies in their beginnings and endings. You could try applying the text classifier to whole paragraphs, but depending on your labelling scheme, it may or may not work well.
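Before investing in a statistical text classifier, a crude keyword baseline over whole paragraphs can at least tell you whether your labelling scheme is separable. A rough sketch (the section names and keyword lists are illustrative assumptions, not something spaCy provides):

```python
# Illustrative keyword lists for common German CV sections.
SECTION_KEYWORDS = {
    "PERSONAL": ["name", "geboren", "adresse", "straße"],
    "WORK": ["berufserfahrung", "firma", "tätigkeit"],
    "SKILLS": ["kenntnisse", "sprachen", "edv"],
}

def classify_paragraph(paragraph):
    """Assign the section whose keywords occur most often, or None."""
    text = paragraph.lower()
    scores = {
        section: sum(text.count(kw) for kw in kws)
        for section, kws in SECTION_KEYWORDS.items()
    }
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None
```

If even this baseline separates your sections reasonably well, a trained text classifier should do at least as well.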
I know this isn’t the most helpful reply — but unfortunately I don’t have any generic solutions!
The above question is about topic segmentation of CVs (curricula vitae). Let me share my thoughts on that.
1. Mor, Noam, et al. show the use of bidirectional LSTMs to identify breaks between segments of Wikipedia articles.
2. The company Textkernel works on CV parsing. They have a two-stage system: a first model segments the CV into sections, and then specialised models handle each individual section. Textkernel treats topic segmentation as a sequence labelling problem, and they have shared some of their experience with topic segmentation of CVs.
Textkernel switched to a deep-learning-based approach in late 2017. They designed their BiLSTM-CRF model in a flexible way, so that it can be applied to different sequence labelling problems. An architecture of the model for the phrase extraction case is shown below.
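Framing segmentation as sequence labelling means tagging each line (or sentence) of the CV with a section label, often in a BIO scheme, so that a model like a BiLSTM-CRF can learn the transitions between sections. A sketch of how such training data might be represented, with illustrative tags (this is my own example, not Textkernel's format):

```python
# Each CV becomes a sequence of (line, tag) pairs; "B-" marks the
# first line of a section, "I-" a continuation (BIO scheme).
cv_lines = [
    ("Max Mustermann", "B-PERSONAL"),
    ("Musterstraße 1, Berlin", "I-PERSONAL"),
    ("Berufserfahrung", "B-WORK"),
    ("2010-2015 Entwickler bei Firma XY", "I-WORK"),
]

def section_spans(tagged_lines):
    """Group consecutive BIO-tagged lines into (section, lines) spans."""
    spans = []
    for line, tag in tagged_lines:
        section = tag.split("-", 1)[1]
        if tag.startswith("B-") or not spans or spans[-1][0] != section:
            spans.append((section, [line]))
        else:
            spans[-1][1].append(line)
    return spans
```

The model's job would be predicting the tag sequence; recovering the section spans from the tags is then mechanical, as above.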
Question
It seems that using a BiLSTM-CRF model is the best option for topic segmentation of CVs. I draw this conclusion mostly because Textkernel uses this model in production and reports good accuracy.
I would like to know your thoughts on using a BiLSTM-CRF-based model for topic segmentation of CVs. Thank you in advance.
I'm afraid this question goes a bit beyond the scope of what we can cover as Prodigy support. I haven't worked on CV segmentation myself, and I can't really engage with the papers you've linked. But if there are published results for this approach, it seems like a reasonable choice?