Processing CVs / resumes

Hi there,

I am currently working with differently structured data, more specifically CVs / resumes, and I am trying to extract entities like the person's name, address, job titles, timespans and so on. However, the best precision I have gotten so far is a mere 20% using my model based on the standard German model (de_core_news_sm). I think the problem is that CVs are not written in whole sentences and are structured differently, so Prodigy / spaCy is not able to identify the entities correctly because of the missing context.

On the other hand, I am not entirely sure whether my input data was correctly preprocessed. I downloaded about 40 German example CVs in various formats (PDF, DOC, scans), extracted the plain text and processed them line by line. At first I thought this would be the best idea, because CVs don’t have sentences or a uniform layout and sometimes lines are labeled like “Name:” or “Street:”. In the end it didn’t quite work out as expected.

That’s why I came up with a few ideas. There may not be sentences in CVs, but they do have certain structural properties. E.g. a lot of people separate their CVs into paragraphs, sometimes even with horizontal lines (but it’s hard to transfer graphical elements to plain text). I am also wondering if I can use Prodigy/spaCy to annotate and detect these paragraphs and then use different models for personal details, work history, skills and so on. Furthermore, the name and address are most likely in the same paragraph. Perhaps I could then alter my POS tagging in order to recognize these structures, but before I jump to conclusions I would like to hear your input, because maybe I am just missing some details that would solve my problem.

Thanks in advance for any type of input!


I think you’re right that much of the information in the documents you’re dealing with will be in the formatting metadata. There’s no generic way to handle this, as each document will have different ad hoc conventions. I’ve often wondered whether computer vision approaches would actually be best for this.

You might be able to use spaCy to make features that help you segment the text. On the other hand, maybe a different approach will be better. It’s hard to say.

spaCy doesn’t have a statistical model that’s particularly well suited to inserting segment boundaries into documents. For instance, the named entity model isn’t a good fit for tagging whole sections — the features are really designed for smaller phrases that have consistencies in their beginnings and endings. You could try applying the text classifier to whole paragraphs, but depending on your labelling scheme, it may or may not work well.

I know this isn’t the most helpful reply — but unfortunately I don’t have any generic solutions!

Hey Matthew @honnibal,

The above question is about topic segmentation of Curriculum Vitae. Let me share my thoughts on that.

Item 1. Mor, Noam, et al. show the use of bidirectional LSTMs to identify breaks between segments of Wikipedia articles.

Item 2. Textkernel is a company that works on CV parsing. They have a two-stage system: a first model segments the CV into sections and then specialised models handle each individual section. Textkernel treats the topic segmentation task as a sequence labeling problem. They have shared some of their experience with topic segmentation of Curriculum Vitae.

In the beginning, they used an approach based on words and Hidden Markov Models, but that was problematic. From time to time, the model hallucinated new sections just because certain ambiguous words were present. In addition, it is hard for HMMs to take advantage of crucial multi-word clues like section headers (e.g. “Work experience”), presentation clues (e.g. lines that start with dates), etc. By making the simplifying assumption that a line can belong to a single section, Textkernel took advantage of Conditional Random Fields. This assumption also simplified the problem as Textkernel did not have to label sequences of 2000+ words, but sequences of 100+ lines. The improvements were impressive (50% reduction in errors) and this approach has become Textkernel's baseline for all languages.
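To make the line-level setup concrete, here is a minimal sketch of how each line of a CV could be turned into features for a sequence labeler. This is not Textkernel's code: the header keywords, the date pattern and the feature names are all hypothetical, chosen only to illustrate the "section header" and "line starts with a date" clues mentioned above. Per-line feature dicts like these are the kind of input that CRF toolkits typically accept.

```python
import re

# Hypothetical header keywords; a real system would curate or learn these.
SECTION_HEADERS = {"work experience", "education", "skills"}

# Assumed date pattern: an optional "MM/" or "MM." prefix followed by a 4-digit year.
DATE_PREFIX = re.compile(r"^\s*(\d{1,2}[./])?\d{4}\b")


def line_features(lines, i):
    """Build a feature dict for line i, the unit a line-level CRF would label."""
    stripped = lines[i].strip()
    tokens = stripped.split()
    return {
        "is_header": stripped.lower().rstrip(":") in SECTION_HEADERS,
        "starts_with_date": bool(DATE_PREFIX.match(lines[i])),
        "ends_with_colon": stripped.endswith(":"),
        "n_tokens": len(tokens),
        "first_token": tokens[0].lower() if tokens else "",
        "prev_is_blank": i > 0 and not lines[i - 1].strip(),
    }


def cv_to_features(text):
    """Turn one CV (plain text) into a sequence of per-line feature dicts."""
    lines = text.splitlines()
    return [line_features(lines, i) for i in range(len(lines))]


cv = "Jane Doe\n\nWork experience\n2015 - 2018 Software Engineer\n03/2012 Intern"
for feats in cv_to_features(cv):
    print(feats)
```

Labeling lines instead of words is what shrinks the sequence from thousands of tokens to roughly a hundred items per CV, and it lets multi-word clues such as a header become a single feature.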

By the way, an author of "Topic Segmentation of Curriculum Vitae" (2015) introduced a topic boundary detection algorithm that is also based on Conditional Random Fields.

Textkernel switched to a deep-learning-based approach in late 2017. They designed a BiLSTM-CRF model in a flexible way, so that it can handle different sequence labeling problems. They describe the architecture for the phrase-level case as follows:

"When the working entity is a phrase, i.e. sentences or lines, the model is able to generate a phrase representation to feed to the network and label a sequence of phrases. In this case, a Convolutional Neural Network (CNN) is applied to combine embeddings of all tokens into one."

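As a rough illustration of the quoted idea, the sketch below combines the token embeddings of one line into a single fixed-size vector with a 1D convolution followed by ReLU and max-over-time pooling. The pooling choice, the filter width and the toy numbers are assumptions; the quote only says that a CNN combines the token embeddings into one representation. The output has one value per filter regardless of how many tokens the line has, which is what lets a BiLSTM-CRF then treat each line as one step in the sequence.

```python
def conv1d_max_pool(token_vectors, filters, width=3):
    """Convolve each filter over windows of `width` tokens, apply ReLU,
    and keep the maximum activation over all positions."""
    if not token_vectors:
        return [0.0] * len(filters)
    dim = len(token_vectors[0])
    # Zero-pad so that short lines still yield at least one window.
    pad = [[0.0] * dim for _ in range(width // 2)]
    padded = pad + list(token_vectors) + pad
    phrase_vector = []
    for filt in filters:  # each filter is a flat list of width * dim weights
        best = 0.0  # starting at 0.0 doubles as the ReLU floor
        for start in range(len(padded) - width + 1):
            window = [x for vec in padded[start:start + width] for x in vec]
            activation = sum(w * x for w, x in zip(filt, window))
            best = max(best, activation)
        phrase_vector.append(best)
    return phrase_vector


# Toy 2-dimensional "embeddings" for a two-token line (hypothetical values).
line = [[1.0, 0.0], [0.0, 1.0]]
filters = [[1.0] * 6, [-1.0] * 6]
print(conv1d_max_pool(line, filters))
```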
Item 3. There is one more approach to topic segmentation of CVs. The problem of restructuring a CV is posed as a section relabeling problem, where each section of the given CV gets reassigned to a predefined label. The relabeling method relies on semantic relatedness computed between the section header, the section content and the labels, based on phrase embeddings learned from a large pool of CVs.
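A minimal sketch of the relabeling idea, assuming the phrase embeddings already exist: each section is assigned the predefined label whose embedding is most similar to the section's header and content. The cosine measure, the `header_weight` mix and the toy vectors are assumptions for illustration, not the published method.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def relabel_section(header_vec, content_vec, label_vecs, header_weight=0.5):
    """Score every predefined label against the section header and content,
    and return the best label together with all scores."""
    scores = {
        label: header_weight * cosine(header_vec, vec)
        + (1.0 - header_weight) * cosine(content_vec, vec)
        for label, vec in label_vecs.items()
    }
    return max(scores, key=scores.get), scores


# Hypothetical 2-dimensional embeddings; real ones would be learned from CVs.
label_vecs = {"experience": [1.0, 0.0], "education": [0.0, 1.0]}
best, scores = relabel_section([0.9, 0.1], [0.8, 0.3], label_vecs)
print(best, scores)
```

Unlike the sequence labeling approaches above, this assumes the CV has already been split into sections and only decides which standard label each section should carry.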

It seems that a BiLSTM-CRF model is the best option for topic segmentation of Curriculum Vitae. I draw this conclusion mostly because Textkernel uses this model in production and reports good accuracy.

I would like to know your thoughts on using a BiLSTM-CRF based model for topic segmentation of Curriculum Vitae. Thank you in advance.


Hi Andrei,

I'm afraid the question goes a bit beyond the scope of what we can cover as Prodigy support. I haven't worked on CV segmentation myself, and I can't really engage with the papers you've linked. If there are published results for this approach, then it seems reasonable?

Hi Matthew,

Thank you for your reply, and sorry for going off topic.

Yes, it certainly seems so.