Annotating custom entities in job descriptions

Hi there!

I am currently in the process of annotating custom entities in job descriptions. The terms I am interested in labelling are the required programming/machine learning skills, the languages a candidate should speak and the years of work experience they need.

For language terms (English, Dutch, Spanish) this worked very well, following the standard process of using some seed terms to create a terminology list and then using that list to go through my job descriptions and do the annotations.

For programming languages (Python, R, SQL etc.) it works rather well, but at some point it starts suggesting blank spaces, dots and other random terms, even though I don't have the feeling that I've annotated my whole dataset yet. Is it possible for Prodigy to just stick to the terminology list that I created?

For work experience (0 to 2 years / minimum of 2 years etc.) and machine learning skills (xgboost, decision trees, neural networks) I did not expect the terminology list to work well. However, when using the seed terms on the actual data it does not detect anything and just suggests random words. What should I change in my process to improve on this?

Kind regards,
Björn

Hi! Could you share a bit more about your workflow? Are you trying to train a model from scratch, or are you using a pre-trained model and want to add categories?

If you're using an active learning-powered recipe like ner.teach, keep in mind that it will score the incoming stream to show you the most relevant examples. This also means that it'll skip some examples in favour of others. If you only see random suggestions and no matches, it's possible that there's just not enough relevant text in your data, or that the model wasn't able to learn enough yet to be able to make meaningful suggestions.

If you only want to view and annotate matches from your patterns.jsonl file, you can also use the ner.match recipe. This will go through all examples as they come in. This process is also a good option to give you a better feeling for your data and what's in there – maybe it turns out that you want to add some more patterns to cover more examples. You could even experiment with some more ambiguous patterns – for instance, token-based patterns for single tokens consisting of a certain number of capital letters (SQL, HTML, CSS, R, COBOL etc.). Yeah, you'll probably get some number of false positives that you'll have to reject – but it'll again give you a better feeling for the data and whether it contains a lot of words that may throw off your model.
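
For example, entries in such a patterns.jsonl could look something like this – the label name is just a placeholder, and the shape-based pattern only covers three-letter all-caps tokens (you could add variants like "XXXX", at the cost of more false positives to reject):

{"label": "PROG_LANG", "pattern": [{"lower": "python"}]}
{"label": "PROG_LANG", "pattern": [{"lower": "sql"}]}
{"label": "PROG_LANG", "pattern": [{"shape": "XXX"}]}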

Finally, there are always cases where you want to hand-label the first set of examples to bootstrap a category and pre-train the model effectively. Depending on the entity type and the data, you'd have to experiment with different strategies to find out what works best.

Did you use terms.teach with a list of seed terms? And if so, which vectors were you using? If you were trying to find terms similar to multi-word phrases ("neural networks"), this might explain what's going on here. The en_vectors_web_lg package we ship with spaCy only includes vectors for single tokens, not for merged noun phrases. So you won't find anything for "neural networks", and the resulting most similar terms will be completely random. So you'd either want to train your own vectors that include merged noun phrases (see the sense2vec preprocessing scripts for examples), focus on single tokens, or use a different method to create your initial terminology lists (online lists or databases?).
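
As a quick illustration – this is just a rough sketch of the idea, not the full sense2vec preprocessing pipeline, and en_core_web_lg is only used here because it has a parser for noun chunks:

import spacy

vectors = spacy.load("en_vectors_web_lg")
print(vectors.vocab.has_vector("python"))           # single token – has a vector
print(vectors.vocab.has_vector("neural networks"))  # multi-word string – no vector

# preprocessing sketch: merge noun chunks into single tokens like
# "neural_networks" before training your own vectors on the texts
nlp = spacy.load("en_core_web_lg")

def merge_phrases(text):
    doc = nlp(text)
    with doc.retokenize() as retokenizer:
        for np in list(doc.noun_chunks):
            retokenizer.merge(np)
    return " ".join(token.text.replace(" ", "_") for token in doc)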

Thank you for your response! I am trying to train a model from scratch for the entities that I mentioned. So the steps that I took for languages were as follows:

prodigy dataset language_terms "Seed terms for LANGUAGE label"
prodigy terms.teach language_terms en_core_web_lg --seeds "english, dutch, spanish"
prodigy terms.to-patterns language_terms language_patterns.jsonl --label LANGUAGE
prodigy ner.teach language_ner en_core_web_lg df_english.jsonl --label LANGUAGE --patterns language_patterns.jsonl  
prodigy ner.batch-train language_ner en_core_web_lg --output language-model --label LANGUAGE --eval-split 0.2 --n-iter 6 --batch-size 8
prodigy db-out language_ner language-model

Is it also possible to automatically annotate all mentions of SQL or Python without having to go through them?

I will just build a terminology list for the data science skills then and use ner.match on my data!

One more question. Is it possible to export from Prodigy directly into the following format with the sentence number, word and label?
[screenshot of the desired output format]

If they're in your incoming feed, you do have to annotate them – if Prodigy just quietly accepted them, things would become very unpredictable. However, you can always create the JSONL records in the same format automatically and then add them to your annotated dataset.
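
As a rough sketch of that second option – the term list, label and dataset name below are placeholders, and it assumes the target dataset already exists:

import spacy
from spacy.matcher import PhraseMatcher
from prodigy import set_hashes
from prodigy.components.db import connect

nlp = spacy.blank("en")
matcher = PhraseMatcher(nlp.vocab, attr="LOWER")
matcher.add("PROG_LANG", None, *(nlp(term) for term in ["python", "sql"]))

examples = []
for text in ["Experience with Python and SQL is required."]:  # your raw texts
    doc = nlp(text)
    spans = []
    for _, start, end in matcher(doc):
        span = doc[start:end]
        spans.append({"start": span.start_char, "end": span.end_char,
                      "text": span.text, "label": "PROG_LANG"})
    if spans:
        examples.append(set_hashes({"text": text, "spans": spans, "answer": "accept"}))

db = connect()
db.add_examples(examples, datasets=["skills_ner"])  # placeholder dataset name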

Sure, you just have to write a script that does this. Use db-out or connect to the Prodigy database in Python, process each text with spaCy and extract the token-based tags you need. To create the token-based entity tags from the character offsets, you can use spaCy's biluo_tags_from_offsets helper.
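
A minimal sketch of such a script, assuming your annotations live in a dataset called language_ner (a placeholder) and you only want the accepted examples:

import spacy
from spacy.gold import biluo_tags_from_offsets
from prodigy.components.db import connect

nlp = spacy.load("en_core_web_lg")
db = connect()
examples = db.get_dataset("language_ner")  # placeholder dataset name

for sent_id, eg in enumerate(examples):
    if eg.get("answer") != "accept":
        continue
    doc = nlp(eg["text"])
    offsets = [(s["start"], s["end"], s["label"]) for s in eg.get("spans", [])]
    tags = biluo_tags_from_offsets(doc, offsets)
    for token, tag in zip(doc, tags):
        print(sent_id, token.text, tag)

You may still need to map the BILUO tags to whatever scheme (e.g. IOB) your downstream model expects.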

Thanks for your help, Ines! I'm still struggling a bit though with how to get a complete dataset from my data in a fast way. Using ner.teach I annotated data for two labels; these are the stats:

Dataset language
Created 2019-05-30 14:16:05
Description None
Author None
Annotations 501
Accept 264
Reject 228
Ignore 9

I then exported the dataset using db-out and imported the language.jsonl file into ner.make-gold. However, only a few suggestions were given, and after doing some annotations I ended up with 210 annotations.

My goal is to try out some different models on the data, like a BiLSTM. Is there anything I can change about my process to get a complete dataset relatively fast, or do I just have to spend more time annotating?

Okay, so after reading around on the forum I have come up with the following strategy. Before starting with the annotation, I will split the raw data that I collected into a training and an evaluation set.

  • For the evaluation set I will use ner.manual and start annotating for all of my labels at the same time.
  • For the training set I will start training a model for each label using ner.teach. Once I have enough annotations for each label I will then combine the datasets. A problem that I found with ner.match is that once I output the dataset, there are a lot of duplicated lines whenever there are multiple occurrences of a label in that line. This complicates my post-processing, since I want to create a dataset like the one shown in my previous post.

Question: Is just using ner.teach going to make my training dataset complete enough for it to be used as an input for other models outside of prodigy/spacy?

Yes, that sounds like a good plan!

Yes, because the decisions are binary, each example you accept/reject will be saved as a separate annotation. You can always use the "_input_hash" to find annotations on the same input data. You could also use Prodigy's built-in merge_spans helper to merge the examples, so each input only exists once and contains all annotated spans with their respective accept/reject decisions. Just note that the result will also include the rejected spans, which will have an "answer": "reject" added to the span dict.

from prodigy.models.ner import merge_spans
from prodigy.components.db import connect

db = connect()                             # connect to the Prodigy database
examples = db.get_dataset("your_dataset")  # load all annotations in the dataset
merged_examples = merge_spans(examples)    # one example per input, with all spans
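
If you then only want to keep the accepted spans, a quick filtering step along these lines should do it:

for eg in merged_examples:
    eg["spans"] = [s for s in eg.get("spans", []) if s.get("answer") != "reject"]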

The annotations you collect with ner.teach will be binary – so if you want to train from that directly, your model needs to be able to learn from binary annotations and/or incomplete data. If you need gold-standard annotations (where every token is annotated and all unannotated tokens are not entities), you can always convert your binary "silver" annotations to a "gold" dataset later.

Prodigy v1.8 now ships with a built-in ner.silver-to-gold workflow (example recipe script here), which takes a dataset with binary accept/reject annotations and uses the model to create the best possible analysis of the examples given the constraints defined in the annotations. So basically, all examples will be merged and entities consistent with your accept/reject decisions will be pre-highlighted. You can then correct them manually.
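
For example, roughly like this, where language_gold is a new dataset to save the reviewed annotations to and language_ner is your binary dataset from ner.teach (double-check the argument order against the recipe's --help):

prodigy ner.silver-to-gold language_gold language_ner en_core_web_lg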

Thanks again! I upgraded to Prodigy v1.8, but I'm now consistently getting the following error when using ner.teach. My stats are as follows:

> Version          1.8.2
> Location         C:\Users\nldijkm8\AppData\Roaming\Python\Python37\site-packages\prodigy
> Prodigy Home     C:\Users\nldijkm8\AppData\Roaming\Python\Python37\site-packages\prodigy
> Platform         Windows-10-10.0.17134-SP0
> Python Version   3.7.3
> Database Name    SQLite
> Database Id      sqlite
> Total Datasets   33
> Total Sessions   107
  File "C:\Users\nldijkm8\AppData\Local\Continuum\anaconda3\lib\runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "C:\Users\nldijkm8\AppData\Local\Continuum\anaconda3\lib\runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "C:\Users\nldijkm8\AppData\Roaming\Python\Python37\site-packages\prodigy\__main__.py", line 380, in <module>
    controller = recipe(*args, use_plac=True)
  File "cython_src\prodigy\core.pyx", line 212, in prodigy.core.recipe.recipe_decorator.recipe_proxy
  File "C:\Users\nldijkm8\AppData\Local\Continuum\anaconda3\lib\site-packages\plac_core.py", line 328, in call
    cmd, result = parser.consume(arglist)
  File "C:\Users\nldijkm8\AppData\Local\Continuum\anaconda3\lib\site-packages\plac_core.py", line 207, in consume
    return cmd, self.func(*(args + varargs + extraopts), **kwargs)
  File "C:\Users\nldijkm8\AppData\Roaming\Python\Python37\site-packages\prodigy\recipes\ner.py", line 120, in teach
    nlp = spacy.load(spacy_model)
  File "C:\Users\nldijkm8\AppData\Roaming\Python\Python37\site-packages\spacy\__init__.py", line 27, in load
    return util.load_model(name, **overrides)
  File "C:\Users\nldijkm8\AppData\Roaming\Python\Python37\site-packages\spacy\util.py", line 131, in load_model
    return load_model_from_package(name, **overrides)
  File "C:\Users\nldijkm8\AppData\Roaming\Python\Python37\site-packages\spacy\util.py", line 152, in load_model_from_package
    return cls.load(**overrides)
  File "C:\Users\nldijkm8\AppData\Local\Continuum\anaconda3\lib\site-packages\en_core_web_lg\__init__.py", line 12, in load
    return load_model_from_init_py(__file__, **overrides)
  File "C:\Users\nldijkm8\AppData\Roaming\Python\Python37\site-packages\spacy\util.py", line 190, in load_model_from_init_py
    return load_model_from_path(data_path, meta, **overrides)
  File "C:\Users\nldijkm8\AppData\Roaming\Python\Python37\site-packages\spacy\util.py", line 173, in load_model_from_path
    return nlp.from_disk(model_path)
  File "C:\Users\nldijkm8\AppData\Roaming\Python\Python37\site-packages\spacy\language.py", line 791, in from_disk
    util.from_disk(path, deserializers, exclude)
  File "C:\Users\nldijkm8\AppData\Roaming\Python\Python37\site-packages\spacy\util.py", line 630, in from_disk
    reader(path / key)
  File "C:\Users\nldijkm8\AppData\Roaming\Python\Python37\site-packages\spacy\language.py", line 781, in <lambda>
    deserializers["tokenizer"] = lambda p: self.tokenizer.from_disk(p, exclude=["vocab"])
  File "tokenizer.pyx", line 391, in spacy.tokenizer.Tokenizer.from_disk
  File "tokenizer.pyx", line 437, in spacy.tokenizer.Tokenizer.from_bytes
  File "C:\Users\nldijkm8\AppData\Local\Continuum\anaconda3\lib\re.py", line 234, in compile
    return _compile(pattern, flags)
  File "C:\Users\nldijkm8\AppData\Local\Continuum\anaconda3\lib\re.py", line 286, in _compile
    p = sre_compile.compile(pattern, flags)
  File "C:\Users\nldijkm8\AppData\Local\Continuum\anaconda3\lib\sre_compile.py", line 764, in compile
    p = sre_parse.parse(p, flags)
  File "C:\Users\nldijkm8\AppData\Local\Continuum\anaconda3\lib\sre_parse.py", line 930, in parse
    p = _parse_sub(source, pattern, flags & SRE_FLAG_VERBOSE, 0)
  File "C:\Users\nldijkm8\AppData\Local\Continuum\anaconda3\lib\sre_parse.py", line 426, in _parse_sub
    not nested and not items))
  File "C:\Users\nldijkm8\AppData\Local\Continuum\anaconda3\lib\sre_parse.py", line 536, in _parse
    code1 = _class_escape(source, this)
  File "C:\Users\nldijkm8\AppData\Local\Continuum\anaconda3\lib\sre_parse.py", line 337, in _class_escape
    raise source.error('bad escape %s' % escape, len(escape))
re.error: bad escape \p at position 257

Did you update your spaCy models? And what does it say when you're running python -m spacy validate? From the error, it sounds like your models might be incompatible with the latest version of spaCy.

Had to update the pre-trained statistical models, all good now! :grinning: