Multi-word entity seeding, entity context


(Radim) #1

Hi, I’m evaluating Prodigy for NER annotations and struggling with the workflow. Can you please help?

I read the tutorials but still don’t understand how to kickstart the process. The tutorials are asking me to either:

  1. Start with an existing model and keep training to refine it (not an option).

  2. Enter some seed words or seed patterns. The entities are not single words, and are often unique strings (OOV), so seed words don’t make sense (I guess?). I have no idea how to enter a seed pattern for the type of entities we need: the example options I saw (a fixed number of lemmas, POS) seem too crude. The entities are defined mostly by their context (the words around them) and a little bit by their shape (capitalization) and length. So intuitively, I’d expect to be defining these signals to help the active learning process. Trying to define the entities by their tokens doesn’t look like a feasible approach.

Can I simply supply pre-annotated examples to start the process? Would that work with Prodigy’s active learning?

Or maybe taking a step back: is Prodigy the right tool for annotating our own NER entities of this type? What are its limitations, conceptually?

(Matthew Honnibal) #2

Hi Radim,

The third option here is to start with just the ner.manual interface, which is definitely the most efficient for some situations. You can also import annotations with prodigy db-in, to populate a dataset with pre-annotated sentences. Once you have the dataset, you can use prodigy ner.batch-train to train a model on them. You can also train an NER model with spaCy separately, e.g. using spacy train, to create the initial NER model.

Taking a step back, the idea of boot-strapping the model with rules is more efficient in some cases than others. It’s especially efficient when you have a fairly focussed semantic class. If what you’re annotating is more like the slot labels of a semantic frame, the patterns-based approach is a bit less powerful. That said, you can still often get value out of more abstract patterns. The patterns file takes a .jsonl format, where each line describes a spaCy matcher rule. For instance:

{"label": "GENE", "pattern": [{"shape": "XXXX"}]}

This pattern rule would suggest the label GENE for tokens that have the shape of four upper-case letters. Obviously this will give you lots of false positives, but clicking through those should be pretty fast.

You can construct the patterns file in a separate script, as it’s pretty easy to output the patterns (just use json.dumps() on a Python dict). Sometimes it’s helpful to build a dictionary of likely head words, and another dictionary of likely modifiers. For instance, let’s say you want to classify sports teams, like “San Francisco Giants” or “New York Yankees”. You could train some word vectors that include longer phrases, either using mutual information or by pre-recognising entities etc. But another way would be to simply use the team names like Giants, Chargers, Bulls etc. as the seed terms, and create a list with just those words. Then separately make another list with places, and then make a list with the product of the two.
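As a sketch of that head-word/modifier idea, the script below builds a patterns file from two toy lists (the team and place names, and the TEAM label, are made-up examples, not anything shipped with Prodigy):

```python
import json
from itertools import product

# Hypothetical seed dictionaries: likely head words (team names)
# and likely modifiers (places).
teams = ["Giants", "Chargers", "Bulls"]
places = ["San Francisco", "New York", "Chicago"]

patterns = []
# Single-token patterns for the bare team names.
for team in teams:
    patterns.append({"label": "TEAM", "pattern": [{"lower": team.lower()}]})
# Multi-token patterns for every place + team combination.
for place, team in product(places, teams):
    tokens = place.split() + [team]
    patterns.append(
        {"label": "TEAM", "pattern": [{"lower": t.lower()} for t in tokens]}
    )

# One JSON object per line, i.e. the .jsonl format described above.
with open("team_patterns.jsonl", "w", encoding="utf8") as f:
    for entry in patterns:
        f.write(json.dumps(entry) + "\n")
```

The product of 3 places and 3 teams plus the 3 bare names gives 12 rules; false positives like “Chicago Giants” are fine here, since the point is high-recall suggestions you can quickly reject.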

The baseline strategy of just using ner.manual to tag the first 1000-2000 entity occurrences also works quite well. If the entity density is low, you’ll need to click through a lot of sentences — but clicking through sentences with no entities is very quick, as it’s just a single “accept” decision.

(Radim) #3

Thanks Matt, that’s a lot of info. Just to make sure I understand:

  1. If I go the ner.manual route, there’s no active learning.
  2. The only way to use active learning is to specify word seeds or word rules (no way to start from existing annotation examples).
  3. The rules don’t allow for describing (“seeding”) variable-length (multi-word) entities, or entities that depend primarily on their context (rather than in-entity words or features).

Is this correct?

Is there a list of signals (features) that the Prodigy model uses to do its spotting? Just so I have a better idea of what it can/cannot learn, in principle.

(Radim) #4

Ping @honnibal – this is currently blocking, I’m unsure how / if Prodigy applies to my annotation task. Thanks.

(Matthew Honnibal) #5

@piskvorky Just got back from travelling – sorry about the delay.

Not quite. I think the situation’s a bit simpler than it’s coming across. Let me try again to explain.

As soon as you have an NER model trained, you can improve it with ner.teach, and make use of the active learning. You can get an initial NER model with any of the following approaches:

  1. Use ner.manual, followed by ner.batch-train
  2. Use ner.teach with a patterns file
  3. Use external annotations, and spacy train
  4. Create a custom recipe that sets annotations in some other way.

Any of these approaches to boot-strapping will work. Taking a step back and thinking about the requirements, the goal is to get some process that suggests entities in context with high recall. If you can get to that point, you can click through the suggestions really quickly. The answers you make can then be used to train a better model, until you can finally bootstrap to the point where the model is pretty accurate, and you can create gold-standard data (with complete and correct annotations on whole sentences) very quickly. Once you’ve got a gold-standard corpus, you can train and evaluate with any NER system.

You can definitely create rules that reference longer phrases. You might find the interactive demo of spaCy’s rule-based matcher useful. The --patterns (-pt) argument to ner.teach takes a .jsonl file, where each line is a matcher rule. A matcher rule is a list of token patterns, where each token pattern specifies attribute/value constraints and optionally an operator. So you can write patterns that reference annotations like part-of-speech tags, entity labels, dependency labels etc.
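For instance, an operator lets one rule cover variable-length spans. This is a sketch in the lowercase key style of the GENE example above (spaCy’s current Matcher documentation spells these keys in uppercase, e.g. "IS_TITLE", "OP", so check the casing against your version):

```python
import json

# "op": "+" means "one or more tokens matching this description",
# so a single rule can cover "Bulls" as well as "New York Yankees".
patterns = [
    # one or more consecutive title-cased tokens
    {"label": "TEAM", "pattern": [{"is_title": True, "op": "+"}]},
    # one or more consecutive nouns (needs a pipeline with a tagger)
    {"label": "MEDICAL_CONDITION", "pattern": [{"pos": "NOUN", "op": "+"}]},
]
# Emit one rule per line, ready to save as a patterns .jsonl file.
for entry in patterns:
    print(json.dumps(entry))
```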

(Radim) #6

Welcome back @honnibal :slight_smile:

I was specifically wondering about seeding Prodigy with existing (known) rules/annotations, to build that initial model. The answer seems to be “train a model separately with spaCy first”, since the Prodigy seeding patterns do not support context (I think).

Would you mind giving some hints as to what signals are used internally in the model? What information goes into the spotting, conceptually?

(Matthew Honnibal) #7

You can provide Prodigy with any arbitrary Python generator, so you can definitely do that yourself. Have a look at the recipe scripts provided in the Prodigy installation, in prodigy/recipes. You can modify these or make your own custom recipes that wrap the existing ones.
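Since streams are plain generators, feeding in pre-annotated examples could be sketched like this (the texts, offsets and labels are made-up; the task shape with "text" and "spans" follows Prodigy’s documented format):

```python
def preannotated_stream(examples):
    """Yield Prodigy-style tasks from (text, spans) pairs produced
    by your own rules or an external annotation tool."""
    for text, spans in examples:
        yield {
            "text": text,
            "spans": [
                {"start": start, "end": end, "label": label}
                for start, end, label in spans
            ],
        }

# Hypothetical pre-annotated data: (start, end) are character offsets.
data = [("BRCA1 is a human gene.", [(0, 5, "GENE")])]
tasks = list(preannotated_stream(data))
```

A custom recipe would return such a generator as its stream, so the annotator only has to accept or reject each suggestion.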

I’m not sure what you’re asking. Are you wondering what features are used in spaCy’s NER model? It’s a depth-4 CNN, so in theory each word vector’s representation becomes sensitive to the surrounding context up to 4 words previous or following.


Hi, this has just become a blocker for me. Please excuse me if you’ve already answered this, but what I take from @piskvorky is this: when I bootstrap using seed terms, is there a way of using sequences of tokens as seeds? Something like the following is straightforward from the tutorials:

prodigy terms.teach … --seeds "headache, tumor"

but what if I want to do something as follows:

prodigy terms.teach … --seeds "heart attack, cancer"

What I am after is the active learning throwing up sequences of tokens, rather than just single tokens, when annotating on the basis of my seeds. In addition, is there a built-in recipe that would take the resulting annotations and produce a patterns.jsonl?

Loving prodigy btw, very powerful.

(Ines Montani) #9

To find similar terms, the terms.teach recipe will iterate over the vector model’s vocabulary and compare the vector of each vocabulary entry to the target vector. spaCy’s en_vectors_web_lg and many other pre-trained vectors you can download only contain single tokens, so they’re not going to have a vector for “heart attack”. So your example would work in theory, but not in practice.

Maybe you’ll be able to find medical word vectors that were trained on merged noun phrases. Alternatively, you could pre-process your text with spaCy, merge phrases like “heart attack” into single tokens and then train your own vectors, e.g. using Gensim. You might also want to check out sense2vec, which shows a similar approach.
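The pre-processing idea could be sketched like this: merge known phrases into single tokens before handing the sentences to your vector trainer. In a real pipeline you’d get the phrases from spaCy’s noun chunks or entities rather than a hand-made list, and you’d feed the result to e.g. Gensim’s Word2Vec:

```python
def merge_phrases(tokens, phrases):
    """Replace known multi-word phrases in a token list with single
    underscore-joined tokens, matching longer phrases first."""
    phrase_tokens = sorted((p.split() for p in phrases), key=len, reverse=True)
    out, i = [], 0
    while i < len(tokens):
        for ph in phrase_tokens:
            if tokens[i:i + len(ph)] == ph:
                out.append("_".join(ph))
                i += len(ph)
                break
        else:
            out.append(tokens[i])
            i += 1
    return out

sentence = "the patient suffered a heart attack last year".split()
merged = merge_phrases(sentence, ["heart attack"])
# The merged sentences can then be passed to a word2vec implementation,
# so "heart_attack" gets its own vector in the vocabulary.
```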

Yes, the terms.to-patterns recipe does that! Note that this recipe currently also expects that each term is one token. If you want to create multi-token patterns, you’d have to create those yourself:

{"label": "MEDICAL_CONDITION", "pattern": [{"lower": "heart"}, {"lower": "attack"}]}
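Creating those multi-token patterns yourself is only a few lines. This sketch splits terms on whitespace, which may differ from spaCy’s tokenization around punctuation, so double-check terms containing hyphens or apostrophes:

```python
import json

def terms_to_patterns(terms, label):
    """Turn accepted multi-word terms into match patterns, with one
    token pattern per whitespace-separated word."""
    return [
        {"label": label, "pattern": [{"lower": w.lower()} for w in term.split()]}
        for term in terms
    ]

entries = terms_to_patterns(["heart attack", "cancer"], "MEDICAL_CONDITION")
# Write one JSON object per line, the .jsonl format patterns files use.
with open("patterns.jsonl", "w", encoding="utf8") as f:
    for entry in entries:
        f.write(json.dumps(entry) + "\n")
```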


I came across the terms.train-vectors recipe and tried the following:

prodigy terms.train-vectors ./models raw.json --spacy-model en_vectors_web_lg --lang en

my raw.json is a list of dicts with a “text” field. Here is my trace:

13:06:06 - 'pattern' package not found; tag filters are not available for English
13:06:06 - collecting all words and their counts
Traceback (most recent call last):
  File "/home/haroon/anaconda3/lib/python3.6/", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/home/haroon/anaconda3/lib/python3.6/", line 85, in _run_code
    exec(code, run_globals)
  File "/home/haroon/anaconda3/lib/python3.6/site-packages/prodigy/", line 331, in <module>
    controller = recipe(*args, use_plac=True)
  File "cython_src/prodigy/core.pyx", line 211, in prodigy.core.recipe.recipe_decorator.recipe_proxy
  File "/home/haroon/anaconda3/lib/python3.6/site-packages/", line 328, in __call__
    cmd, result = parser.consume(arglist)
  File "/home/haroon/anaconda3/lib/python3.6/site-packages/", line 207, in consume
    return cmd, self.func(*(args + varargs + extraopts), **kwargs)
  File "/home/haroon/anaconda3/lib/python3.6/site-packages/prodigy/recipes/", line 99, in train_vectors
  File "/home/haroon/anaconda3/lib/python3.6/site-packages/gensim/models/", line 783, in __init__
  File "/home/haroon/anaconda3/lib/python3.6/site-packages/gensim/models/", line 759, in __init__
    self.build_vocab(sentences=sentences, corpus_file=corpus_file, trim_rule=trim_rule)
  File "/home/haroon/anaconda3/lib/python3.6/site-packages/gensim/models/", line 936, in build_vocab
    sentences=sentences, corpus_file=corpus_file, progress_per=progress_per, trim_rule=trim_rule)
  File "/home/haroon/anaconda3/lib/python3.6/site-packages/gensim/models/", line 1591, in scan_vocab
    total_words, corpus_count = self._scan_vocab(sentences, progress_per, trim_rule)
  File "/home/haroon/anaconda3/lib/python3.6/site-packages/gensim/models/", line 1560, in _scan_vocab
    for sentence_no, sentence in enumerate(sentences):
  File "/home/haroon/anaconda3/lib/python3.6/site-packages/prodigy/recipes/", line 33, in __iter__
    for sent in doc.sents:
AttributeError: 'NoneType' object has no attribute 'sents'

Should I be sending spaCy Doc objects? Not sure how I would do that from the command line. Any help appreciated.

Also the documentation on this recipe says it can be used with a sense2vec model. Would that mean using such a model (previously trained) for the --spacy-model argument?

And finally, am I right in assuming that training my vector model with merged entities and noun phrases will NOT, by itself, result in terms.teach asking multi-token questions from my seeds (what I am ultimately after)? If I understand it correctly, spaCy/Prodigy will tokenize my seeds on whitespace.

(Matthew Honnibal) #11

I think you’ve hit a gap in our error messaging here, so thanks for the traceback. We try to make sure both spaCy and Prodigy report good errors, and avoid this type of situation where you hit an arbitrary “thing doesn’t exist” sort of error at some point down the pipeline. It’s not easy to think of everything up front though, so it’ll always be a continuing process.

The particular problem here is that the en_vectors_web_lg model doesn’t have any analysis components in its pipeline. This means there’s no sentence boundary detection, and also no NER and no noun phrase identification. You’re asking for merged entities and merged NPs, which aren’t available.

The solution should be as simple as using the en_core_web_lg model instead.

Let’s say you have a sentence like I like Pink Floyd and experimental electronica. The merge-NER and merge-NPs options basically pre-process this into something like I like Pink_Floyd and experimental_electronica, so that you get a single token for each merged phrase. (It doesn’t actually insert underscores, I just did that for illustration.) You’ll then have terms for Pink Floyd and experimental electronica in your vocabulary, and these terms will have a vector assigned to them. terms.teach may then suggest these whitespace-delimited terms, and you can use them as seed terms.


that fixed it! thanks much.

Got it. So my approach is going to be

  1. Train my customised model with nouns and entities merged.
  2. Use that model in terms.teach, which will result in the active learning throwing up some multi-token possibilities (since the active learning ranges over the vocabulary in the model).
  3. When creating a patterns.jsonl, use my own recipe that takes my multi-token annotations and turns them into tokenized patterns that can be consumed by Prodigy’s other recipes that ask me to annotate in context.

I’ll post my process from start to finish, in case anyone finds it useful, if the above approach works.

Thanks much!

One other thing I noticed when using terms.train-vectors, not really a big deal: when you pass output_model from a script, it’s important to pass it as a pathlib.Path instance and not a string object, otherwise you get an error about output_model not having an exists attribute. This is not a problem when working from the command line.
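If the recipe does call .exists() on the argument (as the error suggests), the fix from a script boils down to converting first, since plain strings have no such method:

```python
from pathlib import Path

# Convert the output directory to a Path before passing it to the
# recipe from a script; a str has no .exists() method, a Path does.
output_model = Path("./models")
```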