Multi-word entity seeding, entity context

Hi, I’m evaluating Prodigy for NER annotations and struggling with the workflow. Can you please help?

I read the tutorials but still don’t understand how to kickstart the process. The tutorials are asking me to either:

  1. Start with an existing model and keep training to refine it (not an option).

  2. Enter some seed words or seed patterns. The entities are not single words, and are often unique strings (OOV), so seed words don’t make sense (I guess?). I have no idea how to enter a seed pattern for the type of entities we need: the example options I saw (a fixed number of lemmas, POS) seem too crude. The entities are defined mostly by their context (words around them) and a little bit by their shape (capitalization) and length. So intuitively, I’d expect to be defining these signals to help the active learning process. Trying to define the entities by their tokens doesn’t look like a feasible approach.

Can I simply supply pre-annotated examples to start the process? Would that work with Prodigy’s active learning?

Or maybe taking a step back: is Prodigy the right tool for annotating our own NER entities of this type? What are its limitations, conceptually?

Hi Radim,

The third option here is to start with just the ner.manual interface, which is definitely the most efficient for some situations. You can also import annotations with prodigy db-in, to populate a dataset with pre-annotated sentences. Once you have the dataset, you can use prodigy ner.batch-train to train a model on them. You can also train an NER model with spaCy separately, e.g. using spacy train, to create the initial NER model.
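A minimal sketch of what pre-annotated input for db-in could look like, assuming the usual Prodigy task format of a "text" plus character-offset "spans". The sentences, the PRODUCT label and the helper name make_example are made up for illustration, and marking each record with "answer": "accept" is an assumption about how you want the imported examples to be treated.

```python
import json

def make_example(text, entities, label):
    """Build one task dict with character-offset spans for each known entity string."""
    spans = []
    for entity in entities:
        start = text.find(entity)
        if start == -1:
            continue  # entity string not present in this text
        spans.append({"start": start, "end": start + len(entity), "label": label})
    return {"text": text, "spans": spans, "answer": "accept"}

# Hypothetical pre-annotated data: (sentence, entity strings found in it)
annotated = [
    ("We shipped Falcon X9 last week.", ["Falcon X9"]),
    ("Nothing to tag here.", []),
]

examples = [make_example(text, ents, "PRODUCT") for text, ents in annotated]
jsonl = "\n".join(json.dumps(eg) for eg in examples)
print(jsonl)
```

The printed lines could be saved to a .jsonl file and imported with prodigy db-in. Note that text.find only locates the first occurrence of each entity string, so real data with repeated mentions would need a more careful offset lookup.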

Taking a step back, the idea of bootstrapping the model with rules is more efficient in some cases than others. It’s especially efficient when you have a fairly focussed semantic class. If you’re annotating something that’s more like the slot labels of a semantic frame, the patterns-based approach is a bit less powerful. That said, you can still often get value out of more abstract patterns. The patterns file takes a .jsonl format, where each line describes a spaCy matcher rule. For instance:

{"label": "GENE", "pattern": [{"shape": "XXXX"}]}

This pattern rule would suggest the label GENE for tokens that have the shape of four upper-case letters. Obviously this will give you lots of false positives, but clicking through those should be pretty fast.
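To get a feel for which tokens a shape pattern would pick up, here is a rough approximation of a word-shape function in plain Python. It is a sketch, not spaCy's actual code: the function name approx_shape is made up, and capping runs of the same shape character at four is an assumption about how the shape feature behaves.

```python
def approx_shape(text):
    """Approximate word shape: X for upper case, x for lower case, d for digits.

    Other characters are kept as-is. Runs of the same shape character are
    capped at four (an assumption about spaCy's behaviour, not a guarantee).
    """
    shape = []
    last = ""
    run = 0
    for char in text:
        if char.isalpha():
            shape_char = "X" if char.isupper() else "x"
        elif char.isdigit():
            shape_char = "d"
        else:
            shape_char = char
        if shape_char == last:
            run += 1
        else:
            run = 1
            last = shape_char
        if run <= 4:
            shape.append(shape_char)
    return "".join(shape)

for word in ["BRCA", "NASA", "Tp53", "gene", "HOXA11"]:
    print(word, approx_shape(word))
```

BRCA and NASA end up with the same shape, which is where the false positives come from.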

You can construct the patterns dictionary in a separate script, as it’s pretty easy to output the patterns (just use json.dumps() on a Python dict). Sometimes it’s helpful to build a dictionary of likely head words, and another dictionary of likely modifiers. For instance, let’s say you want to classify sports teams, like “San Francisco Giants” or “New York Yankees”. You could train some word vectors that have longer phrases, either using mutual information or by pre-recognising entities etc. But another way would be to simply use the team-names like Giants, Chargers, Bulls etc as the seed terms, and create a list with just those words. Then separately make another list with places, and then make a list with the product of the two.
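The product-of-two-lists idea could be sketched like this. The lists, the SPORTS_TEAM label and the helper to_pattern are placeholders, and splitting on whitespace assumes the tokenizer splits these names the same way.

```python
import json
from itertools import product

places = ["San Francisco", "New York", "Chicago"]
teams = ["Giants", "Yankees", "Bulls"]

def to_pattern(phrase, label):
    # One token pattern per whitespace-separated word, matched on the lowercase form.
    return {"label": label, "pattern": [{"lower": word.lower()} for word in phrase.split()]}

# Every place combined with every team name, e.g. "Chicago Giants" as well.
patterns = [
    to_pattern(f"{place} {team}", "SPORTS_TEAM")
    for place, team in product(places, teams)
]

# One JSON object per line, as expected for a .jsonl patterns file.
lines = [json.dumps(pattern) for pattern in patterns]
print("\n".join(lines[:2]))
```

Most of the combinations will never occur in the text, which is harmless: a pattern that never matches simply produces no suggestions.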

The baseline strategy of just using ner.manual to tag the first 1000-2000 entity occurrences also works quite well. If the entity density is low, you’ll need to click through a lot of sentences — but clicking through sentences with no entities is very quick, as it’s just a single “accept” decision.

Thanks Matt, that’s a lot of info. Just to make sure I understand:

  1. If I go the ner.manual route, there’s no active learning.
  2. The only way to use active learning is to specify word seeds or word rules (no way to start from existing annotation examples).
  3. The rules don’t allow for describing (“seeding”) variable-length (multi-word) entities, or entities that depend primarily on their context (rather than in-entity words or features).

Is this correct?

Is there a list of signals (features) that the Prodigy model uses to do its spotting? Just so I have a better idea of what it can/cannot learn, in principle.

Ping @honnibal – this is currently blocking, I’m unsure how / if Prodigy applies to my annotation task. Thanks.

@piskvorky Just got back from travelling – sorry about the delay.

Not quite. I think the situation’s a bit simpler than it’s coming across. Let me try again to explain.

As soon as you have an NER model trained, you can improve it with ner.teach, and make use of the active learning. You can get an initial NER model with any of the following approaches:

  1. Use ner.manual, followed by ner.batch-train
  2. Use ner.teach with a patterns file
  3. Use external annotations, and spacy train
  4. Create a custom recipe, that sets annotations in some other way.

Any of these approaches to boot-strapping will work. Taking a step back and thinking about the requirements, the goal is to get some process that suggests entities in context with high recall. If you can get to that point, you can click through the suggestions really quickly. The answers you make can then be used to train a better model, until you can finally bootstrap to the point where the model is pretty accurate, and you can create gold-standard data (with complete and correct annotations on whole sentences) very quickly. Once you’ve got a gold-standard corpus, you can train and evaluate with any NER system.

You can definitely create rules that reference longer phrases. You might find this demo of the rule-based matcher useful: . The -pt argument to ner.teach takes a jsonl file, where each line is a matcher rule. A matcher rule is a list of token patterns, where each token pattern specifies attribute/value constraints and optionally an operator. So you can write patterns that reference the annotations like part-of-speech, entity label, dependency tag etc.

Welcome back @honnibal :slight_smile:

I was specifically wondering about seeding Prodigy with existing (known) rules/annotations, to build that initial model. The answer seems to be “train a model separately with spaCy first”, since the Prodigy seeding patterns do not support context (I think).

Would you mind giving some hints as to what signals are used internally in the model? What information goes into the spotting, conceptually?

You can provide Prodigy any arbitrary Python generator — so you can definitely do that yourself. Have a look at the recipe scripts provided in the Prodigy installation, in prodigy/recipes. You can modify these or make your own custom recipes that wrap the existing ones:

I’m not sure what you’re asking. Are you wondering what features are used in spaCy’s NER model? It’s a depth-4 CNN, so in theory each word vector’s representation becomes sensitive to the surrounding context up to 4 words previous or following.
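The arithmetic behind the "up to 4 words" figure can be sketched as follows, assuming each convolutional layer mixes in one neighbouring token on each side (the window size is an assumption, not something stated here). The function names and the example sentence are made up.

```python
def context_reach(depth, window=1):
    """Tokens on each side that can influence a token's final representation,
    assuming every layer mixes in `window` neighbours per side."""
    return depth * window

def visible_indices(position, depth, n_tokens, window=1):
    # Clip the reachable range to the sentence boundaries.
    reach = context_reach(depth, window)
    return list(range(max(0, position - reach), min(n_tokens, position + reach + 1)))

tokens = "the patient was admitted after a severe heart attack last night in Boston".split()
center = tokens.index("heart")
seen = [tokens[i] for i in visible_indices(center, depth=4, n_tokens=len(tokens))]
print(seen)
```

Under that assumption, the representation of "heart" can draw on everything from "admitted" to "in", but not on "Boston".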

Hi, this has just become a blocker for me. Please excuse me if you have answered this, but what I take from @piskvorky is this: when I bootstrap using seed terms, is there a way of using sequences of tokens as seeds? So where something like the following is straightforward from the tutorials:

prodigy terms.teach … --seeds “headache, tumor”

what if I want to do something as follows:

prodigy terms.teach … --seeds “heart attack, cancer”

What I am after is active learning throwing up sequences of tokens rather than just singular tokens when annotating on the basis of my seeds. IN ADDITION, is there a built-in recipe that would take the resulting annotations and produce a patterns.jsonl?

Loving prodigy btw, very powerful.

To find similar terms, the terms.teach recipe will iterate over the vector model’s vocabulary and compare the vector of the vocabulary entry to the target vector. spaCy’s en_vectors_web_lg and many other pre-trained vectors you can download only contain single tokens so they’re not going to have a vector for “heart attack”. So your example would work in theory, but not in practice.

Maybe you’ll be able to find medical word vectors that were trained on merged noun phrases. Alternatively, you could pre-process your text with spaCy, merge phrases like “heart attack” into single tokens and then train your own vectors, e.g. using Gensim. You might also want to check out sense2vec, which shows a similar approach.

Yes, that’s the recipe! Note that this recipe currently also expects that each term is one token. If you want to create multi-token patterns, you’d have to create those yourself:

{"label": "MEDICAL_CONDITION", "pattern": [{"lower": "heart"}, {"lower": "attack"}]}

I came across the terms.train-vectors recipe…and tried the following:

prodigy terms.train-vectors ./models raw.json --spacy-model en_vectors_web_lg -la en

my raw.json is a list of dicts with a “text” field. Here is my trace:

13:06:06 - ‘pattern’ package not found; tag filters are not available for English
13:06:06 - collecting all words and their counts
Traceback (most recent call last):
File “/home/haroon/anaconda3/lib/python3.6/”, line 193, in _run_module_as_main
main”, mod_spec)
File “/home/haroon/anaconda3/lib/python3.6/”, line 85, in _run_code
exec(code, run_globals)
File “/home/haroon/anaconda3/lib/python3.6/site-packages/prodigy/”, line 331, in
controller = recipe(args, use_plac=True)
File “cython_src/prodigy/core.pyx”, line 211, in prodigy.core.recipe.recipe_decorator.recipe_proxy
File “/home/haroon/anaconda3/lib/python3.6/site-packages/”, line 328, in call
cmd, result = parser.consume(arglist)
File “/home/haroon/anaconda3/lib/python3.6/site-packages/”, line 207, in consume
return cmd, self.func(
(args + varargs + extraopts), **kwargs)
File “/home/haroon/anaconda3/lib/python3.6/site-packages/prodigy/recipes/”, line 99, in train_vectors
File “/home/haroon/anaconda3/lib/python3.6/site-packages/gensim/models/”, line 783, in init
File “/home/haroon/anaconda3/lib/python3.6/site-packages/gensim/models/”, line 759, in init
self.build_vocab(sentences=sentences, corpus_file=corpus_file, trim_rule=trim_rule)
File “/home/haroon/anaconda3/lib/python3.6/site-packages/gensim/models/”, line 936, in build_vocab
sentences=sentences, corpus_file=corpus_file, progress_per=progress_per, trim_rule=trim_rule)
File “/home/haroon/anaconda3/lib/python3.6/site-packages/gensim/models/”, line 1591, in scan_vocab
total_words, corpus_count = self._scan_vocab(sentences, progress_per, trim_rule)
File “/home/haroon/anaconda3/lib/python3.6/site-packages/gensim/models/”, line 1560, in _scan_vocab
for sentence_no, sentence in enumerate(sentences):
File “/home/haroon/anaconda3/lib/python3.6/site-packages/prodigy/recipes/”, line 33, in iter
for sent in doc.sents:
AttributeError: ‘NoneType’ object has no attribute ‘sents’

Should I be sending nlp.doc objects? Not sure how I would do that from the command line. Any help appreciated.

Also the documentation on this recipe says it can be used with a sense2vec model. Would that mean using such a model (previously trained) for the --spacy-model argument?

And finally I assume that by training my vector model with merged entities and noun phrases that will NOT result in terms.teach with seeds resulting in multi-token questions? (what I am ultimately after) Since if I understand it correctly spacy/prodigy will tokenize my seeds on whitespaces.

I think you’ve hit a gap in our error messaging here, so thanks for the traceback. We try to make sure both spaCy and Prodigy report good errors, and avoid this type of situation where you hit an arbitrary “thing doesn’t exist” sort of error at some point down the pipeline. It’s not easy to think of everything up front though, so it’ll always be a continuing process.

The particular problem here is that the en_vectors_web_lg model doesn’t have any analysis components in its pipeline. This means there’s no sentence boundary detection, and also no NER and no noun phrase identification. You’re asking for merged entities and merged NPs, which aren’t available.

The solution should be as simple as using the en_core_web_lg model instead.

Let’s say you have a sentence like I like Pink Floyd and experimental electronica. The merge NER and merge NPs options basically pre-process this into something like I like Pink_Floyd and experimental_electronica, so that you get a single token for the merged phrases. (It doesn’t insert underscores, I just did that for illustration). You’ll then have terms for Pink Floyd and experimental electronica in your vocabulary, and these terms will have a vector assigned to them. You may be suggested these whitespace-delimited terms in terms.teach, and you can use them as seed terms.
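As a toy illustration of the merging step, here is a plain-Python sketch that collapses known phrases into single tokens before vectors are trained. It is only a stand-in: in the real recipe the phrases come from the model's entity and noun phrase predictions, whereas here they are passed in by hand, and merge_phrases is a made-up helper.

```python
def merge_phrases(tokens, phrases):
    """Greedily merge known multi-word phrases into single tokens (longest first)."""
    phrase_tokens = sorted(
        (p.split() for p in phrases if p.split()), key=len, reverse=True
    )
    merged = []
    i = 0
    while i < len(tokens):
        for words in phrase_tokens:
            if tokens[i:i + len(words)] == words:
                merged.append(" ".join(words))
                i += len(words)
                break
        else:
            # No phrase starts at this position, keep the token as it is.
            merged.append(tokens[i])
            i += 1
    return merged

tokens = "I like Pink Floyd and experimental electronica".split()
merged = merge_phrases(tokens, ["Pink Floyd", "experimental electronica"])
print(merged)
```

Each merged item then counts as one vocabulary entry when the vectors are trained, which is why "Pink Floyd" can get a vector of its own.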


that fixed it! thanks much.

Got it. So my approach is going to be

1 Train my customised model with noun and entities merged.
2 Use that model in terms.teach, which will result in active learning throwing up some multitoken possibilities (since active learning ranges over the vocabulary in the model)
3 When creating a pattern.jsonl I will need my own recipe that takes my multi-token annotations and turns them into a tokenized pattern that can be consumed by prodigy’s other recipes that ask me to annotate in context.

I’ll post my process from start to finish, in case anyone finds it useful, if the above approach works.

Thanks much!

One other thing I noticed when using terms.train-vectors, not really a big deal: when you pass output_model to it in a script, it’s important to pass a pathlib.Path instance and not a string. Otherwise you get an error about output_model not having an exist attribute. This is not a problem when working from the command line.

I'm working on a similar problem and would love to hear about your approach (and get input from @honnibal or @ines if possible). My goal is to ultimately have a model that can identify a new entity of 1 to n words. Here is my approach:

  1. Created custom vectors in Gensim with merged tokens representing the entity I am hoping to create. There are now vectors representing both single- and multi-word examples of my new entity.
  2. Used set_vectors in spaCy to add those vectors to the en_core_web_sm model (since it does not have vectors out of the box).
  3. Pass the corpus I used to create the custom vectors through a custom spaCy pipeline that tokenizes the same way my vectors were tokenized (multi-word tokens representing the new entity I eventually want to create). This way I have all of those words in the Vocab associated with my updated en_core_web_sm model that now has custom vectors added. I have confirmed that my new Vocab includes these words.
  4. Save updated en_core_web_sm model to disk so it's accessible with Prodigy.

These next two steps should be the same as yours:

  5. Use terms.teach with some multi-word seed terms that are now represented as vectors in my updated en_core_web_sm.
  6. Export those multi-token annotations to a pattern.jsonl file so I can eventually use that with ner.teach and work my way up to a performant model.

The problem is that when I do step 5, Prodigy is only suggesting single-word tokens from my vocabulary. I've gone through my Vocab and verified that there are multi-word tokens in there, but terms.teach isn't suggesting them. Does any of the above look like I'm off track? Is there a better way to do this?

@tylernwatson Aaah, sorry! This problem has occurred before and I thought we'd fixed it. It's disappointing since you've really done everything right. I hope the problem hasn't cost you much time.

If you look inside the terms.teach source (provided in your Prodigy installation, but also at ), you'll find this line:

lexemes = [lex for lex in stream if lex.is_alpha and lex.is_lower]

That's inside the stream generator. So all your multi-word expressions are failing the is_alpha and possibly the is_lower check, which is what's messing you up. If you change the filtering line so that you more explicitly filter out the examples you don't want, it should all work correctly.
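To see why multi-word entries get dropped, here is a small sketch using plain strings instead of lexemes. It assumes is_alpha and is_lower behave like Python's str.isalpha and str.islower; relaxed_filter is a hypothetical replacement, and the vocabulary is made up.

```python
def default_filter(term):
    # Mirrors the is_alpha and is_lower checks from the line above.
    return term.isalpha() and term.islower()

def relaxed_filter(term):
    # Hypothetical replacement: allow spaces and capitals, but still
    # require every whitespace-separated word to be alphanumeric.
    words = term.split()
    return bool(words) and all(word.isalnum() for word in words)

vocab = ["headache", "LCD Soundsystem", "heart attack", "Gary Numan", "...", "2019-05-01"]

print([term for term in vocab if default_filter(term)])
print([term for term in vocab if relaxed_filter(term)])
```

The space alone is enough to fail the alpha check, so "heart attack" is filtered out even though it is all lower case.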

Thanks a lot @honnibal - I modified that to not worry about lex.is_alpha or lex.is_lower, and terms.teach is now doing a great job suggesting multi-word tokens in my custom model's vocabulary that are made of merged entities (band and musician names, in my case). I went through and quickly generated a few hundred annotations that I exported as a patterns file. However, when I then tried to use the patterns file with ner.teach, it seemed to be suggesting everything except what I wanted for my label BAND. Here is the format of entries in my patterns file:

{"label":"BAND","pattern":[{"lower":"LCD Soundsystem"}]}
{"label":"band","pattern":[{"lower":"Gary Numan"}]}

As I go through my source file, ner.teach is suggesting plenty of multi-word spans - they seem to be spans that are not merged multi-word tokens but rather spans of multiple tokens. The model I am using with ner.teach uses the same pipeline that preprocessed all my text before I created my custom vectors, so LCD Soundsystem should be treated as one token when it appears (along with all other band and musician names in my vocabulary). I went through over 500 annotations with ner.teach using this patterns file and rejected every single one - it appears to be offering me every token (and many multi-token spans) in my source text except for the multi-word tokens representing band names. I can't imagine that's the expected behavior. I tried this solution offered here, but since it didn't make a difference I'm guessing this has been addressed already (I'm using Prodigy v1.8.4).

Here are my hypotheses for what might be going wrong - please let me know if one of these sounds right to you or if you have other ideas:

  • There is some issue with me adding my custom vectors to en_core_web_sm and I should just add them to a blank model (seems unlikely since there are no existing vectors that they might be clashing with).
  • There is an issue with me trying to represent musician names (which could also be labeled PERSON) as well as band names (which are mostly being picked up by the NER part of my pipeline but mislabeled since BAND doesn't exist yet) with the same NER label BAND. I don't think this would result in both of those BAND "types" to be ignored during ner.teach though given that there are hundreds of examples of both in my patterns file.
  • Maybe there is something wrong with the format of my patterns file and I need to modify it. The above example looks good to me though - my source file is a corpus of music journalism, so bands are consistently punctuated and capitalized. The example of LCD Soundsystem should always be the way this appears in my corpus and the pipeline should preprocess it to be one token.

It seems like ner.teach should have at least accidentally suggested a band name at this point, and the fact that it appears to be skipping over the actual band names is really confusing. I'm at a loss as to what's happening here, so I'd appreciate your help.

I think the problem here is that the patterns still work on the token level. So vectors for the multi-word expressions may be in the vocab, but the tokenizer will still tokenize incoming text so that "LCD" and "Soundsystem" are two tokens. Your patterns are also trying to match on the Token.lower_ property (lowercase version), but the strings contain capital letters, so the patterns will never match (since there's never going to be a token whose lowercase form matches "Pulp"). So I think you want to rewrite your patterns to reflect the tokenization and property you're matching on – e.g. [{"lower": "lcd"}, {"lower": "soundsystem"}].
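A minimal sketch of that rewrite, turning accepted multi-word terms into token-level patterns. The records are hypothetical and only mimic the "text" and "answer" fields of exported annotations, term_to_pattern is a made-up helper, and the whitespace split assumes the tokenizer splits these names the same way.

```python
import json

def term_to_pattern(term, label):
    """One {"lower": ...} token pattern per whitespace-separated word, lowercased."""
    return {"label": label, "pattern": [{"lower": word.lower()} for word in term.split()]}

# Hypothetical records in the shape of exported terms annotations
annotations = [
    {"text": "LCD Soundsystem", "answer": "accept"},
    {"text": "Gary Numan", "answer": "accept"},
    {"text": "last Tuesday", "answer": "reject"},
]

# Only accepted terms become patterns, all with the same label spelling.
patterns = [
    term_to_pattern(eg["text"], "BAND")
    for eg in annotations
    if eg["answer"] == "accept"
]

for pattern in patterns:
    print(json.dumps(pattern))
```

Using one fixed label string also avoids ending up with both "BAND" and "band" in the same patterns file.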

Hi there,

I am having a slightly different problem to the above. I created a few match patterns to use with ner.match, including patterns with multiple tokens. However, when I run the patterns through the recipe, those with multiple tokens do the following:

For example, with the term "chicken schnitzel":

It highlights the word "chicken", I press accept. Then in the same text highlights "chicken schnitzel".

I want it to find both "chicken" and "schnitzel" and "chicken schnitzel", but if "chicken schnitzel" exists, I would want to pick that over the individual tokens. Would running word vectors with merge ents and merge noun phrases work for this? And then ner.match with the trained model?

I'm not sure that changing anything about the vectors would help here. What the ner.match recipe highlights comes down to the patterns – so if you have both "chicken" and "chicken schnitzel" in there, you'll see both matches. Some ways you could deal with this:

  • If you see "chicken" first and then "chicken schnitzel", hit undo, go back, reject "chicken" and accept "chicken schnitzel". The matches are typically sorted, so you'll usually see them all in a row, which makes this easier.
  • Post-process your data and filter out the overlapping matches. You can use the input hash to find all examples with the same text, and compare the start/end offsets of the matches to find the longest match if they overlap. You can also use spaCy's doc.char_span and util.filter_spans helper to do this for you.
  • Write a custom recipe that uses your patterns to find all matches in the data and automatically filters out overlaps (instead of showing you both options). The approaches involving filtering only work, though, if you know that you always want the longest (or shortest etc.) match.
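The overlap filtering could be sketched like this in plain Python. It works in the spirit of util.filter_spans but on (start, end, label) character offsets rather than spaCy Span objects, and filter_longest is a made-up name.

```python
def filter_longest(spans):
    """Keep the longest spans, dropping any span that overlaps one already kept.

    Spans are (start, end, label) tuples with an exclusive end offset.
    """
    # Longest first; ties go to the span that starts earlier.
    by_length = sorted(spans, key=lambda s: (s[1] - s[0], -s[0]), reverse=True)
    kept = []
    taken = set()
    for start, end, label in by_length:
        if not any(i in taken for i in range(start, end)):
            kept.append((start, end, label))
            taken.update(range(start, end))
    return sorted(kept)

text = "I ordered chicken schnitzel and a chicken salad"
matches = [(10, 17, "FOOD"), (10, 27, "FOOD"), (34, 41, "FOOD")]
filtered = filter_longest(matches)
print([(text[start:end], label) for start, end, label in filtered])
```

The first "chicken" is dropped in favour of "chicken schnitzel", while the second "chicken" survives because nothing longer overlaps it.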

Thanks, @ines.

I used spaCy's util.filter_spans outside of Prodigy and it works perfectly. Would there be any way to add it to the ner.match recipe?

Alternatively, I could use it outside of SpaCy, import in the filtered annotations from the terms, and perhaps train a model on these and then use that model with ner.make-gold? Something like that?

Would I need the vectors with ents merged and noun phrases merged to view the multi-token predictions in make-gold or teach? I'm having some issues with terms.train-vectors... I get a value error:

"ValueError: [E102] Can't merge non-disjoint spans. 'the' is already part of tokens to merge. 
If you want to find the longest non-overlapping spans, you can use the util.filter_spans helper:"

For anyone reading this thread, this custom recipe is what I was looking for to incorporate the multi-word patterns into a training set: NER Training for Corporate Names. I didn't realize you could use spaCy's EntityRuler with make-gold... very cool!
