Train a new NER entity with multi-word tokens

I’m not sure whether I’m dealing with a bug or doing something wrong. But if it’s the latter, you might be interested in why I’m doing this, so I’ll write a fairly verbose message here that lets you follow my thought process.

My goal is to train a new NER entity with the name DISASTER which will recognize for example floods, storms and volcano eruptions. Yesterday I followed your video in which you train a DRUG entity and got the results I wanted in the end. But now I’m struggling when the pattern of an entity consists of several words, like ‘volcano eruption’.

My first attempt was to create a patterns file with those multi-word tokens

{"label":"DISASTER","pattern":[{"lower":"volcano eruption"}]}
{"label":"DISASTER","pattern":[{"lower":"volcanic eruption"}]}
{"label":"DISASTER","pattern":[{"lower":"volcanic ash"}]}

and use this to train the entity on news articles

prodigy ner.teach disasters_ner en_core_web_lg "eruption" --api guardian --label DISASTER --patterns volcano_patterns.jsonl

It finds the word ‘ash’ in news articles, but not the multi-word token.

So I dove deeper into the spaCy documentation and found that those patterns need a different representation, and updated volcano_patterns.jsonl to

{"label":"DISASTER","pattern":[{"lower":"volcano"}, {"is_space": true}, {"LOWER": "eruption"}]}
{"label":"DISASTER","pattern":[{"lower":"volcanic"}, {"is_space": true}, {"LOWER": "eruption"}]}

However this still doesn’t recognize the multi-word tokens.

Then I remembered that I read about ner.manual and thought that I could use this to generate the valid patterns for the teaching. So I ran

 prodigy ner.manual volcano_patterns en_core_web_lg "eruption" --api guardian --label "DISASTER"

and in there labelled multi-word tokens like ‘volcanic ash’ as a disaster. I did this for a few articles and saved the annotations. I thought the next step would be to generate a patterns file from the newly created annotations, so I ran:

(py3) ~/projects/tripler/data-analysis/spacy-ner (master): prodigy terms.to-patterns volcano_patterns
{"label": null, "pattern": [{"lower": "Bali: Mount Agung volcano monitored after second eruption"}]}
{"label": null, "pattern": [{"lower": "Asp \u2013 or ash? Climate historians link Cleopatra's demise to volcanic eruption"}]}
{"label": null, "pattern": [{"lower": "Bali volcano eruption could be hours away after unprecedented seismic activity"}]}
{"label": null, "pattern": [{"lower": "Bali: travel warning issued as volcano threatens to erupt"}]}
{"label": null, "pattern": [{"lower": "Mount Agung: Bali airport closed as volcano alert raised to highest level"}]}

I would have expected the label to be DISASTER and the pattern to contain only the part of the text that I had marked, like for example ‘volcanic eruption’. I have the feeling that I did not really understand how to correctly use ner.manual.

Can you please point me in the right direction?

Best, Stephan


Hi Stephan – the good news is, your idea sounds feasible and your approach makes sense. You also got the “hard parts” right – but there are a few small issues that caused it not to work:

The reason [{"lower":"volcano eruption"}] doesn’t match is that each dictionary describes one token. So your instinct to split the tokens into multiple dictionaries was correct – but the additional whitespace token isn’t actually necessary. spaCy’s tokenizer splits on whitespace characters; while it preserves them in the .text_with_ws attribute to make sure no information is lost, they don’t usually end up as tokens of their own unless there are several in a row. The term “volcano eruption” is therefore tokenized as ['volcano', 'eruption'], so your pattern will have to look like this:

[{"lower": "volcano"}, {"lower": "eruption"}]

Adding a {"is_space": true} token means that spaCy will look for a token "volcano", followed by a whitespace token, followed by "eruption", which is almost never the case. So it would match "volcano \n eruption", but not "volcano eruption".

Because the patterns depend on spaCy’s tokenization, you can verify them by running the text through spaCy’s tokenizer, and looking at the individual tokens it produces:

>>> doc = nlp(u"volcano-eruption")
>>> [token.text for token in doc]
['volcano', '-', 'eruption']
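You can also sanity-check a token pattern directly with spaCy’s Matcher – a minimal sketch, assuming spaCy v3’s Matcher.add signature and a blank pipeline (no trained model needed):

```python
import spacy
from spacy.matcher import Matcher

nlp = spacy.blank("en")  # blank pipeline – only the tokenizer is needed
matcher = Matcher(nlp.vocab)
# One dict per token: 'volcano' followed immediately by 'eruption'
matcher.add("DISASTER", [[{"LOWER": "volcano"}, {"LOWER": "eruption"}]])

doc = nlp("The Bali volcano eruption could be hours away.")
matches = [doc[start:end].text for match_id, start, end in matcher(doc)]
print(matches)  # ['volcano eruption']
```

If the pattern is consistent with the tokenizer’s output, the match spans both tokens.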

Alternatively, Prodigy also supports spaCy’s PhraseMatcher – so instead of token patterns, you can include strings. Internally, those will be converted to Doc objects, so you won’t have to worry about spaCy’s tokenization. You can find more about the patterns file format in the “Match patterns” section of your PRODIGY_README.html.
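As a rough sketch of the PhraseMatcher route (again assuming spaCy v3’s API), where the patterns are plain strings tokenized by the pipeline itself:

```python
import spacy
from spacy.matcher import PhraseMatcher

nlp = spacy.blank("en")
matcher = PhraseMatcher(nlp.vocab, attr="LOWER")  # match on the lowercase form
patterns = [nlp.make_doc(text) for text in ["volcano eruption", "volcanic ash"]]
matcher.add("DISASTER", patterns)

doc = nlp("Volcanic ash closed the airport after the volcano eruption.")
matches = [doc[start:end].text for match_id, start, end in matcher(doc)]
print(matches)
```

Because the pattern strings go through the same tokenizer as the text, the tokenization always lines up.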

{"label": "DISASTER", "pattern": "volcano eruption"}

The terms.to-patterns recipe is mostly intended to convert a list of seed terms – so it expects each example to contain a "text" key with the term that should be included in the pattern. This is usually the case if you create your seed terms from word vectors with terms.teach.

Your approach is pretty clever, though! To make it work, you can either check out the source of prodigy/recipes/ and rewrite the recipe to take the text of each entry in the "spans" – or you can export the annotations you’ve created with ner.manual to a JSONL file, convert them, save them out and import them to a new dataset, which you can then convert to patterns using:

terms = []
for eg in examples:  # the annotations created with ner.manual
    spans = eg.get('spans', [])  # get the annotated spans
    for span in spans:
        text = eg['text'][span['start']:span['end']]  # the highlighted text
        terms.append({'text': text, 'label': span['label']})

You can then save out your terms to JSONL and add them to a new dataset using the db-in command. You’ll then have a set in the same format that’s usually produced when creating the seed terms – for example {'text': 'volcano eruption', 'label': 'DISASTER'}.


Hello @ines

thanks a lot for your help. I really appreciate how much time you take to give a detailed answer. That was helpful and I managed to solve my problem by writing a recipe that allows me to export patterns from annotations created via ner.manual.

I will post here the notes to myself that I collected during the process in case someone else runs into the same problem:


First we have to come up with terms that represent disasters. So I created a new dataset (disaster_terms), and then we can use Prodigy to find the right terms. I initialized it with the terms from the Triple R database, and Prodigy will use the GloVe model to suggest similar terms.

prodigy dataset disaster_terms "Seed terms for DISASTER"
prodigy terms.teach disaster_terms en_core_web_lg --seeds "blackout,flood,earthquake,tsunami,storm,fire,avalanche,thunderstorm,flooding,volcano,blizzard"

Saved 150 annotations to database SQLite
Dataset: disaster_terms
Session ID: 2018-01-17_15-22-24

But I also noticed that some of the volcano terms consist of multiple words, like volcanic eruption. I will use the ner.manual recipe to get more of those words.

prodigy ner.manual volcano_terms en_core_web_lg "eruption" --api guardian --label "DISASTER"

Export the patterns for teaching

Normally I could simply use the terms.to-patterns recipe to export terms created by terms.teach. However, the multi-word terms are stored in a different format, so I had to write a little exporter recipe myself.

@recipe('terms.manual-to-patterns')
def to_patterns(dataset=None, label=None, output_file=None):
    """
    Convert annotations created with ner.manual to a list of match patterns
    that can be used with ner.match. If no output file is specified, each
    pattern is printed so the recipe's output can be piped forward.
    """
    def get_patterns(term):
        return [{
            'label': span['label'],
            'pattern': term['text'][span['start']:span['end']]
        } for span in term['spans']]

    log("RECIPE: Starting recipe terms.manual-to-patterns", locals())
    if dataset is None:
        log("RECIPE: Reading input terms from sys.stdin")
        terms = (json.loads(line) for line in sys.stdin)
    else:
        terms = DB.get_dataset(dataset)
        log("RECIPE: Reading {} input terms from dataset {}"
            .format(len(terms), dataset))
    if output_file:
        pattern_lists = [get_patterns(term) for term in terms
                         if term['answer'] == 'accept']
        patterns = sum(pattern_lists, [])
        log("RECIPE: Generated {} patterns".format(len(patterns)))
        write_jsonl(output_file, patterns)
        prints("Exported {} patterns".format(len(patterns)), output_file)
    else:
        log("RECIPE: Outputting patterns")
        for term in terms:
            if term['answer'] == 'accept':
                for pattern in get_patterns(term):
                    print(json.dumps(pattern))

So here’s the final step to prepare all patterns for the teaching.

prodigy terms.to-patterns disaster_terms disaster_patterns.jsonl --label DISASTER
prodigy terms.manual-to-patterns volcano_terms volcano_patterns.jsonl
cat disaster_patterns.jsonl volcano_patterns.jsonl > all_disaster_patterns.jsonl

Nice thread – thank you for this nice discussion and for taking the time to be detailed. It really helped me out, as I want to add a DISEASE label. I did about 2000 annotations, exported them to JSONL, and discovered only single words, while I have diseases such as:

  • Primary Ciliary Dyskinesia
  • Pulmonary Arterial Hypertension
  • Pulmonary Hypertension

With abbreviations such as PAH or PCD

I will try this approach thank you again.

@idealley Thanks, I'm glad this thread was helpful! Definitely keep us updated on your progress – your use case sounds very interesting and I'd be curious to hear how you go.

For the abbreviations, you could also try a more general approach and use a pattern like [{"shape": "XXX"}] (token with 3 uppercase letters). This will naturally yield some false positives, but it'll let you annotate both correct matches, as well as tokens that look similar to disease abbreviations but aren't. I remember we tried this approach in a demo for a similar medical category and it worked surprisingly well.
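For reference, the shape attribute maps letters to X/x and digits to d, so a quick way to preview what [{"shape": "XXX"}] would match is something like this sketch (no trained model required, hypothetical sentence):

```python
import spacy

nlp = spacy.blank("en")  # lexical attributes don't need a trained model
doc = nlp("PAH and PCD are abbreviations, but NASA has four letters.")
# token.shape_ is 'XXX' only for exactly three uppercase letters
three_caps = [t.text for t in doc if t.shape_ == "XXX"]
print(three_caps)  # ['PAH', 'PCD']
```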

A tool that might also be helpful when working with patterns (especially if you're creating them manually) is our new Matcher Explorer demo. It lets you create token patterns interactively (with all available attributes) and test them against a text:

Hello @ines, I will definitely try the shape option and the explorer.

I will also share my progresses.

I have to say that the training works well with abbreviations and multi-word patterns.

I am using the basic English spaCy model, as Prodigy did not want to load my custom PubMed word2vec vectors (I have described the problem in another thread, something related to vectors).

I did not yet try the shape that you have proposed.


Thanks so much for sharing this. In case anyone else runs into issues w/ imports, here’s what worked for me at the top of the file (in addition to the -F command line option for running it that I found here):

from prodigy import recipe, recipe_args
from prodigy.components.db import connect
from prodigy.util import log, prints, write_jsonl
import sys
import json

DB = connect()

and for my annotations (where I had accepted examples with no entities in them), I also needed to skip records without a spans entry to avoid a KeyError:

    def get_patterns(term):
        if 'spans' not in term:
            return []
        return [{
            'label': span['label'],
            'pattern': term['text'][span['start']:span['end']]
        } for span in term['spans']]

I too have to thank you!

We’re using custom labels, like EjectFromCar to tag phrases like:


PhraseMatcher looks to be ideal for this task. I had thought about NER, but the fact that the named entity can span multiple tokens seemed to be a bit of an issue. Fortunately, we have a lot of manually annotated examples, so hopefully this will create a model with a fair degree of accuracy.

Hi Stephan,
Can you please post what your disaster_terms data set looked like?



I am struggling with two-word patterns. I have run through this current thread and wrote a small script to convert my actual pattern file, which looked like this:

{"label": "plantname", "pattern": [{"lower": "Agrostis capillaris"}]}
{"label": "plantname", "pattern": [{"lower": "Agrostis castellana"}]}

to this:

{"label": "plantname", "pattern": [{"lower": "Acaena"}, {"lower": "anserinifolia"}]}
{"label": "plantname", "pattern": [{"lower": "Acaena"}, {"lower": "novae-zelandiae"}]}

I then used ner.teach as follows;
prodigy ner.teach plant_ner en_core_web_md datafile.txt --label plantname --patterns plant_patterns.jsonl

However, it seems like I can only annotate one word, and not two!

Any help would be greatly appreciated.

The problem here is that your patterns are both looking for a first token whose lowercase form matches "Acaena" – which will never be true, because the string starts with a capital letter. So you can either use "acaena", or, if you want case-sensitive matches, use the "orth" key to match the exact orthographical string.

I’d also recommend double-checking the tokenization to make sure it matches the tokenization of the model you’re using. Remember that each dictionary in the pattern describes one token – but if the spaCy model you use tokenizes the text differently than the pattern, the pattern might never match.

For example, I just tested the second example in the small English model and by default, it is split into 4 tokens:

import spacy

nlp = spacy.load('en_core_web_sm')
doc = nlp("acaena novae-zelandiae")
print([token.text for token in doc])
# ['acaena', 'novae', '-', 'zelandiae']

So for this case and tokenization, the pattern would have to look like this:

{"label": "plantname", "pattern": [{"lower": "acaena"}, {"lower": "novae"}, {"orth": "-"}, {"lower": "zelandiae"}]}

If case-insensitive and explicit token-based patterns aren’t that important for your use case, you could also write exact string match patterns instead. For example, the following will match the exact strings “Acaena novae-zelandiae” and “acaena novae-zelandiae” (but only those).

{"label": "plantname", "pattern": "Acaena novae-zelandiae"}
{"label": "plantname", "pattern": "acaena novae-zelandiae"}
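If you do want string-style patterns that ignore case entirely, one option (assuming spaCy v2.1+ and the v3 add signature) is to have the PhraseMatcher compare the LOWER attribute instead of the exact text – a sketch with a hypothetical sentence:

```python
import spacy
from spacy.matcher import PhraseMatcher

nlp = spacy.blank("en")
# attr="LOWER" compares lowercase forms, so any casing of the phrase matches
matcher = PhraseMatcher(nlp.vocab, attr="LOWER")
matcher.add("plantname", [nlp.make_doc("Acaena novae-zelandiae")])

doc = nlp("We recorded ACAENA NOVae-zelandiae on the ridge.".upper().capitalize() if False else "We recorded ACAENA NOVAE-ZELANDIAE on the ridge.")
hits = [doc[start:end].text for match_id, start, end in matcher(doc)]
print(hits)  # ['ACAENA NOVAE-ZELANDIAE']
```

This keeps the convenience of string patterns without hard-coding every casing variant.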

Thanks a lot. It worked now.



@ines @honnibal I have a requirement wherein I need to annotate only substring of a given pattern for an NER activity.

e.g. for the example below, I need to tag “FistName Lastname” in my dataset to perform NER. I have created the patterns file below, which detects the entire length of the pattern – but I’m not sure how to filter it further to extract only the first name and last name.

Current Pattern File: {"label":"HEADER_1", "pattern":[{"LOWER": "our"}, {"LOWER": "trader"}, {"IS_SPACE": true, "OP": "*"}, {"IS_ALPHA": true}, {"IS_ALPHA": true, "OP": "+"}]}

ContactPerson: FistName Lastname @ ABC Holdings Pte Ltd

spaCy's Matcher doesn't currently support something like a "look-around" in the patterns, like you have in regular expressions. It's a feature that a few people have requested, so we'll probably add it sooner or later. But for now the best solution is to work around this in your recipe.

If you use a custom recipe, it should be easy to edit the spans in the stream. You can see an example of a custom recipe that uses the matcher here:

I haven't tested this, but off the top of my head something like the following should work. Here I'm assuming you make match patterns that identify the contexts you want to strip away. There's surely a cleaner way to figure out the tokens to remove than using a set, but I think this should work. You'll probably have some off-by-ones from spaces to consider, though.

from spacy.tokens import Doc

def edit_spans_in_stream(context_matcher, stream):
    for eg in stream:
        tokens = [token["text"] for token in eg["tokens"]]
        doc = Doc(context_matcher.vocab, words=tokens)
        context_chars = set()
        for match_id, start, end in context_matcher(doc):
            # Collect the character indices that belong to
            # the surrounding context
            span = doc[start:end]
            for i in range(span.start_char, span.end_char):
                context_chars.add(i)
        # Edit the example's entity spans to exclude the context characters
        for span in eg["spans"]:
            while span["start"] in context_chars:
                span["start"] += 1
            while span["end"] in context_chars:
                span["end"] -= 1
        yield eg