NER or PhraseMatcher?

I am attempting to do text classification on customer relations emails. We’re in the transport space and have many known rail/tram stations, bus stops, bus routes, etc. Since these are known things, I’m thinking we should use phrase matching to just pluck them out. However, I wonder if I’m missing something, particularly in relation to abbreviations, spelling mistakes and bus routes (which are typically just numbers). Should I be looking at POS tagging or NER too?

I realise it isn’t strictly Prodigy related, but I want to understand if it can help with this problem.

Hi! This is actually a really good question. Statistical models are useful if your application needs to generalise based on context. But we always advocate carefully considering whether a use case really needs a model, or whether it might be better off with a rule-based system, or even a combination of both.

The two main arguments for a rule-based approach are:

  • you already have a large lexicon of terms
  • there’s a more or less finite number of instances of a given category (train stations are a good example of that)
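As a rough illustration of the lexicon-based approach, here is a minimal sketch using spaCy's PhraseMatcher. It assumes a spaCy v3-style `matcher.add` signature, and the station names are made up for the example:

```python
import spacy
from spacy.matcher import PhraseMatcher

# Blank English pipeline: tokenizer only, no trained model needed.
nlp = spacy.blank("en")

# Hypothetical lexicon entries; replace with your own station list.
stations = ["Flinders Street", "Southern Cross", "Richmond"]

# attr="LOWER" makes the match case-insensitive.
matcher = PhraseMatcher(nlp.vocab, attr="LOWER")
matcher.add("STATION", [nlp.make_doc(name) for name in stations])

doc = nlp("My train from flinders street to Richmond was cancelled again.")
found = [doc[start:end].text for _, start, end in matcher(doc)]
print(found)
```

Matching on the lowercase form covers casing differences, but it won't catch abbreviations or misspellings, which is where the combination with a statistical model comes in.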

If misspellings and spelling variations are important, you might be able to improve your results by combining a rule-based system with a statistical model. spaCy’s entity recognizer respects pre-defined entities (e.g. set manually by previous pipeline components) and will use them as constraints for its predictions. So if you’ve trained a model to detect public transport stops and add it after your rules, it will only find and predict entities in the text that haven’t been labelled yet, potentially helping you find entities your rules missed. My pull request here shows a built-in pipeline component for this which we’re planning on shipping with spaCy v2.1.x.
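In current spaCy releases the component is available as `entity_ruler`. The sketch below assumes the spaCy v3 API (the v2.1 API discussed in this thread differs), and the patterns and labels are invented for illustration. It uses a blank pipeline, so only the rules assign entities here:

```python
import spacy

nlp = spacy.blank("en")

# spaCy v3 style. With a trained pipeline you would pass before="ner"
# so the entity recognizer sees the rule-based entities as constraints.
ruler = nlp.add_pipe("entity_ruler")
ruler.add_patterns([
    # Exact phrase pattern for a known station (hypothetical example).
    {"label": "STATION", "pattern": "Southern Cross"},
    # Token pattern for bus/tram routes that are just numbers.
    {"label": "ROUTE", "pattern": [{"LOWER": "route"}, {"IS_DIGIT": True}]},
])

doc = nlp("The route 86 tram never showed up at Southern Cross.")
entities = [(ent.text, ent.label_) for ent in doc.ents]
print(entities)
```

The token pattern is one way to handle routes that are plain numbers: anchoring on a context word like "route" avoids labelling every number in the text.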

You might find that this is a pipeline you don’t want to run in your production system and only in development to validate and improve your rules. Periodically, you could train an NER model on data extracted by your rules, run it over unseen text and then compare the output to see if your model turned up entities your rules don’t currently cover.
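The comparison step could look something like the sketch below. It's plain Python and assumes you've already collected `(start_char, end_char, label)` tuples from both the rules and the model for the same text; the function name and the sample values are made up:

```python
def missed_by_rules(rule_spans, model_spans):
    """Return model-predicted spans that no rule-based span overlaps."""
    def overlaps(a, b):
        # Half-open character ranges overlap if each starts before the other ends.
        return a[0] < b[1] and b[0] < a[1]

    return [
        m for m in model_spans
        if not any(overlaps(m, r) for r in rule_spans)
    ]

# (start_char, end_char, label) tuples; hypothetical values for illustration.
rule_spans = [(10, 25, "STATION")]
model_spans = [(10, 25, "STATION"), (40, 52, "STATION")]

candidates = missed_by_rules(rule_spans, model_spans)
print(candidates)
```

Anything that ends up in `candidates` is worth a manual look: it's either a model error or a gap in the rules.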

This approach could also work well if you wanted to branch out to a new city. If you have enough data, you could easily bootstrap a system that can detect different public transport stops that are used in the same or very similar contexts.

If you’re doing information extraction, your application could also take advantage of the other statistical components like the tagger and parser. IMO, these capabilities are often underappreciated. For example, let’s say you also need to extract whether the person is talking about being at a station, going to a station etc. Most of the clues you need for this will be in the syntax, i.e. the dependency parse and the part-of-speech tags. Here’s an example in the displaCy visualizer:

The nice thing here is that you won’t even have to train your own specific categories. spaCy’s tagger and parser are both pretty accurate out-of-the-box. If needed, you can use the Prodigy recipes pos.teach and dep.teach to fine-tune the pre-trained model, in case your data includes constructions that weren’t so frequent in the training data. You can then use spaCy to iterate around the tree and extract the information you need relative to the entities detected by your rules or entity recognizer.
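To make the "iterate around the tree" idea concrete, here is a small sketch. So that it runs without downloading a model, the parse is written by hand using the spaCy v3 `Doc` constructor; in practice the heads and labels would come from the trained parser. The helper name `relation_to_entity` is hypothetical:

```python
import spacy
from spacy.tokens import Doc, Span

nlp = spacy.blank("en")

# Hand-written parse for illustration only; a trained pipeline would
# normally predict the heads and dependency labels.
words = ["I", "am", "going", "to", "Central", "Station"]
heads = [2, 2, 2, 2, 5, 3]  # absolute index of each token's head
deps = ["nsubj", "aux", "ROOT", "prep", "compound", "pobj"]
doc = Doc(nlp.vocab, words=words, heads=heads, deps=deps)
doc.ents = [Span(doc, 4, 6, label="STATION")]

def relation_to_entity(ent):
    """Return (verb, preposition) governing an entity, or None."""
    root = ent.root                # syntactic head of the entity span
    if root.dep_ == "pobj":        # entity is the object of a preposition
        prep = root.head
        return prep.head.text, prep.text
    return None

relations = [relation_to_entity(ent) for ent in doc.ents]
print(relations)
```

The verb and preposition pair ("going", "to") is the kind of clue that distinguishes travelling to a station from being at one.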


Hi Ines

Thank you for your brilliantly detailed reply. I am still processing it all!

The PR looks like the perfect place to start for us. I will try it out!

Thank you very much.


Hi Ines

I’ve built the PR locally and have been trying out EntityRuler. It’s great! However, if I use a label/entity type that it hasn’t seen before, Python terminates with SIGSEGV (Address boundary error) errors.

Your PR test uses new labels, so I’m not sure if I’m doing something wrong or found a bug. Can you help?

Cheers Patrick

Yay, nice to hear that the new component is useful!

That’s strange and definitely shouldn’t be happening! C-level errors like this are always particularly interesting, since they point to deeper bugs. Even if you technically “did something wrong”, spaCy should always fail more gracefully than that.

Do you have a minimal test case that shows the problem? Are you using the component with a pre-trained model? And if so, did you download one of the new alpha models for v2.1.0? (The PR targets the develop branch, i.e. the upcoming spaCy v2.1.0.)

I will put something together. But basically I cloned, checked out develop, merged the PR, built it, ran python install, removed spaCy 2.0 and downloaded the alpha models. Doing this also breaks the Prodigy ner.* recipes.

It terminates with alpha models. With pre-trained models I get a traceback.

```
  File "pipeline.pyx", line 1061, in spacy.pipeline.TextCategorizer.Model
TypeError: Model() takes exactly 2 positional arguments (1 given)
```

This looks correct – however, keep in mind that spaCy v2.1.0 includes various changes to the models and we haven’t really tested it with Prodigy yet. So if you want to use the new EntityRuler, I’d suggest doing it in an isolated environment and just using it from spaCy directly. If you still get an error here, I’d definitely be interested in a test case so I can have a look and see what’s wrong. (You can also post that on the spaCy issue tracker directly if you like.)

Hi Patrick, hi Ines,

If I understand this correctly, PhraseMatcher before NER would transform matched words/phrases into an entity and NER will not touch it afterwards. Is it possible to get this information into the statistical model?
Let’s assume you want to detect both the name of the customer and a street name. Let’s further assume that the name is “Karl Meyer” and the street name is “Richard Wagner Straße”. PhraseMatcher with a big list of male and female names as well as common last names would match “Karl” and “Richard” plus “Wagner” and “Meyer”. A gazetteer of street names would also tag the street correctly. The statistical model could use this information to make better decisions.


Yes, that’s correct. The entity recognizer will respect already existing entity spans set by previous pipeline components. Their boundaries are used as constraints for the model’s predictions.

If you have good rules, you could also use them to bootstrap training data for your model and improve the entity recognizer, without having to label anything from scratch. This would then allow you to go beyond your rules and be able to label, say, “May-Ayim-Ufer 9”, even if none of those components were part of your gazetteer. Here’s an example of a possible workflow:

  1. Create gazetteers for your categories and write rules to handle ambiguity (e.g. “Richard Wagner” vs. “Richard Wagner Straße”).
  2. Add your rule-based component to your spaCy pipeline, parse lots of text and extract the text plus entities.
  3. Load the data into Prodigy and run ner.manual to see the entities and correct them if necessary. If your rules are good and 90% accurate, you only have to change something in about 10% of the cases. So this should be super quick.
  4. Use the created data as gold-standard training data for your model.

Did anyone mention here before, that you two do a really great job?

Great job and thank you for that! :wink:


Absolutely amazing job! However, more real-world examples and workflows (on GitHub or in blog posts) would be very helpful, as most of the questions are common.

Trying to figure out if there is a straightforward way of accomplishing this step:
extract the text plus entities

Tried this because it didn’t seem very hard:

doc = nlp(text)

spans = [
    dict(label=ent.label_, start=ent.start, end=ent.end)
    for ent in doc.ents
]

return json.dumps(dict(text=text, spans=spans))

When I run $ prodigy ner.manual I get the following error:

ValueError: Mismatched tokenization. Can’t resolve span to token index 1. This can happen if your data contains pre-set spans. Make sure that the spans match spaCy’s tokenization or add a ‘tokens’ property to your task.

I can see that I am missing the tokens from the README, but before I tackle that, I figured it would be worth asking since it seems like it would be something potentially available or I am just approaching this completely incorrectly.


Your workflow sounds correct and by default, Prodigy will take care of tokenizing the text for you. The add_tokens preprocessor will use the existing data and if no "tokens" are present, it will try to align the existing span annotations with the tokens present in the data.

The error you’ve encountered happens if it can’t manage to do this, because none of the tokens map exactly to the character offsets defined in your data. For example, let’s say your annotations look like this:

{"text": "The order number is ID-12345", "spans": [{"start": 23, "end": 28, "label": "ID_NUMBER"}]}

Essentially, you’ve labelled the string "12345" as an ID_NUMBER. The problem is that when you tokenize the text with spaCy, the text isn’t actually split in a way that would make "12345" its own token:

import spacy

nlp = spacy.load('en_core_web_sm')
doc = nlp("The order number is ID-12345")
print([token.text for token in doc])
# ['The', 'order', 'number', 'is', 'ID-12345']

A case like this would then cause the “mismatched tokenization” error. The reason Prodigy lets you know about this is that a) it won’t be able to render the existing annotations in ner.manual and b) you won’t be able to easily train a model from it out-of-the-box that performs the way you expect it to.
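If you want to check your own data for this before loading it into Prodigy, one option is spaCy's `Doc.char_span`, which returns `None` when the character offsets don't line up with token boundaries. A minimal sketch, using a blank English pipeline (tokenizer only) as a stand-in for whatever model you load:

```python
import spacy

nlp = spacy.blank("en")  # tokenizer only
text = "The order number is ID-12345"
doc = nlp(text)

# Doc.char_span returns None when the offsets don't fall on token boundaries.
misaligned = doc.char_span(23, 28, label="ID_NUMBER")  # covers "12345"
aligned = doc.char_span(20, 28, label="ID_NUMBER")     # covers "ID-12345"

print(misaligned)
print(aligned.text)
```

Running a check like this over all spans in a dataset gives you a list of examples that would trigger the error.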

If you updated the default English model with the example above, it could correctly learn that a token "12345" in that context is likely to be an ID_NUMBER. However, it might never actually come across a token like this, because the tokenizer doesn’t split the text accordingly.

That’s why Prodigy tries to let you know early on if your tokenization doesn’t match the expected output. One solution would be to update spaCy’s tokenization rules to match your expected tokenization. Tokenizer rules are serialized with the model, so you can save out the nlp object using nlp.to_disk() and load that modified model with Prodigy. Alternatively, if you just want to label things and don’t care about spaCy’s tokenization, you can also provide the "tokens" property on your task that tells Prodigy how the text should be tokenized and rendered.
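Here's one way the tokenizer update could look. This is a sketch, not the only way to do it: it adds a single extra infix rule (my own assumption about what you'd want for strings like "ID-12345") on top of the English defaults, using a blank pipeline:

```python
import spacy
from spacy.util import compile_infix_regex

nlp = spacy.blank("en")

# Extra infix pattern (an assumption for this example): split on a hyphen
# that sits between a letter and a digit, as in "ID-12345".
infixes = list(nlp.Defaults.infixes) + [r"(?<=[A-Za-z])-(?=[0-9])"]
nlp.tokenizer.infix_finditer = compile_infix_regex(infixes).finditer

doc = nlp("The order number is ID-12345")
tokens = [token.text for token in doc]
print(tokens)
```

After a change like this, you'd save the pipeline with nlp.to_disk() and point Prodigy at the saved directory, so that annotation and training use the same tokenization.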

Thanks @ines for the information. I figured Prodigy would handle the tokens appropriately.

My problem was much simpler: I used the start/end attributes on the entity instead of the start_char/end_char attributes. A little coffee, confidence, and sleep goes a long way…

spans = [
    dict(label=ent.label_, start=ent.start_char, end=ent.end_char)
    for ent in doc.ents
]

return json.dumps(dict(text=text, spans=spans))

Works great now!

thanks, Ian

Ohhh, yes, I didn’t notice that either! :woman_facepalming: Glad it all works now!
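For anyone else who runs into this: a span's `start`/`end` are token indices, while `start_char`/`end_char` are character offsets into the text, and Prodigy's "spans" expect the latter. A quick sketch of the difference, with a manually created span on a made-up sentence:

```python
import spacy
from spacy.tokens import Span

nlp = spacy.blank("en")
doc = nlp("Meet me at Central Station")
ent = Span(doc, 3, 5, label="STATION")  # tokens 3 and 4

token_offsets = (ent.start, ent.end)           # token indices
char_offsets = (ent.start_char, ent.end_char)  # character offsets
print(token_offsets, char_offsets)

# Slicing the raw text only works with the character offsets.
assert doc.text[ent.start_char:ent.end_char] == ent.text
```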


I ran into a strange case that gives me a segmentation fault in EntityRuler. I posted this to the existing GH thread here. Let me know if you want me to open a new bug ticket:


@imaurer Thanks! A separate issue on the spaCy tracker might be nice, yes – the entity ruler is still experimental, so it’s definitely possible that there are a few bugs that still need to be addressed.

Created bug ticket:

The Entity Ruler does work for the roughly 380k other patterns that I currently have, so it certainly is functional. This was a really odd case that I took out for now.
