Questionable results from NER - we must be doing something wrong

My main questions are at the bottom, the rest is context for the questions.

So, we just had a two-week sprint where we tried out a bunch of new stuff (spaCy and Prodigy, among other things).

To sum up, NLP is hard. But I have a feeling that we probably made a few rookie mistakes, since we are rookies at NLP.

A few guys from my team were tasked with information extraction from legal documents. The information we wanted to extract was:

  • Manufacturer Serial Number of an aircraft (MSN). Example: “MSN 1298”
  • Contract numbers. Example “la120229.4311”, often placed in particular places surrounded by similar words, from doc to doc.
  • Contractual date - often a document has several dates, but we wanted to identify a particular date, usually placed in relation to signature boxes etc.
  • Mention of specific parties, who is entering into the agreement. This is sometimes pretty obscure company names, with weird endings e.g. “My Fancy Business 35, S.A.”
  • Monetary values. Dollar amounts written in numbers, or as text “Seventy five thousand United States Dollars”

Our dataset for training was a large collection of legal documents specific to our industry.

Our process was as follows:

  1. Create a patterns file, with entries such as { "label": "CONTRACT_NUMBER", "pattern" : [ { "lower": "la120229.4311" } ] }, {"label":"CLIENT","pattern":[{"lower":"Golden Fantastic Airlines Public Co. Ltd."}]}
  2. Use ner.teach to create a new entity type for each piece of information we want to extract.
  3. Annotate some documents, however, we did feel at this stage that Prodigy had difficulties asking relevant questions to move forward on training.

As we felt Prodigy wasn’t asking correct questions, we tried ner.manual to manually tag entities. I’m not sure we got enough annotations done, but when training the models we didn’t see very good performance.

So instead we tried using the names of the different parties, taken from manually entered metadata on the documents, to find the sentences they appeared in, and then made spaCy training data where each example had a sentence plus the start/end indices of a specific entity.

This turned out to be a little better, but we still got quite a lot of mistakes, such as the word “constitute” being predicted as a date.

We probably did more in the heat of the battle, tried different things just to see if it provided better results.

I think our mistakes were mostly training data related, but I’m curious to learn a few things:

  • Would we have been better off using ner.make-gold and filtering out dates that weren't the contract date, parties that weren't the specific type we wanted, and money values that were plain numbers?

  • Is it even possible to teach an entity type to find contract numbers, given there’s little predictability in the way they’re constructed?

  • How about MSN numbers? Those are just numbers, but often mentioned in a specific context, e.g. “ATR 72-600 Manufacturer Serial Number 1298”.

  • Does it make a difference if you train spaCy on sentences but run predictions on larger collections of text, like an entire page? Would it behave differently if predictions were made one sentence at a time?

Hi! Thanks for sharing your process and workflow. The problems you describe are definitely non-trivial, so I hope you're not too discouraged by the results so far. I think the general approach you chose made sense, but there are a few potential issues:

There are a few problems I see with your patterns here: First, keep in mind that those are exact match patterns. So the first pattern will only match exact occurrences of tokens whose lowercase text is identical to "la120229.4311". So unless this is a super common example in your data, you likely won't see any matches here. Instead, it makes more sense to work with more abstract token attributes, like the shape, e.g. token.shape_, which for "la120229.4311" would be "xxdddd.dddd" (two lowercase letters, a run of digits, a period, another run of digits – spaCy caps character runs at four).
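As a rough illustration of how those shapes are computed, spaCy's shape logic can be approximated in plain Python (a sketch only; the real implementation also special-cases very long strings):

```python
def word_shape(text, max_run=4):
    """Approximate spaCy's token.shape_: upper -> X, lower -> x,
    digit -> d, other characters kept as-is; runs capped at four."""
    shape = []
    run_char, run_len = "", 0
    for ch in text:
        if ch.isdigit():
            mapped = "d"
        elif ch.isalpha():
            mapped = "X" if ch.isupper() else "x"
        else:
            mapped = ch
        if mapped == run_char:
            run_len += 1
        else:
            run_char, run_len = mapped, 1
        if run_len <= max_run:
            shape.append(mapped)
    return "".join(shape)

print(word_shape("la120229.4311"))  # xxdddd.dddd
print(word_shape("LA120229.4311"))  # XXdddd.dddd
```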

You also want to make sure to verify that spaCy's tokenization matches the tokens defined in the patterns. Patterns are token-based, so each entry in the list should represent one single token and its attributes. This is also the reason why your second pattern will never match: there won't be a token whose lowercase text matches the string "Golden Fantastic Airlines Public Co. Ltd.", because spaCy will split this into several tokens:

import spacy

nlp = spacy.load('en_core_web_sm')
doc = nlp("Golden Fantastic Airlines Public Co. Ltd.")
print([token.text for token in doc])
# ['Golden', 'Fantastic', 'Airlines', 'Public', 'Co.', 'Ltd.']

You might find our Matcher demo useful, which lets you construct match patterns interactively and test them against your text.

That's difficult to say and really depends on the data and results. Especially since some of your problems were likely caused by the suboptimal patterns. I'd say that based on your descriptions, there are mostly three possible solutions and annotation strategies:

  1. Use ner.teach with better patterns that produce more matches. This will make it easier to move the model towards the desired definitions.
  2. Try ner.make-gold with all labels that you need (e.g. MONEY, DATE, ORG and your new types like CONTRACT_NUMBER). This way, your training data will include both your new definitions as well as entities that the model previously got right. This can prevent the so-called "catastrophic forgetting", and it lets you train with ner.batch-train and the --no-missing flag, telling spaCy that the annotations cover all entities. This can produce better accuracy, because non-annotated tokens are considered "not an entity", instead of "maybe an entity, maybe not, don't have data for it".
  3. Start with a blank model instead of a pre-trained model and teach it about your categories from scratch. This might require slightly more data, but it also means that the pre-trained weights won't interfere with your new definitions. If you write enough descriptive patterns for entity candidates, you can still use the ner.teach recipe to collect training data. Alternatively, you could use ner.manual to create a gold-standard set from scratch.

The best way to find out which approach works best is to start trying them – this is often what it comes down to, and allowing fast iteration and experiments was one of the key motivations for us to develop Prodigy :slightly_smiling_face:

I'd also recommend holding back some of your data and using ner.manual to create a gold-standard evaluation set. This will make it easier to reliably compare different approaches you try, and figure out which training dataset produces the best accuracy. By default, Prodigy's batch-train recipes hold back a certain percentage of your training data, which is okay for quick experiments and an approximation. But once you're getting more serious about finding the best training approach, you usually also want a dedicated evaluation set.

In theory, yes – assuming that the conclusion can be drawn from the local context. This is where statistical NER can be very powerful, because it lets you generalise based on similar examples.

However, if you're updating a pre-trained model, it's not always the best approach to try and teach it a completely new definition of an entity type. For example, it might not be very efficient to try and teach the pre-trained model that the tokens "MSN 1298" should not be analysed as ["O", "U-CARDINAL"] (outside an entity, single-token entity) but instead as ["B-SERIAL_NUMBER", "L-SERIAL_NUMBER"] (beginning and last token of an entity). Instead, it might make more sense to solve this by writing token-based rules that check for "MSN" followed by one or more number tokens.
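A rule along those lines can be sketched with a plain regular expression over the raw text (a token-based Matcher pattern would be the spaCy-native equivalent; the exact trigger phrases here are assumptions):

```python
import re

# Match "MSN" or the spelled-out phrase, followed by a number,
# e.g. "MSN 1298" or "ATR 72-600 Manufacturer Serial Number 1298".
MSN_RE = re.compile(r"\b(?:MSN|Manufacturer Serial Number)\s+(\d+)", re.IGNORECASE)

def find_serial_numbers(text):
    """Return all serial numbers that follow an MSN trigger phrase."""
    return [m.group(1) for m in MSN_RE.finditer(text)]

print(find_serial_numbers("ATR 72-600 Manufacturer Serial Number 1298"))  # ['1298']
print(find_serial_numbers("MSN 1298 and MSN 1412"))  # ['1298', '1412']
```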

Similarly, the following example might also be a better fit for a combination of predicting the DATE entity type and then using rules or a separate statistical process to determine whether it's a contractual date or not:

So instead of annotating an entirely new category CONTRACT_DATE that conflicts with the existing DATE label, you probably want to try improving the existing DATE label on your data so it makes as few errors as possible, and then adding a second process on top to label it as the subtype CONTRACT_DATE. See this thread on nested labels for an example.
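For instance, a very simple version of that second process could look for cue words near a predicted DATE span (a sketch; the cue words and window size are made-up placeholders you'd tune on your own documents):

```python
# Hypothetical cue words that tend to occur near a contractual date.
CONTEXT_WORDS = {"signed", "executed", "dated", "signature", "witness"}

def is_contract_date(text, start, end, window=60):
    """Label a predicted DATE span as CONTRACT_DATE if cue words
    appear within `window` characters around it."""
    context = text[max(0, start - window):end + window].lower()
    return any(word in context for word in CONTEXT_WORDS)

text = "This Agreement is signed and dated 12 March 2018 by the parties."
start = text.index("12 March 2018")
print(is_contract_date(text, start, start + len("12 March 2018")))  # True
```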

The following threads discuss similar approaches and ideas for combining statistical models with rule-based systems:

If you haven't seen it already, you might also like @honnibal's talk about the iterative development approach and how to find the right approach for data collection. The crime / victim / crime location example shown at 11:38 is actually kinda similar to your "contractual date" type.

Shorter units of text are usually easier to work with, and also make annotation more efficient, since you can focus on smaller chunks at a time. So if you have control over the incoming data, and you can decide what the model should see at runtime, you might as well use single sentences. Just make sure that the training data is similar to what the model will see at runtime and vice versa.
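If the incoming data arrives as whole pages, one way to get single sentences is to pre-split the text before prediction. A deliberately naive splitter for illustration (spaCy's sentencizer or parser is the right tool for real documents, since legal text is full of abbreviations and numbering):

```python
import re

def naive_sentences(page_text):
    """Very rough sentence splitter: break after ., ! or ?
    when followed by whitespace. Illustration only."""
    parts = re.split(r"(?<=[.!?])\s+", page_text.strip())
    return [p for p in parts if p]

page = "MSN 1298 is covered. The contract LA120229.4311 applies."
print(naive_sentences(page))
# ['MSN 1298 is covered.', 'The contract LA120229.4311 applies.']
```

Note that the period inside "LA120229.4311" is not split on, because it isn't followed by whitespace.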

Thank you very much, for your thorough reply! There’s a lot to digest here, so I’m very encouraged to try out different approaches to make our models better.

The issues around our patterns makes total sense when described so clearly.

I will check back later with how it all panned out.

Thanks again!


So, just a quick update on a much better result, after improving the patterns file :slight_smile:

I decided to begin with the CONTRACT_NUMBER model from scratch, and started out querying known contract numbers in our database, then pushed token.shape_ into a set to figure out the unique shapes of a contract number we’re dealing with.

I then created the patterns file here:

{"label":"CONTRACT_NUMBER","pattern":[{"SHAPE": "XXXdddd"}]}
{"label":"CONTRACT_NUMBER","pattern":[{"SHAPE": "XXdddd"}]}
{"label":"CONTRACT_NUMBER","pattern":[{"SHAPE": "XXdddd.X"}]}
{"label":"CONTRACT_NUMBER","pattern":[{"SHAPE": "XXdddd.dddd"}]}
{"label":"CONTRACT_NUMBER","pattern":[{"SHAPE": "XXdddd.Xdddd"}]}
{"label":"CONTRACT_NUMBER","pattern":[{"SHAPE": "XXdddd.dd"}]}
{"label":"CONTRACT_NUMBER","pattern":[{"SHAPE": "XXdddd.ddd"}]}
{"label":"CONTRACT_NUMBER","pattern":[{"SHAPE": "XXdddd.dddd"}]}
{"label":"CONTRACT_NUMBER","pattern":[{"SHAPE": "XXdddd.dddX"}]}
{"label":"CONTRACT_NUMBER","pattern":[{"SHAPE": "XXdddd.dddXX"}]}
{"label":"CONTRACT_NUMBER","pattern":[{"SHAPE": "XXdddd.ddddXX"}]}
{"label":"CONTRACT_NUMBER","pattern":[{"SHAPE": "XXXdddd.dd"}]}
{"label":"CONTRACT_NUMBER","pattern":[{"SHAPE": "XXXdddd.ddd"}]}
{"label":"CONTRACT_NUMBER","pattern":[{"SHAPE": "XXXdddd.dddd"}]}
{"label":"CONTRACT_NUMBER","pattern":[{"SHAPE": "XXXdddd.dddX"}]}
{"label":"CONTRACT_NUMBER","pattern":[{"SHAPE": "XXXdddd.dddXX"}]}
{"label":"CONTRACT_NUMBER","pattern":[{"SHAPE": "XXXdddd.ddddXX"}]}
{"label":"CONTRACT_NUMBER","pattern":[{"SHAPE": "XXddddXddd"}]}
{"label":"CONTRACT_NUMBER","pattern":[{"SHAPE": "XXddddXXddd"}]}
{"label":"CONTRACT_NUMBER","pattern":[{"SHAPE": "XXddddXdddd"}]}
{"label":"CONTRACT_NUMBER","pattern":[{"SHAPE": "XXddddXXdddd"}]}
{"label":"CONTRACT_NUMBER","pattern":[{"SHAPE": "XXddddxddd"}]}
{"label":"CONTRACT_NUMBER","pattern":[{"SHAPE": "XXddddxxddd"}]}
{"label":"CONTRACT_NUMBER","pattern":[{"SHAPE": "XXddddxdddd"}]}
{"label":"CONTRACT_NUMBER","pattern":[{"SHAPE": "XXddddxxdddd"}]}
{"label":"CONTRACT_NUMBER","pattern":[{"SHAPE": "dddd.ddd"}]}
{"label":"CONTRACT_NUMBER","pattern":[{"SHAPE": "XXdddd"},{"SHAPE": ".dd"}]}
{"label":"CONTRACT_NUMBER","pattern":[{"SHAPE": "XXdddd"},{"SHAPE": ".ddd"}]}
{"label":"CONTRACT_NUMBER","pattern":[{"SHAPE": "XXdddd"},{"SHAPE": ".dddd"}]}
{"label":"CONTRACT_NUMBER","pattern":[{"SHAPE": "XXdddd"},{"SHAPE": "dd"}]}
{"label":"CONTRACT_NUMBER","pattern":[{"SHAPE": "XXdddd"},{"SHAPE": "ddd"}]}
{"label":"CONTRACT_NUMBER","pattern":[{"SHAPE": "XXdddd"},{"SHAPE": "dddd"}]}
{"label":"CONTRACT_NUMBER","pattern":[{"SHAPE": "XX"},{"SHAPE": "dddd"}]}
{"label":"CONTRACT_NUMBER","pattern":[{"SHAPE": "XX"},{"SHAPE": "dddd.ddd"}]}
{"label":"CONTRACT_NUMBER","pattern":[{"SHAPE": "XX"},{"SHAPE": "dddd.dddd"}]}
{"label":"CONTRACT_NUMBER","pattern":[{"SHAPE": "XXdddd.dd"},{"SHAPE": "dd"}]}
{"label":"CONTRACT_NUMBER","pattern":[{"SHAPE": "XXdddd.ddd"},{"SHAPE": "d"}]}
{"label":"CONTRACT_NUMBER","pattern":[{"SHAPE": "XXdddd.X"},{"ORTH": "/"},{"SHAPE": "Xd"}]}
{"label":"CONTRACT_NUMBER","pattern":[{"SHAPE": "XXddddX"},{"ORTH": "/"},{"SHAPE": "Xd"}]}
{"label":"CONTRACT_NUMBER","pattern":[{"SHAPE": "XXdddd"},{"ORTH": "-"},{"SHAPE": "dd"}]}
{"label":"CONTRACT_NUMBER","pattern":[{"SHAPE": "XXdddd"},{"ORTH": "-"},{"SHAPE": "ddd"}]}
{"label":"CONTRACT_NUMBER","pattern":[{"SHAPE": "XXdddd"},{"ORTH": "-"},{"SHAPE": "dddd"}]}
{"label":"CONTRACT_NUMBER","pattern":[{"SHAPE": "XXXdddd"},{"ORTH": "-"},{"SHAPE": "dd"}]}
{"label":"CONTRACT_NUMBER","pattern":[{"SHAPE": "XXXdddd"},{"ORTH": "-"},{"SHAPE": "ddd"}]}
{"label":"CONTRACT_NUMBER","pattern":[{"SHAPE": "XXXdddd"},{"ORTH": "-"},{"SHAPE": "dddd"}]}
{"label":"CONTRACT_NUMBER","pattern":[{"SHAPE": "XXdddd.X"},{"ORTH": "/"},{"SHAPE": "Xd(XXX"},{"SHAPE": "dddd"},{"ORTH": ")"}]}

I then started annotating using:

prodigy ner.teach contract_number_ner en_core_web_lg ..\data\merged.txt --loader txt --label CONTRACT_NUMBER --patterns ContractNumbers.jsonl

After several sessions, I ended up with ~3500 annotations, most of which were rejections. I presume a majority of rejections is normal?

It slowly starts finding actual contract numbers, then goes bad for a while, then a batch of good ones and so on. Probably just a coincidence, and related to the annotation data being used.

Every once in a while, it revealed patterns that I didn’t cover in my initial patterns file, usually typos or OCR mistakes, with added extra space etc. I decided to add those new formats to my patterns file, and start another annotation session. The Matcher demo page was very useful to figure out the correct shape of tokens.

I then initiated the ner.batch-train command, and got the following output:

(morph_ner) C:\Workspace\Dev\morph_ner\contract_number>python -m prodigy ner.batch-train contract_number_ner en_core_web_lg --output models\v1 --n-iter 15 --eval-split 0.2 --dropout 0.2 --no-missing

Loaded model en_core_web_lg
Using 20% of accept/reject examples (652) for evaluation
Using 100% of remaining examples (2608) for training
Dropout: 0.2  Batch size: 16  Iterations: 15

BEFORE     0.000
Correct    0
Incorrect  897
Entities   863
Unknown    0

#          LOSS       RIGHT      WRONG      ENTS       SKIP       ACCURACY
01         15.420     29         7          31         0          0.806
02         14.272     28         7          29         0          0.800
03         14.365     32         4          34         0          0.889
04         14.044     32         5          35         0          0.865
05         14.063     32         6          36         0          0.842
06         14.519     32         5          35         0          0.865
07         14.499     32         4          34         0          0.889
08         14.540     32         3          33         0          0.914
09         13.927     32         4          34         0          0.889
10         14.149     32         3          33         0          0.914
11         14.422     30         5          31         0          0.857
12         14.333     30         5          31         0          0.857
13         14.303     32         3          33         0          0.914
14         13.789     32         3          33         0          0.914
15         14.079     32         3          33         0          0.914

Correct    32
Incorrect  3
Baseline   0.000
Accuracy   0.914

Model: C:\Workspace\Dev\morph_ner\contract_number\models\v1
Training data: C:\Workspace\Dev\morph_ner\contract_number\models\v1\training.jsonl
Evaluation data: C:\Workspace\Dev\morph_ner\contract_number\models\v1\evaluation.jsonl

(morph_ner) C:\Workspace\Dev\morph_ner\contract_number>

I’m very happy with those stats :smile:

One thing I notice now is the fact that all my shapes have the leading characters uppercase, XX and not xx - this means that when testing afterwards, it doesn’t recognize la120215.4322 whereas LA120215.4322 is recognized just fine - which makes sense, since I probably didn’t do a single annotation with lowercase la.
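I suppose I could just add lowercase variants of the same shapes to the patterns file (untested), e.g.:

```
{"label":"CONTRACT_NUMBER","pattern":[{"SHAPE": "xxdddd"}]}
{"label":"CONTRACT_NUMBER","pattern":[{"SHAPE": "xxdddd.dddd"}]}
```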

I’m not sure if I should create a blank model in spaCy, save it and then run ner.batch-train with that, instead of using en_core_web_lg?

So far, very positive results! The initial learning curve for me is definitely how to get Prodigy to find the right stuff in the annotation source, so that it can provide relevant tasks.

Thanks for updating – glad to hear the results are promising so far! :+1:

Depending on the data and category, this is definitely possible. Keep in mind that ner.teach will focus on showing you the examples the model is most uncertain about, and it uses an exponential moving average to calculate that threshold as you annotate. So if there are analyses of the text that the model is already super confident about, you likely won’t even get to see those.

One thing you probably want to do to get a more reliable evaluation is to create an evaluation set using ner.manual, so you can evaluate on gold-standard parses instead of parts of the held-back binary annotations. This will also help you determine whether you’re running the risk of teaching the model that “pretty much all suggestions are incorrect” and whether you need more “positive” examples.

Yes, this would also be consistent with the features used by the model: the PREFIX (first character), SUFFIX (last 3 characters), SHAPE and NORM. (See this video for more details on the NER model.)
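To illustrate why the lowercase variant looks so different to the model, those features can be approximated in plain Python (rough sketches; spaCy's exact definitions live in its lexical attributes):

```python
def lexical_features(text):
    # Rough approximations of the lexical features the NER model uses.
    return {
        "prefix": text[:1],    # first character
        "suffix": text[-3:],   # last three characters
        "norm": text.lower(),  # simplified norm
    }

print(lexical_features("LA120215.4322"))
# {'prefix': 'L', 'suffix': '322', 'norm': 'la120215.4322'}
print(lexical_features("la120215.4322"))
# {'prefix': 'l', 'suffix': '322', 'norm': 'la120215.4322'}
```

The prefix (and the shape) differ between the two spellings, which is consistent with the model treating them differently when it has only ever seen the uppercase form.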

This mostly depends on whether you need the existing pre-trained categories or not. If not, it’s always better to train a new model from scratch, to avoid potential conflicts and results that are harder to interpret. The pre-trained models have seen tens of thousands of examples, so updating them with a few thousand of a new category is more difficult than training a new “blank” model from those annotations.

If you do want to use the pre-trained model, I’d recommend using the en_core_web_sm for quick experiments, simply because it’s faster. The batch-train recipes serialize the best model on each epoch to bytes and keep it in memory, so you’ll always get the best model of the whole run at the end.

I tried using a blank model, but for some reason it gave me accuracy of 0.00 when running ner.batch-train after doing a little more than 1000 annotations.

When creating the blank model, I got the error described here, I checked my spaCy version which is 2.0.12 - so I worked around that and ended up with this:

from __future__ import unicode_literals, print_function
from pathlib import Path

import spacy

def main(output_dir=None):
    nlp = spacy.blank('en')  # create blank Language class
    print("Created blank 'en' model")

    if 'ner' not in nlp.pipe_names:
        print("Adding ner pipe")
        ner = nlp.create_pipe('ner')
        nlp.add_pipe(ner, last=True)

    # work around the vectors naming bug in spaCy 2.0.x
    nlp.vocab.vectors.name = 'en_core_web_lg.vectors'
    optimizer = nlp.begin_training()
    losses = {}

    other_pipes = [pipe for pipe in nlp.pipe_names if pipe != 'ner']
    with nlp.disable_pipes(*other_pipes):  # only train NER
        for i in range(1):
            nlp.update(
                [],  # batch of texts
                [],  # batch of annotations
                drop=0.5,  # dropout - make it harder to memorise data
                sgd=optimizer,  # callable to update weights
                losses=losses,
            )

    # save model to output directory
    if output_dir is not None:
        output_dir = Path(output_dir)

        if not output_dir.exists():
            output_dir.mkdir(parents=True)

        nlp.meta['name'] = 'blank_ner_model'  # rename model
        nlp.to_disk(output_dir)
        print("Saved model to", output_dir)

if __name__ == '__main__':
    main(output_dir='models/blank')  # output path is just an example

I tried en_core_web_sm, and while it is faster to work with, accuracy and annotation quality suffers a little (as far as I can see).

I generated match patterns for Manufacturer Serial Number (MSN) to feed into Prodigy, and after doing ~600 annotations and training using en_core_web_lg I am up to 92.6% accuracy for the MSN model.

This will probably be my next attempt at getting it to >95% accuracy.