ner.teach not giving relevant entities from patterns jsonl


So I’m trying to create an NER model to extract addresses from freeform text. I’ve created a .jsonl patterns file, which consists of ~60,000 distinct patterns that an address could take. For instance, an address like “112 18th St” would be matched by:

{"label": "ADDR", "pattern": [{"shape": "ddd"}, {"shape": "ddXX"}, {"lower": "st"}]}

I then ran the following command:

prodigy ner.match addr_annotations en_core_web_lg data/source.jsonl --patterns data/address_seeds.jsonl

This mindlessly yields matches to the patterns file, but doesn’t cleverly focus on the uncertain matches. Or at least I think it doesn’t, because it keeps asking about lots of duplicates located in similar surrounding text, even after 1500 annotations. The other shortcoming is that if a string deviates even slightly from a defined pattern, it is never suggested; no novel suggestions come from the model. Thus, I decided to give ner.teach a try to avoid these problems, like so:

prodigy ner.teach addr_annotations en_core_web_lg data/source.jsonl --label ADDR --patterns data/address_seeds.jsonl

But to my surprise, the suggested entities seemed to entirely ignore the patterns file and served random words. In particular, even though the patterns file defined multi-token entities, every suggestion from ner.teach was just a single token. Clearly there’s nothing wrong with the patterns file itself, since the addresses were correctly suggested with ner.match.

ner.match output:

ner.teach output with same model, dataset, and patterns file:

Is this the expected behaviour of ner.teach? And if not, could anyone enlighten me as to the cause?

It does take a while for the model to learn the category, so at first it’s expected that the model will suggest a lot of unlikely entities. A little bit of background on how this is working:

The model takes the outputs of the pattern recognizer and the NER model, and interleaves the two outputs. This means it’s trying to show you roughly one suggestion from the model for each suggestion from the pattern matcher. The true matches from the pattern matcher are added as training examples for the model, and the model also learns when you click yes or no to the suggestions.
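As an illustrative sketch of that interleaving (not Prodigy's actual combine_models implementation, just the round-robin idea):

```python
def interleave(*streams):
    # Round-robin across suggestion streams, dropping any that run dry,
    # so the annotator sees roughly one question from each source in turn.
    iterators = [iter(s) for s in streams]
    while iterators:
        for it in list(iterators):
            try:
                yield next(it)
            except StopIteration:
                iterators.remove(it)

pattern_hits = ["112 18th St", "9 Elm Ave"]   # from the pattern matcher
model_guesses = ["the", "drive", "Main"]      # from the (untrained) model
mixed = list(interleave(pattern_hits, model_guesses))
print(mixed)  # ['112 18th St', 'the', '9 Elm Ave', 'drive', 'Main']
```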

In the screenshot you’ve attached, the model has only assigned a score of 0.06 to that suggestion — so, pretty low. But if all of the predictions are low, the model still asks you some of those as questions. If we didn’t, we wouldn’t be able to escape from situations where the model is assigning low scores, but is miscalibrated.
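A toy version of this kind of question sorting, assuming a stream of (score, example) pairs. Prodigy's real sorters are more sophisticated (the function and parameter names here are hypothetical, not Prodigy's API):

```python
import random

def filter_uncertain(stream, floor=0.1):
    # Pass questions through with probability proportional to how close
    # their score is to 0.5. The floor keeps even very low-scoring
    # questions coming through occasionally, so a miscalibrated model
    # that assigns low scores everywhere can still get corrective feedback.
    for score, example in stream:
        uncertainty = 1.0 - abs(score - 0.5) * 2.0
        if random.random() < max(uncertainty, floor):
            yield score, example

# A score of exactly 0.5 (maximum uncertainty) is always asked
borderline = [(0.5, f"example {i}") for i in range(5)]
assert len(list(filter_uncertain(borderline))) == 5
```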

So, you do need to click through the examples for a while to bootstrap the model this way. Sometimes the pattern-based bootstrapping really works well, especially if the entity category is easy to learn, and it’s common in your dataset. In other situations, it’s better to use one of the other modes to get initial annotations, e.g. ner.manual or ner.make-gold, and once you have a dataset of correct examples, then train a model with ner.batch-train. Sometimes that’s a better way to get started.

In summary, active learning suffers from a cold-start problem. The initial model doesn’t know enough to make good suggestions, so it’s hard to make the initial progress. The patterns file helps a lot with this, but sometimes the model still struggles, so sometimes another approach is better to get started.

Thanks, Matthew, for clarifying. I have some follow-up questions:

This means it’s trying to show you roughly one suggestion from the model for each suggestion from the pattern matcher.

In your experience, what is the order of magnitude of examples needed to bootstrap the model via ner.teach until sensible suggestions emerge? Hundreds? Thousands? I’m asking because at one point, I had rejected 300+ examples of “random” single-token entities, and the model had yet to suggest a multi-token entity. If, as you say, ner.teach is interleaving results from the pattern recogniser and the model, surely it should at least notice that all my patterns span multiple tokens, and that no single token can ever be an address (at least by the 300th example)?

The scores also do not seem to be systematically tending toward 0.5 either, and I’m worried it might be bugging out.

In other situations, it’s better to use one of the other modes to get initial annotations, e.g. ner.manual or ner.make-gold

Are my 1500 annotations via ner.match usable in this scenario as well? Or are they fundamentally different from the ner.manual examples when it comes to batch training?

An update: I used ner.batch-train and obtained a 75% accuracy on the new model, which I saved as addresses. I then ran ner.teach with this new model, like so:

prodigy ner.teach addr_dataset addresses data/source.jsonl --label ADDL1 --patterns data/address_seeds.jsonl

Despite being able to predict the address spans correctly during the batch training, the model has once again reverted to serving single tokens here. This time with scores in the 0.4-0.6 range, which should not be happening:

Any ideas?

[Fixed] Update 2:

As it turns out, there’s a weird bug going on that has been previously reported here: Prodigy does not play nicely with uppercase labels, and oddly, the workaround is to use lowercase labels in the patterns file but keep the label uppercase when writing out the prodigy command in the terminal.

It has been fixed for me for now, but this behaviour is worth looking into.

Thanks for the update! We’ll definitely get that bug fixed for the next update. Very vexing.

This means it’s trying to show you roughly one suggestion from the model for each suggestion from the pattern matcher. The true matches from the pattern matcher are added as training examples for the model, and the model also learns when you click yes or no to the suggestions.

Just to clarify, is the model supposed to learn live, during the ner.teach process? Or only after I close the annotation window and run ner.batch-train? I’m asking because whilst using a patterns file, ner.teach seems to behave the same as ner.match; that is, it merely yields matches to the patterns and does not seem to interleave novel suggestions from the model. Additionally, even when the matched strings are exact duplicates of one another, the annotation window repeatedly displays these duplicates and does not seem to prefer the more uncertain queries.

I’m pretty certain that something is broken, and my gut feeling is that in the context of a ner.teach command, the model’s output is only considered when --label is lowercase, while the pattern matcher’s output is only considered when --label is UPPERCASE.

I’ve done some tests to narrow this down. After performing my initial set of ~1000 annotations, I ran ner.batch-train, and then ran ner.teach on the newly trained model. Whenever the label is spelled in lowercase (i.e. the correct casing), the suggestions are free to differ from the available patterns. Whereas when the label is spelled in uppercase, I only get perfect pattern matches (no different from before batch-train was run), and the score in the bottom corner of the annotation display is constantly 0, which should not be the case: after ner.batch-train is run, the model should have an idea of what constitutes an address entity.


python -m prodigy ner.teach addr_db_v01 models/addr_model_v01 data/source.jsonl --label addr --patterns data/address_seeds.jsonl --exclude addr_db_v01

returns the same output as

python -m prodigy ner.teach addr_db_v01 models/addr_model_v01 data/source.jsonl --label addr --exclude addr_db_v01

which confirms my suspicions.

These bugs are breaking for me, because they disrupt the core purpose that Prodigy was meant to fulfill in my workflow :frowning:. As of now, the pattern matching is basically just glorified regex with no model input, and it’s frustrating to have spent the past week working around the bugs of this software (see the other threads on ner.batch-train failing, etc.). What’s the timeline on the updates to fix this?

Unfortunately, I’m running into this right now too.

Yes, it learns live — calls to model.update are made as each batch is returned. You can turn on PRODIGY_LOGGING="basic" to see the updates that are coming back.

I think the bug that occurred with lower-case labels would arise from part of the code “helpfully” trying to fix a problem where labels are given in lower-case, and then this not matching the upper-case labels in the model. I think this case insensitivity is applied inconsistently, causing the bug mentioned in the other thread.

I honestly feel pretty confused by what you’re reporting. I’m sure you understand that the label has to be written the same in your patterns file and on the command line when you pass the argument. For simplicity, always write it in upper-case.

If the label is written in the patterns file in upper-case, and on the command-line you tell Prodigy to only ask you questions about the lower-cased label, then of course you’ll not get asked any questions from the patterns.

I’m sorry you’ve had a frustrating time getting your system working. It’s a little hard to track the problem across the multiple threads you’ve opened, and it’s hard to have the full picture without working on the same dataset. The issue is that not all sets of user actions are going to result in a working model. This doesn’t necessarily mean the software has a bug — it can just mean that you’ve defined the problems in a way that the model can’t learn, some settings are unideal for the data you’re using, labels are mismatched, etc. Of course, it’s also possible that you’re running into one or more bugs that compound your problems and make the state harder to reason about. But I would suggest it’s unlikely the software is fundamentally broken — we do have a lot of people using it successfully, and we’ve been using it successfully ourselves.

We’re planning to make the labels consistently case-insensitive across Prodigy and spaCy to fix that issue. I’ll also try to provide another example of a healthy ner.teach session so you can see how the dynamics are supposed to work.

@etlweather I think your problem is quite different. @ines has replied here: Pettern length (12) >= phrase_matcher.max_length

I think the bug that occurred with lower-case labels would arise from part of the code “helpfully” trying to fix a problem where labels are given in lower-case, and then this not matching the upper-case labels in the model. I think this case insensitivity is applied inconsistently, causing the bug mentioned in the other thread.

I think you misunderstand. Let me explain what I tried, in order:

  1. Does not output anything vaguely resembling patterns: Uppercase labels in pattern file + uppercase pattern in --label while using ner.teach
  2. Starts outputting patterns, but does no learning. Behaves exactly like ner.match, “score == 0, pattern = {num}” for all suggestions: lowercase labels in patterns file + uppercase pattern in --label while using ner.teach
  3. Throws error message about non-existent label: when annotations from 2. are fed into ner.batch-train with the batch-train’s --label being in uppercase.
  4. No error, successful training with 96% accuracy on 1000 examples: when annotations from 2. are fed into ner.batch-train, but with the batch-train’s --label set to lowercase.
  5. Label displayed in uppercase in annotation GUI regardless of casing in patterns file or --label input

That is to say, the code fails every time the labels are kept in consistent case through the patterns file, to ner.teach, to ner.batch-train. Interestingly, the parsing of the --label parameter seems to differ between ner.teach and ner.batch-train, because I’m forced to use the casing in the patterns file for the latter, but not the former. Additionally:

  6. Well-formed, address-like suggestions, unlike 1.: Used the model from 4. in ner.teach, with lowercase label in its --label. Same behaviour both with and without reference to the patterns file. (This has some retrograde behaviour with the score after a few hundred annotations, hence the separate thread, but at least it shows that the model has learnt from ner.batch-train.)
  7. Throws an error message about a non-existent label: Used the model from 4. in ner.teach, with uppercase label in its --label, without reference to the patterns file.
  8. Again, outputs pattern matches with no learning, as in 2.: Used the model from 4. in ner.teach, with uppercase label in its --label, with reference to the patterns file.

Now this gets interesting. Why is 7. doing the defensive thing and throwing an error, while 2. is not?

Additionally, it seems to me that the active-learning interleaving behaviour you mentioned is not happening, because the Matcher and ner.teach seem to expect different casings for their labels. Is the problem clearer now?

Okay, I think I’ve got this solved :tada:. Merging in the other thread, to keep things in one place. See my later reply for the resolution.

I have a dataset which I’ve annotated with ner.teach, and then proceeded to perform ner.batch-train on the ~1000 examples. It returned a model with 96.9% accuracy after 10 iterations. I then used this saved model as the model in a new iteration of ner.teach. But when executing

python -m prodigy ner.teach addr_db_v01 models/addr_model_v01 data/source.jsonl --label addr --exclude addr_db_v01

I get <50 well-formed suggestions in the annotation window. By well-formed I don’t mean correct, but more loosely that the span at least vaguely has a form which can be construed as an address (e.g. mistaking “5 mins drive” for an address is well-formed, because there were training patterns along the lines of [{"shape": "d"}, {"is_alpha": true}, {"lower": "drive"}]).

After the initial few reasonable suggestions, the annotation window degrades into asking me if punctuation marks or random stopwords are address entities. This continues for the foreseeable future (I’ve tried rejecting >400 annotations just to see if the behaviour would change).

I’m not even sure how this behaviour arose. Any thoughts?


So my intuition told me that this bug is probably related to the issue in this thread:

I decided to try --label ADDR,addr instead of --label addr to circumvent the problem, and the situation seemingly improved! It now gives well-formed suggestions for the first 200 or so times, but eventually degrades to giving full stops a score of 0.5 again after that.

Update 2:

Using --exclude with the existing dataset, as shown in my first post, while defining --label ADDR causes Prodigy to hang. Specifically, it prints “Using 1 labels: ADDR”, but then stays there indefinitely and never starts the server on localhost:8080. Not sure if this piece of information will be useful for debugging.


I think the issues here come back to the same things we’re discussing in the other thread, which makes it a little harder to discuss in multiple places. I think you have a mismatch in how you’re writing the label, addr. It needs to be cased the same everywhere: in the patterns file, in the training data, on the command line, etc. Use upper-case in all places for simplicity.

If the model is predicting an upper-case label (ADDR), but you’re telling it you only want questions about a different label (addr), you’ll get no questions. This is the expected behaviour.

Finally, note that accuracy on your data collected with ner.teach might not be indicative of how accurate the model really is on an unbiased sample of text. So, it’s possible to have accuracy of 96% on one dataset, but then run the model over unbiased text and see worse results. You might run the ner.print-best command to see how the model is predicting on new text, to understand its behaviours better.

The reason for the curious casing is explained in the other thread. tl;dr, using consistent uppercase causes buggy output during the initial annotation step. Also, ner.teach expects inconsistent casing depending on model used.

The reason why this thread is separate is because ideally, ner.teach is supposed to be suggesting ambiguously address-like entities and actively learning. Instead, for its first 50 or so suggestions, it performs as promised (and true addresses, where they appear as suggestions, have a score >0.85), but then after a few hundred more annotations, the scores for true addresses drops precipitously (to <0.2). And at some point, they disappear entirely, with punctuation and stopwords miraculously having their score rise to ~0.5.

This happens despite the fact that I have never confirmed anything vaguely resembling those tokens to be address entities within the annotation GUI, and have in fact repeatedly rejected them as they appear.

The fact that well-formed suggestions do appear, at least for the first 50 or so suggestions, seems to suggest that a --label parameter error is not the problem. Because if that were the case, it would have shown nonsensical outputs from the start (haha it’s impossible to guess what an address looks like without referencing the correct label in the model). Instead, we are seeing the scores for the correct entities degrade over time.

N.B: Merged the other thread with this one, as it should now be resolved

Okay, I think I’ve gotten to the bottom of this. There was a regression introduced in v1.5.1 that wasn’t in v1.5.0, concerning the way matches are scored from the pattern matcher. The matcher allows you to set a prior on the number of correct and incorrect matches, to initialize the scores. v1.5.1 changed the way the patterns are indexed, and this caused a problem with the default scores, making the initial scores from the matcher 0. When there are few pattern matches, and active learning is used, this results in no matches from the pattern matcher being selected.

The behaviours you’ve seen are explained as follows. When you had upper-case labels in both the patterns file and the command line (the correct setting), the 0-scores from the matcher were discarded by the active learning, so only questions from the model were shown. The model had no information to start with, so the questions were arbitrary.

When you set a lower-case label on the command line, the model’s matches were not shown, so only questions from the matcher would be presented. If the matcher is seeking lower-cased labels, you’d be asked questions from the matcher — but the model would never learn.

I believe the batch training problem comes down to the same thing. If you batch train with the label in one casing, but then try to teach with the label in a different casing, the model will not learn from the updates, and performance will steadily degrade.

There’s a simple way to mitigate the bug until we can release the next version. In the teach recipe, add these two lines right after the PatternMatcher is created (you’ll also need from collections import defaultdict at the top of the file, if it isn’t imported already):

        matcher.correct_scores = defaultdict(lambda: 2.0)
        matcher.incorrect_scores = defaultdict(lambda: 2.0)

Here’s how it should look in context:

        matcher = PatternMatcher(model.nlp).from_disk(patterns)
        matcher.correct_scores = defaultdict(lambda: 2.0)
        matcher.incorrect_scores = defaultdict(lambda: 2.0)
        log("RECIPE: Created PatternMatcher and loaded in patterns", patterns)
        # Combine the NER model with the PatternMatcher to annotate both
        # match results and predictions, and update both models.
        predict, update = combine_models(model, matcher)

You should see pattern matches now coming in with a score of 0.5. As you mark instances from these patterns as correct or incorrect, the score will be updated. You can also set a stronger prior if you prefer.
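To make the prior arithmetic concrete, here is a minimal sketch of how such pseudo-count scoring behaves (the helper names are hypothetical, not Prodigy's API):

```python
from collections import defaultdict

# Pseudo-count priors: 2 "correct" and 2 "incorrect" per pattern,
# so every pattern starts at 2 / (2 + 2) = 0.5
correct_scores = defaultdict(lambda: 2.0)
incorrect_scores = defaultdict(lambda: 2.0)

def pattern_score(pattern_id):
    c, i = correct_scores[pattern_id], incorrect_scores[pattern_id]
    return c / (c + i)

def record(pattern_id, accepted):
    # Each accept/reject nudges that pattern's score up or down
    if accepted:
        correct_scores[pattern_id] += 1.0
    else:
        incorrect_scores[pattern_id] += 1.0

print(pattern_score(0))   # 0.5 before any feedback
record(0, accepted=True)
record(0, accepted=True)
print(pattern_score(0))   # 4.0 / (4.0 + 2.0) = 0.666...
```

A stronger prior (larger starting pseudo-counts) would make each individual accept or reject move the score more slowly.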

Thanks for the fix. Just to make sure everything is working as expected, here is the behaviour after adding the two lines:

  1. With consistent uppercase labels across the patterns file and the commands, everything runs as expected, and the suggestions interleave model suggestions and pattern matches.
  2. However, even after ~1000 annotations, the model is still serving duplicate addresses that match the patterns file but have been accepted/rejected 10-20 times prior, and the scores for the duplicates are maintained at 0.5. (i.e. no active learning seems to be taking place).
  3. I tried running ner.batch-train on the ~1000 annotations I had, to see if the scores would change from 0.5. I obtained 98.9% accuracy over 10 iterations. After, I ran:
python -m prodigy ner.teach addr_db_v02 models/addr_model_v02 data/source.jsonl --label ADDR --patterns data/address_seeds.jsonl --exclude addr_db_v02

But the output was curious. It began by suggesting well-formed, address-like entities. But intermittently, it would also suggest random single tokens with low scores (< 0.15) as address entities. And then as before, it would slowly drift back to scoring stopwords and punctuation >0.5, and stop suggesting address-like spans entirely.

Interestingly, the patterns file did not seem to be detected at all: never once did I get a prompt in the lower right of the annotation window indicating a pattern match, and the suggested entities with and without --patterns data/address_seeds.jsonl remain the same. Thoughts?

It does sound like the hotfix I gave you isn’t letting the prior update in the matcher, which isn’t correct. It should keep track of which patterns are succeeding or failing, so it can stop asking those questions. I’ll have another look at the hotfix and see if I can improve this.

There are lots of potential explanations here, but probably your model learned to memorise the patterns, and came to a super confident view of the data. Then when you train it, it sees more arbitrary examples and suffers a big loss, making it question everything — so you get a period where the predictions become very low confidence.

Try just doing some manual annotations to get an unbiased dataset. The ner.make-gold recipe could also be useful, as it lets you use the suggestions from a model as a starting point. For something similar using a matcher, you can pipe the output of a script into ner.manual, so you have initial suggestions configured using whatever process you like.