Understanding ner.batch-train stats

I am trying to understand what the stats reported by the ner.batch-train recipe mean and how best to use them.

Background - I want to use NER to recognize two labels - ARTIST and WORK_OF_ART - on a bunch of music video titles. The ARTIST label is new, but I decided to re-purpose the WORK_OF_ART label that already exists in some models. Many of the titles in the source text contain both labels, some only one of them, and some none at all.

I have 500 titles that I manually annotated with Prodigy (that was a surprisingly quick and slick task), most of them with ‘accept’ and some with ‘reject’ answers. I split them into two parts for training and evaluation (using a fixed split so the results can be compared between models). I wanted to run these through ner.batch-train in order to determine which model is the best choice as a starting point. The following models were tested:

  • blank
  • en_core_web_sm (with and without NER)
  • en_core_web_md (with and without NER)
  • en_core_web_lg (with and without NER)
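The fixed split mentioned above can be sketched like this (a minimal example; the task dicts and seed value are hypothetical, not the actual dataset):

```python
# Sketch of a fixed train/eval split over annotation tasks. Using a
# fixed random seed means the same examples always land in the
# evaluation set, so accuracy is comparable across models.
import random

def fixed_split(examples, eval_size=250, seed=0):
    """Return (train, evaluation) with a deterministic shuffle."""
    rng = random.Random(seed)
    shuffled = list(examples)
    rng.shuffle(shuffled)
    return shuffled[eval_size:], shuffled[:eval_size]

titles = [{"text": f"title {i}"} for i in range(490)]
train, evaluation = fixed_split(titles)
print(len(train), len(evaluation))  # 240 250
```

With 490 annotated examples this reproduces the 250/240 split shown in the training output below.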

Question #1 - Is it safe to assume that the model that gives me the best accuracy after training with the initial annotations is the best model to work with?

Running ner.batch-train on these with en_core_web_sm gives:

Training model: en_core_web_sm...
Using 2 labels: ARTIST, WORK_OF_ART

Loaded model en_core_web_sm
Loaded 250 evaluation examples from 'eval_dataset'
Using 100% of remaining examples (240) for training
Dropout: 0.2  Batch size: 32  Iterations: 10


BEFORE     0.045
Correct    12
Incorrect  255
Entities   334
Unknown    15


#          LOSS       RIGHT      WRONG      ENTS       SKIP       ACCURACY
01         3.088      28         239        1598       0          0.105
02         3.075      56         211        1221       0          0.210
03         2.509      68         199        1524       0          0.255
04         2.192      87         180        1162       0          0.326
05         2.040      105        162        1314       0          0.393
06         1.864      107        160        1405       0          0.401
07         1.788      117        150        1286       0          0.438
08         1.659      122        145        1402       0          0.457
09         1.540      135        132        1254       0          0.506
10         1.458      128        139        1366       0          0.479

Correct    135
Incorrect  132
Baseline   0.045
Accuracy   0.506
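As a sanity check on how the final stats relate to each other, the ACCURACY column appears to be simply RIGHT / (RIGHT + WRONG) over entities in the evaluation set:

```python
# Final iteration stats from the output above.
right, wrong = 135, 132
accuracy = right / (right + wrong)
print(round(accuracy, 3))  # 0.506
```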

Question #2 - What do the initial stats mean? Are the correct/incorrect values the number of individual entities (not samples) that NER got right/wrong on the evaluation dataset before training?
And what exactly do the Entities and Unknown values mean?

Now, if I run ner.batch-train on an en_core_web_sm model with NER disabled, or even on a blank model, I get something like this:

BEFORE     0.034
Correct    9
Incorrect  258
Entities   534
Unknown    525

Question #3 - How can it have correct answers on NER labels with a blank model or a model without NER?

Question #4 - The idea was to have an initial set of annotations done manually and then use ner.teach to continue collecting samples. But even with the model with the highest accuracy after initial training (en_core_web_md with NER disabled, accuracy ~0.65), ner.teach gives mostly totally irrelevant and meaningless results (over ~90% reject rate). Even after over 1,000 answers, it hasn’t improved in its suggestions. On the contrary, using ner.make-gold and manually correcting the labels gives much better results. Does that make sense? Why?

Question #5 - Just to reassure I’m using the right approach, the plan was:

  • Have an initial amount of manually annotated labels
  • Train a model with those annotations (use the model with best accuracy)
  • Use ner.teach with that model to continue adding answers (until…?)
  • Combine all collected annotations (manual and from ner.teach) to train a model until I get reasonable results

Does the above make sense? Generally what accuracy can I hope to reach in this scenario?

Thanks!


There’s a risk here: the model only knows that something isn’t an artist if it has a “reject” label, or if the annotations on the text have been marked complete. In the extreme case where you had only “accept” answers, the model can’t learn at all — it may as well predict that every word is an example of the entity!

To fix this you can do a round of ner.manual using the annotations in your existing dataset as a starting point. This will let you convert your existing dataset from partial to full annotations, so the model knows what isn’t an entity. (Actually the workflow for this could be smarter: we should use your annotations to constrain a model, which would predict its best guess. I’ll consider updating ner.make-gold to support this option.)

Not 100% safe no, if your evaluation data only has partial labels. Consider the case where you had 90% accept, 10% reject. There’s still a lot of predictions the data can’t evaluate, so one model might learn to dramatically over-predict, while memorising the specific rejects in your data.

Entities tells you how many total entities the model predicted. Unknown is the number of predicted entities that cannot be evaluated given the labels in your dataset.
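A rough sketch of how a predicted entity might be scored against partial annotations (my own simplification for illustration, not Prodigy's actual evaluation code — the span format is hypothetical):

```python
def overlaps(a, b):
    """Two (start, end, label) spans overlap if their ranges intersect."""
    return a[0] < b[1] and b[0] < a[1]

def score_prediction(pred, accepted, rejected):
    """Score one predicted span against partial annotations."""
    if pred in accepted:
        return "correct"
    if pred in rejected or any(overlaps(pred, a) for a in accepted):
        return "incorrect"  # conflicts with a known annotation
    return "unknown"        # no annotation covers this span

accepted = {(0, 10, "ARTIST")}
rejected = {(12, 20, "WORK_OF_ART")}
print(score_prediction((0, 10, "ARTIST"), accepted, rejected))  # correct
print(score_prediction((30, 35, "ARTIST"), accepted, rejected))  # unknown
```

With only partial labels, a large share of predictions can end up in the "unknown" bucket — which is exactly what the high Unknown count for the blank model shows.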

It’s hard for the model to get started learning something new. If you have few annotations, the space of hypotheses that fit them is enormous. I always think of that thought experiment from Quine: Imagine you’re an anthropologist making first contact with a new people, and trying to figure out their language. A rabbit goes past and they say “gavagai”, so you assume “gavagai” means “rabbit”. But it could mean any number of other things: maybe it means animal. Less plausibly, maybe it means fur. Or maybe the leg of a rabbit, specifically.

We’re always making these inferences that are under-constrained by the data. When you start training a new ML model, it has the same problem — because it can’t guess what you want. So when getting off to a “cold start”, it can be efficient to have labels that are more explicit.

Yes, this makes sense. Try adding some steps of ner.make-gold in there as well at the start, while boot-strapping. You might also be better off making a patterns file to bootstrap with. A terminology list can also be a very good starting point (use terms.teach and terms.to-patterns).
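A patterns file in the JSONL format that terms.to-patterns produces can also be written by hand or generated with a few lines of Python (the artist names and the file name here are just made-up examples):

```python
# Write a minimal patterns file: one JSON object per line, each with a
# label and a token-based pattern, as used by recipes like ner.teach.
import json

patterns = [
    {"label": "ARTIST", "pattern": [{"lower": "pink"}, {"lower": "floyd"}]},
    {"label": "ARTIST", "pattern": [{"lower": "daft"}, {"lower": "punk"}]},
    {"label": "WORK_OF_ART", "pattern": [{"lower": "wish"}, {"lower": "you"},
                                         {"lower": "were"}, {"lower": "here"}]},
]
with open("artist_patterns.jsonl", "w", encoding="utf8") as f:
    for p in patterns:
        f.write(json.dumps(p) + "\n")
```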

As far as stopping criteria: you can try ner.train-curve to figure out how accurate your model is with 50%, 80% etc of your annotations. This gives some insight into how accuracy will look at 120%, 150% etc.

It’s impossible to say without studying the specific dataset (and at that point it’s often easier just to run the model…). It depends on how much information the model must learn to solve the problem. As a rough picture on this, consider the distribution P(label | word). If this distribution of labels given words is low entropy, i.e. if a few specific words make the labels very likely, the model will learn the task quickly. If your labelling depends on multiple words together in context, so the solution will involve learning a lot of “and” and “xor” relationships, learning will take much more data.
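The P(label | word) intuition can be made concrete with a toy entropy calculation (the words and counts below are invented for illustration):

```python
# For each word, compute the entropy of its label distribution.
# Low entropy: the word alone almost decides the label, so the model
# learns it quickly. High entropy: context has to do the work.
from collections import Counter
from math import log2

def label_entropy(observations):
    """observations: list of (word, label) pairs for a single word."""
    counts = Counter(label for _, label in observations)
    total = sum(counts.values())
    return -sum((c / total) * log2(c / total) for c in counts.values())

# "floyd" is almost always part of an ARTIST span -> near-zero entropy.
easy = [("floyd", "ARTIST")] * 9 + [("floyd", "O")]
# "love" shows up in both song titles and artist names -> high entropy.
hard = [("love", "WORK_OF_ART")] * 5 + [("love", "ARTIST")] * 5
print(round(label_entropy(easy), 3), round(label_entropy(hard), 3))  # 0.469 1.0
```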

Thank you for the detailed response, it helped to shed some light on these topics.

Can you please clarify what it means for an annotation to be “marked complete”, or for annotations to be “partial vs. full”, and how do I make one complete? If my annotations were already created with ner.manual, are they considered “complete”?

I assumed that if I marked all existing NER labels on the text, then there is not much information I can add. So anything that wasn’t marked explicitly as a label, is not one in this context (but may be in another).

It’s hard for the model to get started learning something new. If you have few annotations, the space of hypotheses that fit them is enormous. I always think of that thought experiment from Quine: Imagine you’re an anthropologist making first contact with a new people, and trying to figure out their language. A rabbit goes past and they say “gavagai”, so you assume “gavagai” means “rabbit”. But it could mean any number of other things: maybe it means animal. Less plausibly, maybe it means fur. Or maybe the leg of a rabbit, specifically.

We’re always making these inferences that are under-constrained by the data. When you start training a new ML model, it has the same problem — because it can’t guess what you want. So when getting off to a “cold start”, it can be efficient to have labels that are more explicit.

What I was puzzled about is why, when using the same trained model, ner.teach was giving me such poor suggestions (most of them had to be rejected), while ner.make-gold was suggesting much more reasonable labels, roughly consistent with the model’s accuracy, so I had to manually fix only about 50% of them. I admit that I still don’t get it, probably because I still lack understanding of some basic spaCy/Prodigy concepts.

Yes, this makes sense. Try adding some steps of ner.make-gold in there as well at the start, while boot-strapping. You might also be better off making a patterns file to bootstrap with. A terminology list can also be a very good starting point (use terms.teach and terms.to-patterns).

I tried terms.teach at first (after watching the video on training the DRUGS label) and it worked extremely poorly on the source data. I assumed it is because of the nature of the labels I’m trying to extract and the fact that they don’t have any meaning, or a totally different meaning, outside of the specific context of a title that contains song/artist names.

It’s impossible to say without studying the specific dataset (and at that point it’s often easier just to run the model…). It depends on how much information the model must learn to solve the problem. As a rough picture on this, consider the distribution P(label | word). If this distribution of labels given words is low entropy, i.e. if a few specific words make the labels very likely, the model will learn the task quickly. If your labelling depends on multiple words together in context, so the solution will involve learning a lot of “and” and “xor” relationships, learning will take much more data.

Yes, the latter is definitely a strong characteristic of this type of data. As expected, song and artist names can be labeled only in a specific context and not in another, even when they have the exact same tokens.

If you use ner.teach , you’re answering yes / no questions about specific phrases. If you say “yes” to a specific phrase (like “Pink Floyd” labelled as “artist”), we don’t know whether other words in the input are entities too. If you reject, we also don’t know what the real answer was — just that that labelled pair was incorrect.

Yes, annotations from ner.manual and ner.make-gold are assumed to be complete — you look at the whole text and make corrections.

ner.teach doesn’t just ask you about the entities it thinks are most likely — it suggests a mix of things, because if it didn’t, it might get stuck in a state where it’s confidently wrong. So it’s asking you a wider variety of questions intentionally. I also tried to design ner.teach so that the questions would be quick to click through. It’s useful to have negative examples, and clicking “no” to a bunch of stuff in a row can be very quick (I often just spam “yes” or “no” blindly and go back five when a different one has flashed by).

Ahh, ok. That makes a lot of sense. Thanks!

@honnibal well, I thought I got a grasp of it but then I ran into another issue that puzzles me -

When creating annotations (auto-generated with code or manually outside of Prodigy), is there any way to mark an annotation “complete”? i.e. can I tell the model “these are all the entities in this text; anything else would be incorrect”?

The issue I am facing is that when training a model, batch-train reports very good results (~0.9 accuracy), but when I test the model on the same data it was trained/evaluated with, spaCy does find the correct entities with the same high accuracy but also produces a lot of false-positive “noise” entities (entities it detects in the text that shouldn’t be there and were not present in the annotated data). Over 50% of the test samples contain these.

What’s the best approach to deal with this behavior?

EDIT:
I guess I can run my complete accepted annotations through a trained model and auto-generate examples for all entities that the model finds that are not present in the accepted annotations, marking them as “reject”. Then retrain a model from scratch with the original complete annotations plus the newly created rejected annotations. Is this the way to do this, or is there a better way?
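The idea in the edit could be sketched like this (a pure-Python mock-up; the span tuples and task-dict fields are my own simplification, not Prodigy’s exact task schema):

```python
def make_reject_examples(text, gold_spans, predicted_spans):
    """For every predicted (start, end, label) span that isn't in the
    accepted annotations, emit a 'reject' example so the model can
    learn it is a false positive."""
    gold = set(gold_spans)
    rejects = []
    for start, end, label in predicted_spans:
        if (start, end, label) not in gold:
            rejects.append({
                "text": text,
                "spans": [{"start": start, "end": end, "label": label}],
                "answer": "reject",
            })
    return rejects

gold = [(0, 10, "ARTIST")]
predicted = [(0, 10, "ARTIST"), (13, 23, "WORK_OF_ART")]
rejects = make_reject_examples("Pink Floyd - Mock Title", gold, predicted)
print(len(rejects))  # 1
```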

Does this mean that, while training, the model will assume that the entities are the ones marked and that the rest are not entities?
Are there use cases where someone would want to mark something in ner.manual and reject it?

If you run ner.batch-train with the --no-missing flag, then yes. All tokens that are not part of an entity will be assumed to be not part of an entity. Otherwise, non-marked tokens will be treated as missing values.
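One way to picture the difference is with spaCy-style BILUO tags (a hand-rolled sketch, not spaCy’s actual conversion code): with --no-missing, unannotated tokens become "O" (definitely not part of an entity); without it, they become "-" (missing value, which the model isn’t penalized on).

```python
def tags_for(tokens, entity_spans, no_missing):
    """entity_spans: (start_token, end_token, label), end exclusive.
    Unannotated tokens get 'O' (known non-entity) or '-' (missing)."""
    fill = "O" if no_missing else "-"
    tags = [fill] * len(tokens)
    for start, end, label in entity_spans:
        if end - start == 1:
            tags[start] = f"U-{label}"
        else:
            tags[start] = f"B-{label}"
            for i in range(start + 1, end - 1):
                tags[i] = f"I-{label}"
            tags[end - 1] = f"L-{label}"
    return tags

tokens = ["Pink", "Floyd", "-", "Mock", "Song"]
spans = [(0, 2, "ARTIST")]
print(tags_for(tokens, spans, no_missing=True))
print(tags_for(tokens, spans, no_missing=False))
```

The first call yields complete annotations (every token constrained); the second leaves the three unmarked tokens as missing, which is why a model trained without --no-missing can over-predict entities elsewhere in the text.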

The recipe / interface lets you assign labels to text – what they mean is up to you. So you could use the “reject” action to create negative examples, or to mark examples with fundamental problems, like wrong tokenization. Also see this thread: