Debugging NER - batch_train with custom dataset

I am using the NER batch_train recipe code to train a model, which I have customized so I can bring in my own dataset.

I have defined my own evaluation set of 43 examples (i.e. not a random split), and I know each example includes exactly two spans/entities: one of Category 1 and one of Category 2.

However when I go to train my model, I see the following:

It seems to be evaluating against 70 entities, where I expected 86 (43 * 2). The model trains without errors, but hits only 60% accuracy, below what I would expect. The 70 vs 86 discrepancy makes me think I'm doing something wrong when I construct my dataset.

How do I debug this? Ideally I would like to see which 70 entities the model is evaluating against (and therefore which ones it is missing), but I'm also open to other suggestions/advice.

Hi! Thanks for the report. This is definitely surprising, yes. It can sometimes happen that the total number of examples differs, since Prodigy merges annotations on the same input text and splits sentences (if you're not setting --unsegmented). But that doesn't seem to be what's going on here.

Are you training with the --no-missing flag (and the assumption that the entities in the data are the only entities that occur in the text and all other tokens are not entities)? If not, it's possible that the result you see here is a side-effect of how the evaluation works if we assume that all unannotated tokens are missing values. Although, I also find it surprising that the number here is lower than the number of annotated spans in your evaluation (because even with missing values, that's the minimum we know about the correct parse).

Here are a few things to check and try to get to the bottom of this:

  • When you specify an output directory for the model, Prodigy will also save a training.json and evaluation.json, containing the examples it ultimately trained and evaluated on. So you could let it run for one iteration, save out the model and make sure that the examples in evaluation.json are correct and what you'd expect (see the quick sketch after this list).
  • Do you have an "answer" key set on the individual spans in your evaluation data? If not, does anything change if you explicitly set "answer": "accept" on all spans?
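
For instance, a minimal sketch like this (assuming evaluation.json is saved as a JSON list of examples, each with a "spans" list) would tell you whether all 86 spans actually make it into the evaluation data:

```python
import json
from collections import Counter

with open("evaluation.json") as f:
    examples = json.load(f)

# Count how many spans of each label ended up in the evaluation data.
label_counts = Counter(
    span["label"] for eg in examples for span in eg.get("spans", [])
)
print(len(examples), "examples")
print(label_counts)  # you'd expect 43 spans for each of your two labels
```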

Thanks for the reply @ines. I do have "answer": "accept" set, and the evaluation.json looks correct.

Changing to the --no-missing flag had an effect... see below

Is there an explanation somewhere of the mechanics of the NER model and/or its evaluation? I am having trouble interpreting these results at the moment, down to basics like what exactly "RIGHT", "WRONG" and "ENTS" refer to. I will dig into the code if need be, but thought there might be a write-up somewhere (I can't find anything)?

Apologies that the output is a little unintuitive. It's to accommodate the case where you have incomplete information in the dataset.

The entities (ENTS) number is the total number of entities predicted by the model. The correct (RIGHT) number is the number of correct predictions, and incorrect (WRONG) is the number of mistakes, including both false positives and false negatives.

So in your figure above, before any training the model predicts 305 entities, all of which are wrong. It seems there are a further 70 gold-standard entities, none of which it predicts --- so the total number of mistakes is 375, and 0 predictions are correct.
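
To make the bookkeeping concrete, here's a rough sketch of how those three numbers fit together (just an illustration, not the actual evaluation code):

```python
# Rough illustration of how the ENTS / RIGHT / WRONG numbers relate
# (not the actual Prodigy evaluation code).

def summarise(predicted, gold):
    """predicted and gold are sets of (start, end, label) tuples."""
    right = len(predicted & gold)            # correct predictions
    false_positives = len(predicted - gold)  # predicted, but not in the gold data
    false_negatives = len(gold - predicted)  # gold entities the model never predicted
    wrong = false_positives + false_negatives
    ents = len(predicted)                    # total entities the model predicted
    return ents, right, wrong

# For the iteration above: 305 predictions, none correct, 70 gold entities missed,
# so ENTS=305, RIGHT=0, WRONG=305 + 70 = 375.
```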

@honnibal Thanks! So to clarify, the way I would read iteration 1 of the model is that...

  1. The model is predicting 3 entities
  2. There are zero correct predictions, and 73 incorrect predictions (false positives and false negatives combined)
  3. Therefore I can infer there are 70 false negatives (73 - 3)

Is that correct?

In this case, as I have stated above, I KNOW my validation dataset has 86 entities, so there would seem to be a problem with the way it is being read in? (Because false negatives (70) + true positives (0) = 70, not 86.)
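
Spelling out my arithmetic, in case I'm misreading the columns (this just follows the interpretation above):

```python
# Numbers from iteration 1 of my training output.
ents, right, wrong = 3, 0, 73

false_positives = ents - right              # predictions that aren't in the gold data
false_negatives = wrong - false_positives   # gold entities never predicted
gold_total = right + false_negatives        # entities the evaluation thinks exist

print(false_positives, false_negatives, gold_total)  # 3, 70, 70 (but I expect 86)
```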

Sorry for the delay getting back to you on this --- I missed the reply as we've been travelling for PyCon India.

Yes, the evaluation seems to think you only have 70 entities, so it does appear that there's something wrong. Are any examples repeated in your dataset? The only thing I can think of is that there might be conflicting examples, or perhaps the tokenisation has changed so there are examples that don't align to token boundaries?
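
If it helps, here's a quick way to check for both of those (a sketch, assuming your evaluation data is a JSON list of Prodigy-style examples with "text" and "spans"; swap in whichever base model you're training from so the tokenisation matches):

```python
import json
from collections import Counter

import spacy

nlp = spacy.load("en_core_web_sm")  # use the same base model you train from

with open("evaluation.json") as f:
    examples = json.load(f)

# 1) Repeated (and possibly conflicting) input texts
text_counts = Counter(eg["text"] for eg in examples)
for text, count in text_counts.items():
    if count > 1:
        print("Repeated text ({}x): {}".format(count, text[:60]))

# 2) Spans that don't align to token boundaries
for eg in examples:
    doc = nlp.make_doc(eg["text"])
    for span in eg.get("spans", []):
        if doc.char_span(span["start"], span["end"]) is None:
            print("Misaligned span:", repr(eg["text"][span["start"]:span["end"]]), span)
```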