Remarkable Difference Between Prodigy and Custom Training Times

I have been training my NER models using the ner.batch-train recipe, but now I’d also like to train models using spaCy code that I write. The code I write adds nice customizations like checkpointing of the models, but I intend the training to be essentially the same.

I convert the Prodigy style “spans”-as-dictionaries JSONL annotations to spaCy style “entities”-as-tuples. I then run training code whose main loop looks like this:

    optimizer = nlp.begin_training()
    if dev:
        output("      {:15}{:15}{:15}{:15}".format("Loss", "Precision", "Recall", "F-score"))
    else:
        output("      {:15}".format("Loss"))
    for i in range(1, epochs + 1):
        need_write = True
        losses = {}
        batches = minibatch(train, size=compounding(4.0, 32.0, 1.001))
        if verbose:
            batches = IncrementalBar("Batches").iter(list(batches))
        for batch in batches:
            texts = [sample["text"] for sample in batch]
            annotations = [{annotation_name: sample[annotation_name]} for sample in batch]
            nlp.update(texts, annotations, sgd=optimizer, drop=0.2, losses=losses)
        if dev:
            with nlp.use_params(optimizer.averages):
                p, r, f = evaluate(nlp.tokenizer, pipe, dev)
            output("{:<6}{:<15.4}{:<15.4}{:<15.4}{:<15.4}".format(i, losses[pipe_name], p, r, f))
        else:
            output("{:<6}{:<15.4}".format(i, losses[pipe_name]))
        if checkpoint_interval is not None and i % checkpoint_interval == 0:
            # (checkpoint-writing code elided)
            need_write = False
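For reference, the span conversion I describe above looks roughly like this (a sketch, assuming Prodigy-style “spans” dicts with start/end/label keys; not my exact code):

```python
def prodigy_to_spacy(example):
    """Convert one Prodigy-style JSONL example, whose "spans" are
    dictionaries, into the (start, end, label) entity tuples that
    spaCy's NER training code expects."""
    return {
        "text": example["text"],
        "entities": [
            (span["start"], span["end"], span["label"])
            for span in example.get("spans", [])
        ],
    }

example = {
    "text": "the red ball",
    "answer": "accept",
    "spans": [{"start": 4, "end": 7, "label": "COLOR"}],
}
prodigy_to_spacy(example)
# -> {"text": "the red ball", "entities": [(4, 7, "COLOR")]}
```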

I think this is basically the same algorithm as Prodigy. However, when I do ner.batch-train it takes an hour to run 10 iterations and ends up with a final accuracy on a 20% dev set of 0.72. When I run the code above it completes in 6 minutes and has a final precision (equivalent to accuracy) of 0.26. So presumably I’m doing something very different from Prodigy. Also presumably incorrect.

Is there some fundamental aspect of Prodigy’s training algorithm I’m overlooking here?

Are the Prodigy annotations you’re converting “complete” annotations from e.g. ner.manual, or binary decisions?

If you give spaCy binary-style annotations directly, it doesn’t understand that the rest of the information should be interpreted as missing. It will think the entities you provide are the only ones in the text, which isn’t correct.
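To make this concrete, here’s a toy illustration (plain Python, not spaCy’s actual implementation) of what offset annotations mean per token: everything outside a span is read as “O” (definitely not an entity), when sparse binary annotations really mean “-” (unknown):

```python
def token_labels(text, entities, sparse=False):
    """Expand (start, end, label) offsets to one label per whitespace
    token. With sparse=False, unannotated tokens get "O" (asserted
    non-entity); with sparse=True they get "-" (unknown), which is what
    a binary accept actually tells you."""
    labels = []
    pos = 0
    for token in text.split():
        start = text.index(token, pos)
        end = start + len(token)
        pos = end
        label = next((lab for s, e, lab in entities if s <= start and end <= e), None)
        if label is None:
            label = "-" if sparse else "O"
        labels.append(label)
    return labels

text = "Google bought YouTube"
spans = [(0, 6, "ORG")]
token_labels(text, spans)               # ['ORG', 'O', 'O'] -- wrongly asserts "YouTube" is no entity
token_labels(text, spans, sparse=True)  # ['ORG', '-', '-'] -- what the annotation really means
```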

Prodigy does use a different learning algorithm: it uses beam-search with a global model, in order to work with the sparse annotations. The difference between this and the default greedy algorithm in spaCy is rather technical, but the gist of it is that beam-search keeps multiple possible analyses alive at each step.
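A minimal sketch of the general idea (not Prodigy’s implementation): expand every surviving partial analysis at each step, score the candidates, and keep only the best few. Greedy parsing is the special case of a beam of width 1.

```python
def beam_search(expand, score, width=4, steps=3):
    """Keep the `width` highest-scoring partial analyses alive at each
    step, instead of committing to a single action greedily."""
    beam = [([], 0.0)]  # (partial analysis, cumulative score)
    for _ in range(steps):
        candidates = [
            (history + [action], total + score(history, action))
            for history, total in beam
            for action in expand(history)
        ]
        beam = sorted(candidates, key=lambda c: c[1], reverse=True)[:width]
    return beam
```

With a wider beam, an early low-scoring action can still end up in the winning analysis if its continuations score well, which is what lets the sparse annotation constraints steer the search.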

The annotations are coming from binary decisions solicited by Prodigy, e.g.

 {"text": "the red ball", "answer:"accept", "spans":[(3, 6, "COLOR")]}
 {"text": "the blue sock", "answer:"reject", "spans":[(0, 3, "COLOR")]}

I thought this would be correct because these annotations look like the examples in “Training an additional entity type” in the spaCy documentation.

How should I pass this style of annotation in to spaCy?


Well, if your entity density is very low, you might be able to just assume that the only relevant annotations are the ones in the spans. The problem I’m talking about would come from something like this:

 {"text": "Google bought YouTube", "answer:"accept", "spans":[(0, 6, "ORG")]}

This would tell spaCy there’s only one entity, which isn’t correct. Incidentally, how are you representing rejection? You can actually tell spaCy “not an ORG” by labelling the entity "!ORG", but I don’t think we’ve documented that anywhere yet (as it’s a pretty unsatisfying format).
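Roughly, the idea would be something like this (a sketch; field layout as in your converted examples above):

```python
def to_entities(example):
    """Turn one binary Prodigy decision into spaCy-style entity tuples,
    prefixing rejected labels with "!" to mean "not this label"."""
    entities = []
    for start, end, label in example["spans"]:
        if example["answer"] == "reject":
            label = "!" + label
        entities.append((start, end, label))
    return {"text": example["text"], "entities": entities}

to_entities({"text": "the blue sock", "answer": "reject",
             "spans": [(0, 3, "COLOR")]})
# -> {"text": "the blue sock", "entities": [(0, 3, "!COLOR")]}
```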

Ok. I’m not taking the issue you’re describing into account. I’m making the mistake you show above, i.e. incorrectly failing to label “YouTube” as an ORG.

As for rejects, I’m just throwing them away. I was assuming that for NER, every token not labeled as part of an entity span counts as a negative training example for that entity. Is that correct, or should I be adding in the !ORG annotations as well?

The non-labelled tokens do count as negative examples, but you’re still giving the model a very skewed distribution: you’re giving it the ones you said “Yes” to, but it doesn’t see any data from the stuff you said “No” to. So, the model can find a solution to the data in your training set that won’t hold up to other examples.

Have a look at the ner.print-best recipe. This uses the annotations you’ve made as constraints, and then the parser finds the best compatible analysis. (This is part of the supervision in the beam training).

I think the best approach would be to use these “best parses” and feed them through the manual view. We should really have a setting for this in ner.make-gold, but in the meantime I hope it’s not too hard to stitch the recipes together.