Improving an existing model or adding a new entity type to it introduces the catastrophic forgetting problem, and pseudo-rehearsal (PR) is a way to work around this. But I observe that the entities predicted by en_core_web_sm on my dataset are actually wrong. What is the recommended workflow for creating the PR data?
Should we use a portion of the corpus that the model to be improved (e.g. en_core_web_sm) was trained on so that it catches the right entities?
I assume all entity types that the original model contains should be included in the PR data?
If we do not need the whole set of entity types that the original model has, can we drop some of them? (My first try on en_core_web_sm, without mixing in any PR data, labelled many tokens as WORK_OF_ART. Why spend resources remedying that in the first place if I will never use the label?)
Unfortunately there’s still an element of guesswork in the PR workflow — we’re refining things and hope to solidify it into an easy recipe. For now, consider these the best-guess answers:
If you do have the source text used to train the original model, then yes, training on it would be helpful.
You should use all the entities the model predicts, yes. If you start training for a subset of the entities, the weights in the model will need to change to adjust to your new behaviour. We want to have a target that says “produce exactly what you did before”.
One of the benefits of pseudo-rehearsal is that you can just generate the training data by running the original model, so it shouldn't cost you extra resources to keep these extra entity types.
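To make that concrete, here's a minimal sketch of generating the pseudo-rehearsal data with spaCy v2's training-data format. The example texts and the `revision_data` variable are placeholders, not part of any official recipe:

```python
import spacy

# Load the original model whose behaviour we want to preserve.
nlp = spacy.load("en_core_web_sm")

# Raw, unlabelled text -- ideally from the original model's domain,
# or at least text it performs reasonably on.
raw_texts = [
    "Apple is looking at buying a U.K. startup.",
    "The new album was reviewed in The Guardian.",
]

# Run the original model and keep *all* of its predicted entities,
# treating them as if they were gold annotations (spaCy v2 format).
revision_data = []
for doc in nlp.pipe(raw_texts):
    entities = [(ent.start_char, ent.end_char, ent.label_) for ent in doc.ents]
    revision_data.append((doc.text, {"entities": entities}))
```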
I think PR can be a confusing idea, and the recommendation is a bit vague — it’s not clear exactly how much unlabeled text to use, how to select the examples, whether to upsample the new annotations, etc.
It might help to consider that in the paper that introduced the term, PR was used with completely random inputs to each layer. The idea is that you want to learn the same input-to-output function as the original network.
I've never actually tried using random activations, because the function I care about replicating is over sentence input. But I think it's useful to consider how the idea applies to random activations, to give the intuition that the examples we're learning from don't necessarily need to be correct annotations.
All that said, there's a flaw in the way spaCy currently supports PR. Ideally, we would want to preserve the probabilities output by the model's softmax layer, and learn to replicate exactly those probabilities. Instead, we currently treat the annotations produced by the initial model as though they were manual annotations. So we're not quite learning the same function that produced the initial output. The original function might have assigned 0.55 to class A and 0.45 to class B. During pseudo-rehearsal, we take A as correct, so even if the new model also assigns 0.55 to class A, we'll still suffer a loss and change the solution.
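Here's a toy cross-entropy calculation (plain Python, not spaCy internals) illustrating the difference: if the new model reproduces the original distribution exactly, the soft-target loss is already at its minimum, but the hard-label loss still pushes the weights to move.

```python
import math

original = {"A": 0.55, "B": 0.45}   # probabilities from the original model
new_model = {"A": 0.55, "B": 0.45}  # new model reproduces them exactly

# Soft targets (what we would ideally rehearse against):
soft_loss = -sum(original[c] * math.log(new_model[c]) for c in original)

# Hard label (what the current PR setup effectively uses): A is "correct".
hard_loss = -math.log(new_model["A"])

# soft_loss equals the entropy of the original distribution, the minimum
# achievable value, so its gradient is zero when the models agree.
# hard_loss (about 0.60) still produces a gradient pushing P(A) towards 1.0,
# so the weights drift away from the original solution.
```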
I would try taking many sentences from your target domain, and performing one epoch of updates over them, with a low learning rate and a large batch size. Evaluate your accuracy before and after these updates, to find out whether your model got worse during this self-training process. Once you have a set of hyper-parameters that you know doesn’t change your accuracy too much, you can incorporate your new annotations. Now try to increase the learning rate so that you learn from the new annotations effectively.
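A rough sketch of that self-training pass, using spaCy v2's update loop. The batch size, learning-rate value, availability of `resume_training`, and the optimizer's learning-rate attribute name all depend on your spaCy/thinc versions, so treat these as assumptions to adapt:

```python
import random
from spacy.util import minibatch

# `nlp` and `revision_data` are the model and pseudo-rehearsal data from above.
optimizer = nlp.resume_training()  # continue training the existing weights (v2.1+, assumption)
optimizer.alpha = 0.0001  # learning rate; attribute name is `alpha` in thinc v7 (assumption)

random.shuffle(revision_data)
losses = {}
# One epoch, large batches, low learning rate.
for batch in minibatch(revision_data, size=128):
    texts, annotations = zip(*batch)
    nlp.update(texts, annotations, sgd=optimizer, drop=0.0, losses=losses)

# Evaluate on a held-out set before and after this pass; only start mixing in
# your new annotations (and raising the learning rate) once accuracy is
# roughly unchanged by the self-training step.
```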