I do a standard classification with prodigy textcat.batch-train. The source document is basically the gold standard, and all sentence-label answers are “accept”. This will cause catastrophic forgetting.
I’ve seen one example in which the author solves this by copying the sentence, giving it another label, and rejecting the answer. A document characterized by 4 labels then becomes 4 times as long, and 75% of the answers are rejects. I tried this approach in one test: the first step had an F-score of 0.95, and the loss dropped by only 15% from the first to the last epoch. So somehow I’m not comfortable with this: you create certainty that is not supported by the document.
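In code, the expansion I tried looks roughly like this sketch, which works on tasks in Prodigy’s JSONL format with “text”, “label” and “answer” fields; the label set and file names are placeholders:

```python
import json

LABELS = ["LABEL_A", "LABEL_B", "LABEL_C", "LABEL_D"]  # placeholder label set

def expand_with_rejects(task):
    """For one gold sentence-label pair, emit the accepted task plus a
    rejected copy for every other label (1 accept, 3 rejects)."""
    yield {"text": task["text"], "label": task["label"], "answer": "accept"}
    for label in LABELS:
        if label != task["label"]:
            yield {"text": task["text"], "label": label, "answer": "reject"}

# gold.jsonl / expanded.jsonl are placeholder file names
with open("gold.jsonl") as f_in, open("expanded.jsonl", "w") as f_out:
    for line in f_in:
        for example in expand_with_rejects(json.loads(line)):
            f_out.write(json.dumps(example) + "\n")
```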
I was thinking of creating artificial incorrect answers, where some percentage of the sentence-label pairs is rejected. This introduces randomness and prevents catastrophic forgetting; however, you need more data to get good predictive power from the model.
Another approach I’m considering is to add roughly 10% extra sentences with a (randomly selected) incorrect label and reject those answers, so a mix of the first two approaches (sketched below). I feel this is closest to the Prodigy approach, where an annotator will get a certain number of labels wrong.
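Roughly what I have in mind, again with a placeholder label set and file names, and a fixed seed for reproducibility; the 10% fraction is just the number mentioned above:

```python
import json
import random

LABELS = ["LABEL_A", "LABEL_B", "LABEL_C", "LABEL_D"]  # placeholder label set
REJECT_FRACTION = 0.10  # add roughly 10% extra rejected examples

def add_random_rejects(tasks, fraction=REJECT_FRACTION, seed=0):
    """Keep every accepted gold task; for a random ~fraction of them, also
    emit a copy with a randomly chosen wrong label, marked as rejected."""
    rng = random.Random(seed)
    for task in tasks:
        yield task
        if rng.random() < fraction:
            wrong = rng.choice([l for l in LABELS if l != task["label"]])
            yield {"text": task["text"], "label": wrong, "answer": "reject"}

# gold.jsonl / mixed.jsonl are placeholder file names
with open("gold.jsonl") as f_in, open("mixed.jsonl", "w") as f_out:
    tasks = (json.loads(line) for line in f_in)
    for example in add_random_rejects(tasks):
        f_out.write(json.dumps(example) + "\n")
```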
Or, as a last possibility, is this catastrophic forgetting a complete non-issue, since you use a dropout of 15%–25% anyway?
If I understand correctly, you’re streaming in documents, and you’ll hit one document where all the sentences receive the label IN_TRUE_DOC. Then you’ll get through all of those, and the next document might go the other way, so you’ll label all of those samples IN_FALSE_DOC. I think you’re right that this will make learning difficult. Ideally, for stochastic gradient descent to work, you want to be drawing an i.i.d. sequence of samples for your updates.
If the above is correct — if you have two classes, IN_TRUE_DOC and IN_FALSE_DOC — then I can think of a couple of solutions. One is to do the sentence splitting as a pre-process, so that you can shuffle over the sentences. This loses you the document context though, which may make the annotation task too difficult (on the other hand, if a lot of document context is needed, the model will probably struggle too). Another approach is to change the sentence-level classification task so that you’re only marking the sentences which determine that the doc is actually true. Irrelevant sentences get labelled as the other class. This works well for some problems, but not for others.
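For the first option, the pre-process could be as simple as the following sketch, which splits documents into sentences and shuffles them across documents before annotation. It uses spaCy’s rule-based sentencizer with v3-style add_pipe syntax, and docs.jsonl / sentences.jsonl are placeholder file names; adjust for your spaCy version and data:

```python
import json
import random
import spacy

# Blank pipeline with the rule-based sentencizer (spaCy v3 syntax).
nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")

def split_and_shuffle(doc_texts, seed=0):
    """Split each document into sentences, then shuffle across documents
    so the annotation stream is closer to i.i.d."""
    sentences = []
    for doc in nlp.pipe(doc_texts):
        sentences.extend(sent.text.strip() for sent in doc.sents)
    random.Random(seed).shuffle(sentences)
    return sentences

# docs.jsonl is expected to have one {"text": ...} record per line.
with open("docs.jsonl") as f_in, open("sentences.jsonl", "w") as f_out:
    docs = [json.loads(line)["text"] for line in f_in]
    for sent in split_and_shuffle(docs):
        f_out.write(json.dumps({"text": sent}) + "\n")
```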
One thing I’m actually unsure about from your question though is, do you have any negative examples at all? If you don’t have any examples which naturally represent the negative class, then I’d say you have much more fundamental problems than the “catastrophic forgetting” problem. You really need to be able to sample from the distribution you’ll see at test time, in order to create training and evaluation annotations. You can often compromise on this to some extent for the training data, because knowledge often transfers well from one distribution to a slightly different one. The requirement is much more urgent for the evaluation data, though. If you can’t get evaluation data that represents what you’ll need the model to do at run-time, then it’ll be really difficult to reason from your experiments.