Generating examples in spaCy to address catastrophic forgetting

I'm trying to add new entities to the pretrained 'en_core_web_sm' model in spaCy. To avoid catastrophic forgetting, I created a rehearsal/revision dataset using the original 'en_core_web_sm'. As expected, the model's predictions contain errors on the texts I'm using.

Should I use these predicted entities as-is for training with pseudo-rehearsal? Or do I need to modify/re-annotate the entities so that they represent the ground truth before combining them with my training dataset?

Any insights into this well-documented issue are welcome. Thank you.

Hi! This is the idea, yes – basically, you'd be updating the model with exactly what it predicted before, which typically shouldn't make it better or worse. It'll just prevent it from "forgetting" what it previously predicted.
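The mixing step itself can be sketched in plain Python – a minimal, hypothetical helper (the function name and `ratio` parameter are my own, not a spaCy or Prodigy API) that blends gold examples for the new label with pseudo-rehearsal examples before shuffling, so every batch sees a mix of old and new annotations:

```python
import random

def mix_rehearsal(new_examples, rehearsal_examples, ratio=1.0, seed=0):
    """Combine gold examples for a new label with pseudo-rehearsal
    examples (the model's own previous predictions), then shuffle
    so batches contain a blend of old and new annotations."""
    rng = random.Random(seed)
    # cap the rehearsal sample at what's available
    n = min(int(len(new_examples) * ratio), len(rehearsal_examples))
    mixed = list(new_examples) + rng.sample(list(rehearsal_examples), n)
    rng.shuffle(mixed)
    return mixed
```

The `ratio` knob controls how much rehearsal data accompanies each unit of new data; 1:1 is a common starting point, but it's worth tuning.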

If your dataset is quite small, you could of course also use the opportunity and make some corrections, e.g. using a workflow like ner.correct in Prodigy. Or you could start by looking at a random sample and see where the current predictions are at – maybe there's some low-hanging fruit you can easily fix to also improve the predictions on the existing categories.

@ines Thank you for your reply. I have a follow-up question. I tried to use the entities predicted by 'en_core_web_sm' to further train the same model, hoping the model would then score highly on validation data. However, performance deteriorated compared to the base model. Can you explain what might be causing this? I only used a small training set and the default training hyperparameters.

Can you share some more details on the training and evaluation data? Does your data include examples of the new categories and the previous predictions? And are you evaluating on a random split, or do you have a separate dedicated evaluation set?

Sure. I do not include the new categories in the training loop. The train and validation sets are generated with a random split.

I sourced the data from here:

I used the following code to get my train and validation sets:

import random

import pandas as pd
import spacy

# subroutine to generate annotated examples
def generate_examples(df):
    examples = []
    for doc in nlp.pipe(df['text']):
        if len(doc.ents) > 0:
            dct = {'text': doc.text}
            dct['label'] = [[ent.start_char, ent.end_char, ent.label_]
                            for ent in doc.ents]
            examples.append(dct)  # collect the example
    return examples

nlp = spacy.load('en_core_web_sm')

df = pd.read_json('./News_Category_Dataset_v2.json', lines=True)

# random shuffle and extract train, val subsets
rng = list(df.index)
random.shuffle(rng)  # shuffle the index so the split is actually random
df1 = df.loc[rng[:1000]].copy()  # train
df2 = df.loc[rng[1000:2000]].copy()  # val

#annotated train examples
train_df = pd.DataFrame(generate_examples(df1))
#annotated val examples
val_df = pd.DataFrame(generate_examples(df2))

train_df.to_json('./train_nlp_rehearsal_1000.jsonl', orient='records', lines=True)
val_df.to_json('./val_nlp_rehearsal_1000.jsonl', orient='records', lines=True)

I used the above for training and validation. Please find the .jsonl files attached if they help.
val_nlp_rehearsal_1000.jsonl (202.3 KB)
train_nlp_rehearsal_1000.jsonl (205.8 KB)

This is the performance on the validation set after a 'simple training' loop (50 epochs) with minibatching (using default hyperparameters):

Thanks for the details! And you're importing the data into Prodigy and training with Prodigy, right? If so, this part looks wrong to me:

You're basically setting the "label" in the JSON to something like (1, 2, "FOO"), which is not the format Prodigy expects for named entities – those are represented as a list of "spans":
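For reference, a single NER example in the format Prodigy expects looks roughly like this (the text and span are taken from the training data further down this thread):

```python
# one NER training example in Prodigy's format:
# character offsets live under "spans", not under "label"
example = {
    "text": "eBay Bans Confederate Flags.",
    "spans": [
        {"start": 0, "end": 4, "label": "ORG"},
    ],
}
```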

So it looks like you've basically been updating your model with no annotated spans for all the examples and trying to teach it that there are no entities.

Sorry for the confusion, but I'm training in spaCy. I re-format the data as below and run the training loop. Does this look correct?

from spacy.training import Example
from spacy.util import minibatch, compounding

epochs = 50

train_data = [('eBay Bans Confederate Flags. The website called it a "symbol of divisiveness and racism."',
  {'entities': [(0, 4, 'ORG')]}),
 ('Derek Jeter Fakes Out Opponent To Start Double Play. ',
  {'entities': [(0, 17, 'PERSON')]}),
 ('Brazilian Squatters Offer Shelter From Anti-LGBTQ Violence. “It’s not my fault that I live in a society with an empty heart and mind.”',
  {'entities': [(0, 9, 'NORP')]})]

# get the names of the other pipes so we can disable them and train only NER
other_pipes = [pipe for pipe in nlp.pipe_names if pipe != 'ner']

optimizer = nlp.resume_training()
with nlp.disable_pipes(*other_pipes):
    for i in range(epochs):
        losses = {}
        batches = minibatch(train_data, size=compounding(4.0, 16.0, 1.001))
        for batch in batches:
            examples = []
            texts, annotations = zip(*batch)
            for j in range(len(texts)):
                # create an Example from the raw text and the annotations
                doc = nlp.make_doc(texts[j])
                example = Example.from_dict(doc, {"entities": annotations[j]['entities']})
                examples.append(example)  # collect the Example for this batch
            # update the model on the batch
            nlp.update(examples, sgd=optimizer, drop=0.2, losses=losses)
        print(f'losses at iteration {i}: {losses}')

Is train_data here the entire data you're updating with, or is that just an example? Because with so few examples, you can easily introduce forgetting effects again. You really want to retrain with a larger corpus and annotations of all the entities, old and new together.

Also, if you're using spaCy v3, you should really use spacy train with a config file, to make sure you're defining all of the relevant settings (which you're not doing in your basic training loop). You can use Prodigy's data-to-spacy to export all your annotations to spaCy's .spacy format and then run spacy train with those annotations.
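As a rough sketch, the workflow could look like this (the file paths and the my_ner_dataset name are placeholders for your own):

```shell
# generate a base config with an NER component
python -m spacy init config config.cfg --lang en --pipeline ner

# export your Prodigy annotations to spaCy's binary format
# (replace my_ner_dataset with your Prodigy dataset name)
prodigy data-to-spacy ./corpus --ner my_ner_dataset

# train from the exported corpus
python -m spacy train config.cfg --paths.train ./corpus/train.spacy --paths.dev ./corpus/dev.spacy
```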

You can also train from a directory of .spacy files, so you can export your annotations of the new label and your auto-generated examples of the previous labels as separate files, and then train from all of them together.

Auto-generating your examples will also be a lot easier because you can just save out the spaCy Doc objects that your model already produces:

Sorry about the delay in replying. Yes, I was training with only a few examples to understand catastrophic forgetting and to learn how to avoid it when training a custom NER model.

It is also helpful to know that training needs both the old and new entities together, i.e. annotated in the same training examples, and not in mutually exclusive training datasets.

As you've mentioned, I will look into using spacy train instead of a basic training loop. Thank you for all the feedback.