Custom ner.batch-train

When I wanted to implement your suggested solution against ‘catastrophic forgetting’, I first had a look at how ner.batch-train is implemented in Prodigy. In ner.py:360 (or somewhere pretty close, I have a few lines of modifications in ner.py) there is a call to model.batch_train, and I saw that model is an instance of EntityRecognizer, which comes as a Cython binary. So I checked the Prodigy documentation in PRODIGY_README.html#model-api, but couldn't find any documentation of the batch_train method there.

I can guess what it does and it’s not really an issue, just wanted to let you know so you can update the documentation.

Thanks a lot, will fix that!

Thank you!

I also just realized that I would actually need that part of the documentation now. I'm implementing a remedy for catastrophic forgetting, but I don't really know how to pass the rehearsal data to the batch_train method. I edited the ner.batch-train recipe like so:


[...]

    if len(evals) > 0:
        print_(printers.ner_update_header())

    # Fighting Alzheimer's: mix in rehearsal examples parsed by the current
    # model, so the update doesn't overwrite what the model already knows
    revision_texts = [a['text'] for a in DB.get_dataset('rehearsal')]
    revision_texts = revision_texts[:len(examples)]
    revision_data = []
    for doc in nlp.pipe(revision_texts):
        tags = [w.tag_ for w in doc]
        heads = [w.head.i for w in doc]
        deps = [w.dep_ for w in doc]
        entities = [(e.start_char, e.end_char, e.label_) for e in doc.ents]
        revision_data.append(
            (doc, spacy.gold.GoldParse(doc, tags=tags, heads=heads,
                                       deps=deps, entities=entities)))
    # oversample the new annotations 5x relative to the rehearsal data
    training_data = revision_data + examples * 5

    for i in range(n_iter):
        losses = model.batch_train(training_data, batch_size=batch_size,
                                   drop=dropout, beam_width=beam_width)
...

But I now realized that these are in different formats: revision_data contains (doc, GoldParse) tuples, while examples contains dictionaries with text and spans. Is there a way to still use batch_train, or will I have to implement my own batching logic with nlp.update? If it comes to that, I imagine it would look roughly like the sketch below.
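Here's what I have in mind for that fallback (an untested sketch on my side, assuming spaCy v2's training API; nlp, revision_texts, n_iter, batch_size and dropout come from the recipe context above):

import random
from spacy.util import minibatch

# Annotations in the dict format used by spaCy's training examples,
# e.g. {'entities': [(start_char, end_char, label)]}
train_data = [(doc.text,
               {'entities': [(e.start_char, e.end_char, e.label_)
                             for e in doc.ents]})
              for doc in nlp.pipe(revision_texts)]

# resume_training() keeps the existing weights (spaCy v2.1+);
# on v2.0, nlp.entity.create_optimizer() does the same job
optimizer = nlp.resume_training()
for i in range(n_iter):
    random.shuffle(train_data)
    losses = {}
    for batch in minibatch(train_data, size=batch_size):
        texts, annotations = zip(*batch)
        nlp.update(texts, annotations, drop=dropout, sgd=optimizer,
                   losses=losses)
    print(losses)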

Hello @ines, can you please point me in the right direction regarding my last post?

Hi Stephan,

Thanks for pointing out that method — we’ll get that updated. The signature of the method is:

def batch_train(self, examples, batch_size=32, drop=0., beam_width=16):
    """Perform a training epoch on a dataset, using minibatched SGD.

    examples (list): A list of example records. Each record should be a
        dict with the keys "text" and "spans". The value of "text" should
        be a string, and the value of "spans" should be a list of dicts.
        Each span dict should have the keys "answer", "start", "end" and
        "label". The value of "answer" should be one of "accept", "reject"
        or "ignore", and "start" and "end" should be character offsets of
        the span relative to the example text.
    batch_size (int): The minibatch size for the updates.
    drop (float): The dropout rate.
    beam_width (int): Number of candidate parses to consider during
        training. In order to perform an update, the model finds the best
        parse subject to the annotation constraints. The beam width
        controls the depth of this search. A wider beam may improve
        accuracy, but will slow down training, especially on long inputs.
    """

Looking at your code, I think you just need to change the revision data to be a list of dicts like this:

revision_data.append({
    'text': doc.text,
    'spans': [{'start': e.start_char, 'end': e.end_char,
               'label': e.label_, 'answer': 'accept'}
              for e in doc.ents]
})

Let me know how you go — it can be a bit of a fiddly process, figuring out how much data to include etc.

Excellent help, fantastic.

I think the reason I didn't come up with this solution myself was that I thought the examples somehow had to be provided as a GoldParse. Thanks for pointing that out.

I just implemented it and it looks like it does what it's supposed to do. I only changed your code a little, because I think you accidentally moved the 'answer': 'accept' into the span when it should apply to the whole example, right? I'll post my code here in case someone else is looking for the same solution:

    [...]

    revision_texts = [a['text'] for a in DB.get_dataset('rehearsal')]
    revision_texts = revision_texts[:len(examples)]
    revision_data = []
    for doc in nlp.pipe(revision_texts):
        revision_data.append({
            'text': doc.text,
            'answer': 'accept',
            'spans': [
                {'start': e.start_char, 'end': e.end_char, 'label': e.label_}
                for e in doc.ents
            ]
        })
    # oversample the new annotations 5x relative to the rehearsal data
    training_data = revision_data + examples * 5

    for i in range(n_iter):
        losses = model.batch_train(training_data, batch_size=batch_size,
                                   drop=dropout, beam_width=beam_width)

    [...]
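As a quick sanity check for forgetting, I also compare which entities the retrained model still predicts on the rehearsal texts. This is just a little helper I wrote for myself (hypothetical names, not part of any Prodigy recipe):

def entity_overlap(nlp_before, nlp_after, texts):
    """Fraction of the old model's entities that the retrained model
    still predicts with the same offsets and label."""
    kept, total = 0, 0
    for doc_old, doc_new in zip(nlp_before.pipe(texts), nlp_after.pipe(texts)):
        old_ents = {(e.start_char, e.end_char, e.label_) for e in doc_old.ents}
        new_ents = {(e.start_char, e.end_char, e.label_) for e in doc_new.ents}
        kept += len(old_ents & new_ents)
        total += len(old_ents)
    return kept / total if total else 1.0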

@honnibal I realized that you were totally right with your code and that I messed it up in the version I posted. Took me a while to realize that.

But now I’m struggling with what you had mentioned before, which is ‘how much data to include’.

In a first experiment I prepared 500 sentences of ‘rehearsal only’, without any new data. However, when I train my model with that rehearsal data, the accuracy drops from 0.96 to 0.66.

Do you have an explanation for that? Is the rehearsal set way too small, so that I'm overfitting on it?

Can you maybe point me to the right orders of magnitude for the amount of rehearsal data and new examples that I'd need to train a new entity without forgetting the old ones?

Thanks a lot for all your help!

Honestly it’s pretty hard to guess — the best solution is to just try with more data and see what happens. Wish I could be more help!

Ok, I’ll figure it out :wink:

Thanks for still responding! I'll report back with my results.