Training the NER pipeline component of an existing model

Hi,

I'm learning from your online course (https://course.spacy.io/en/chapter4)
and am trying to train the NER pipeline component of an existing pretrained model (lg/trf).

Is it possible to train all models? I've read a few posts claiming it isn't possible with en_core_web_trf.

I have run this code:

with nlp.disable_pipes(*unaffected_pipes):
    # Loop for 10 iterations
    for i in range(10):
        # Shuffle the training data
        random.shuffle(TRAINING_DATA)
        losses = {}
        # Create batches and iterate over them
        for batch in spacy.util.minibatch(items=TRAINING_DATA, size=2):
            # Split the batch into texts and annotations
            texts = [text for text, annotation in batch]
            annotations = [annotation for text, annotation in batch]
            # Update the model
            nlp.update(texts, annotations, losses=losses)

    # Save the model
    nlp.to_disk("/content/sample_data")

I got this error:
ValueError: [E989] nlp.update() was called with two positional arguments. This may be due to a backwards-incompatible change to the format of the training data in spaCy 3.0 onwards. The 'update' function should now be called with a batch of Example objects, instead of (text, annotation) tuples.

From reading the documentation on the new update() function, I still don't understand how to apply it.

I also don't see any mention of the minibatch() function.
I used to run this line of code:
spacy.util.minibatch(items=TRAINING_DATA, size=2)
Is minibatching not used anymore?
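
My best guess at a spaCy v3 version, based on the Example.from_dict snippet I found in the docs (and assuming my annotations are dicts like {"entities": [(start, end, label), ...]}), would be something like this, though I'm not sure it's correct:

import random
import spacy
from spacy.training import Example

with nlp.disable_pipes(*unaffected_pipes):
    for i in range(10):
        random.shuffle(TRAINING_DATA)
        losses = {}
        for batch in spacy.util.minibatch(TRAINING_DATA, size=2):
            # Convert each (text, annotations) tuple into an Example object
            examples = []
            for text, annotations in batch:
                doc = nlp.make_doc(text)
                examples.append(Example.from_dict(doc, annotations))
            # Update the model with the batch of Example objects
            nlp.update(examples, drop=0.5, losses=losses)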

I also see many posts suggesting the following:

# Import requirements
import random
from spacy.util import minibatch, compounding
from pathlib import Path

# TRAINING THE MODEL
with nlp.disable_pipes(*unaffected_pipes):
    # Training for 30 iterations
    for iteration in range(30):
        # Shuffle examples before every iteration
        random.shuffle(TRAIN_DATA)
        losses = {}
        # Batch up the examples using spaCy's minibatch
        batches = minibatch(TRAIN_DATA, size=compounding(4.0, 32.0, 1.001))
        for batch in batches:
            texts, annotations = zip(*batch)
            nlp.update(
                texts,  # batch of texts
                annotations,  # batch of annotations
                drop=0.5,  # dropout - make it harder to memorise data
                losses=losses,
            )
            print("Losses", losses)

This code is different from what's presented in the documentation, where the following is given as the proper approach:

for raw_text, entity_offsets in train_data:
    doc = nlp.make_doc(raw_text)
    example = Example.from_dict(doc, {"entities": entity_offsets})
    nlp.update([example], sgd=optimizer)

Do both work?

Also, what is the compounding function passed to size in the minibatch function?

Please advise, many many thanks!!

Hi! We try to keep this forum focused on Prodigy. For general questions around spaCy, the discussions board is usually a better place: https://github.com/explosion/spaCy/discussions

To quickly answer your question: chapter 4 of the spaCy course explains the training loop and shows how it works in spaCy v2. If you're using spaCy v3, you usually don't want to implement the loop from scratch; use the CLI instead, which takes care of all the required settings to make sure you get good performance. See the training documentation here: https://spacy.io/usage/training. Also see the section there on how to export your training corpus.
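
As a rough sketch, assuming your TRAINING_DATA is a list of (text, {"entities": [(start, end, label), ...]}) tuples, converting it to the binary .spacy format could look something like this:

import spacy
from spacy.tokens import DocBin

nlp = spacy.blank("en")
db = DocBin()
for text, annotations in TRAINING_DATA:
    doc = nlp.make_doc(text)
    spans = []
    for start, end, label in annotations["entities"]:
        # char_span returns None if the offsets don't map to token boundaries
        span = doc.char_span(start, end, label=label)
        if span is not None:
            spans.append(span)
    doc.ents = spans
    db.add(doc)
db.to_disk("./train.spacy")

You can then generate a config and run training from the command line, for example:

python -m spacy init config config.cfg --lang en --pipeline ner
python -m spacy train config.cfg --output ./output --paths.train ./train.spacy --paths.dev ./dev.spacy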

Hi Ines,

Thanks so much for the quick reply.
Sorry, my bad. I'll check the discussions board.

Thanks again :slight_smile: