Hi,
I'm learning from your online course (https://course.spacy.io/en/chapter4)
and am trying to train a new pipeline component on top of an existing pretrained model (en_core_web_lg / en_core_web_trf).
Is it possible to do this with all models? I've read a few posts claiming it isn't possible with en_core_web_trf.
I ran this code:
with nlp.disable_pipes(*unaffected_pipes):
    # Loop for 10 iterations
    for i in range(10):
        # Shuffle the training data
        random.shuffle(TRAINING_DATA)
        losses = {}
        # Create batches and iterate over them
        for batch in spacy.util.minibatch(items=TRAINING_DATA, size=2):
            # Split the batch into texts and annotations
            texts = [text for text, annotation in batch]
            annotations = [annotation for text, annotation in batch]
            # Update the model
            nlp.update(texts, annotations, losses=losses)

# Save the model
nlp.to_disk("/content/sample_data")
I have gotten this error:
ValueError: [E989] nlp.update() was called with two positional arguments. This may be due to a backwards-incompatible change to the format of the training data in spaCy 3.0 onwards. The 'update' function should now be called with a batch of Example objects, instead of (text, annotation) tuples.
From reading the documentation on the new update() function, I still don't understand how to apply it.
I also see no mention of the minibatch() function.
I used to run this line of code:
spacy.util.minibatch(items=TRAINING_DATA, size=2)
Is minibatching not used anymore?
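Based on the error message, here is my attempt at converting the loop to spaCy 3: wrapping each (text, annotation) pair in an Example object before calling nlp.update(). I'm not sure this is right — I'm assuming my annotations are dicts like {"entities": [...]}, and I've kept my existing unaffected_pipes and TRAINING_DATA names from above:

import random
import spacy
from spacy.training import Example

with nlp.disable_pipes(*unaffected_pipes):
    # Resume training of the pretrained pipeline to get an optimizer
    optimizer = nlp.resume_training()
    for i in range(10):
        random.shuffle(TRAINING_DATA)
        losses = {}
        for batch in spacy.util.minibatch(TRAINING_DATA, size=2):
            # Convert each (text, annotation) pair into an Example object
            examples = []
            for text, annotation in batch:
                doc = nlp.make_doc(text)
                examples.append(Example.from_dict(doc, annotation))
            # Update with a batch of Example objects, as the error suggests
            nlp.update(examples, sgd=optimizer, losses=losses)

Is this the right way to do it?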
I also see many posts suggesting the following:
# Import requirements
import random
from spacy.util import minibatch, compounding
from pathlib import Path

# TRAINING THE MODEL
with nlp.disable_pipes(*unaffected_pipes):
    # Training for 30 iterations
    for iteration in range(30):
        # Shuffling examples before every iteration
        random.shuffle(TRAIN_DATA)
        losses = {}
        # Batch up the examples using spaCy's minibatch
        batches = minibatch(TRAIN_DATA, size=compounding(4.0, 32.0, 1.001))
        for batch in batches:
            texts, annotations = zip(*batch)
            nlp.update(
                texts,        # batch of texts
                annotations,  # batch of annotations
                drop=0.5,     # dropout - make it harder to memorise data
                losses=losses,
            )
        print("Losses", losses)
This code is different from the code presented in the documentation, where the following is given as the proper approach:
for raw_text, entity_offsets in train_data:
    doc = nlp.make_doc(raw_text)
    example = Example.from_dict(doc, {"entities": entity_offsets})
    nlp.update([example], sgd=optimizer)
Do both work?
Also, what does the compounding() function that's passed to size in minibatch() do?
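From poking around, my understanding is that compounding(4.0, 32.0, 1.001) yields an infinite series of batch sizes: starting at 4.0, multiplying by 1.001 each step, and capped at 32.0. Roughly something like this (my own hypothetical re-implementation, just to check my understanding):

from itertools import islice

def my_compounding(start, stop, compound):
    # What I think spaCy's compounding does: start at `start`,
    # multiply by `compound` each step, never exceeding `stop`
    curr = float(start)
    while True:
        yield min(curr, stop)
        curr *= compound

print(list(islice(my_compounding(4.0, 32.0, 1.001), 5)))
# roughly [4.0, 4.004, 4.008, 4.012, 4.016]

If that's right, minibatch would start with small batches of about 4 examples and gradually grow them toward 32 over training. Is that the idea?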
Please advise, many many thanks!!