spaCy 3 nightly: How to further train an existing model


I've successfully trained a German BERT-based NER model using spaCy 3's spacy train with the new configuration system. YEAH!

Prior to training, I used Prodigy's data-to-spacy recipe to export my existing NER training data to JSON and then converted the JSON data to the new spaCy binary format using spacy convert.

Now I've done the same with my existing text-categorization data.

From the NER experiment I know how to specify the converted textcat data as train and eval data in the spacy train command. But what I can't figure out is how to specify the NER model from step 1 as the model to be trained with the textcat train-data.

Any hints appreciated. TIA.

Hi! Glad to hear your training has been successful :tada:

spaCy v3 moves away from the idea of a "base model", which could lead to very subtle, unintuitive behaviours, and made it harder to update components selectively (like, keep the tagger and parser, update the existing NER and add a new textcat).

Instead, you can now explicitly choose which components you want to keep from an existing trained pipeline, which components to update from your data, and which components to ignore. This lets you source components from different pipelines and makes it easier to only update some components (all while training from the CLI and a config). You can read more about this here:

So in your case, you could create a config with two components: ner and textcat. The ner component can then define source = "/your/bert/ner/model" so it's loaded from your existing model. If you add it to frozen_components, it won't be updated during training – so your data will only be used for the new text classifier. Here's an example:

[nlp]
lang = "de"
pipeline = ["ner", "textcat"]
# etc.

[training]
frozen_components = ["ner"]
# etc.

[components]

[components.ner]
source = "/your/bert/ner/model"

[components.textcat]
factory = "textcat"
# etc.
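
As a rough sanity check before training, the relationships the config above relies on can be verified with a few lines of Python. This is only a sketch: it uses the standard library's configparser as a stand-in for spaCy's own config system, assumes the values are JSON-like literals as in the example, and the function name check_pipeline_config is made up for illustration. The section names ([nlp], [training], [components.ner], [components.textcat]) follow the spaCy v3 config layout.

```python
import configparser
import json

# Same settings as the example config, with the "# etc." parts left out.
CONFIG_TEXT = """
[nlp]
lang = "de"
pipeline = ["ner", "textcat"]

[training]
frozen_components = ["ner"]

[components]

[components.ner]
source = "/your/bert/ner/model"

[components.textcat]
factory = "textcat"
"""


def check_pipeline_config(text):
    """Return a list of problems found in a spaCy-style config string.

    Hypothetical helper: it only checks that frozen components exist in the
    pipeline and that each pipeline component has a source or a factory.
    """
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(text)
    pipeline = json.loads(parser["nlp"]["pipeline"])
    frozen = json.loads(parser["training"].get("frozen_components", fallback="[]"))
    problems = []
    for name in frozen:
        if name not in pipeline:
            problems.append(f"frozen component {name!r} is not in the pipeline")
    for name in pipeline:
        section = f"components.{name}"
        if not parser.has_section(section):
            problems.append(f"missing section [{section}]")
        elif not ({"source", "factory"} & set(parser[section])):
            problems.append(f"[{section}] needs either 'source' or 'factory'")
    return problems


problems = check_pipeline_config(CONFIG_TEXT)
print(problems or "config looks consistent")
```

spaCy validates the config itself when training starts, so a check like this is mainly useful for catching typos in component names early.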

Hi Ines,
thank you so much for the fast reply.
I’ll give it a try and let you know how it went.

Happy holidays, and stay safe!

Sorry Ines, forgot to ask:

When I converted my previous train-data to the new spaCy binary format, I received the following warning:

UserWarning: [W027] Found a large training file of nnnnnnn bytes.
Note that it may be more efficient to split your training data into multiple smaller JSON files instead.

If I split the JSON into several files, can I still convert the JSON files into a single spacy format file? If yes, how would I do that?

And if I can't, can I specify more than one train file on the command line?

Or would I use your instructions from your above post and do several consecutive trainings?

In this case, you would have multiple .spacy files – that's the idea and what the warning suggests you should do :slightly_smiling_face: And yes, instead of a single .spacy file, you can also provide the path to a directory of files.
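
One way to get there is to split the exported JSON before converting it. The sketch below uses only the standard library and assumes the export is a single JSON file whose top level is a list of documents; the function name split_json_file, its parameters, and the chunk size are hypothetical choices for illustration, not part of spaCy or Prodigy.

```python
import json
from pathlib import Path


def split_json_file(src, out_dir, docs_per_file=1000):
    """Split a JSON file holding a top-level list into smaller JSON files.

    Returns the paths of the files written, in order.
    """
    src = Path(src)
    out_dir = Path(out_dir)
    docs = json.loads(src.read_text(encoding="utf8"))
    if not isinstance(docs, list):
        raise ValueError("expected a top-level JSON list of documents")
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for start in range(0, len(docs), docs_per_file):
        # Number the parts so they sort in their original order.
        part = out_dir / f"{src.stem}-{start // docs_per_file:04d}.json"
        chunk = docs[start:start + docs_per_file]
        part.write_text(json.dumps(chunk, ensure_ascii=False), encoding="utf8")
        written.append(part)
    return written
```

Calling something like split_json_file("train.json", "train_parts") (both names are placeholders) would leave a directory of smaller JSON files. Each of those can then be run through spacy convert, and the directory holding the resulting .spacy files can be given to spacy train in place of the single file path.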