spaCy 3 nightly: How to further train an existing model

kamiwa · December 20, 2020, 1:27pm

Hi,

I've successfully trained a German BERT-based NER model using spaCy 3's spacy train with the new configuration system. YEAH!

Prior to training, I used Prodigy's data-to-spacy recipe to export my existing NER training data to JSON and then converted the JSON data to the new spaCy binary format using spacy convert.

Now I've done the same with my existing text-categorization data.

From the NER experiment I know how to specify the converted textcat data as train and eval data in the spacy train command. But what I can't figure out is how to specify the NER model from step 1 as the model to be trained with the textcat train-data.

Any hints appreciated. TIA.

ines · December 20, 2020, 11:15pm

Hi! Glad to hear your training has been successful.

spaCy v3 moves away from the idea of a "base model", which could lead to very subtle, unintuitive behaviours and made it harder to update components selectively (like, keep the tagger and parser, update the existing NER and add a new textcat).

Instead, you can now explicitly choose which components you want to keep from an existing trained pipeline, which components to update from your data, and which components to ignore. This lets you source components from different pipelines and makes it easier to only update some components (all while training from the CLI and a config). You can read more about this here: https://nightly.spacy.io/usage/training#config-components

So in your case, you could create a config with two components: ner and textcat. The ner component can then define source = "/your/bert/ner/model" so it's loaded from your existing model. If you add it to frozen_components, it won't be updated during training, so your data will only be used for the new text classifier. Here's an example:

[nlp]
lang = "de"
pipeline = ["ner", "textcat"]
# etc.

[training]
frozen_components = ["ner"]
# etc.

[components]

[components.ner]
source = "/your/bert/ner/model"

[components.textcat]
factory = "textcat"
# etc.
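
For readers who want to see how the three settings above relate, here is a minimal sketch in plain Python. It is not spaCy's own config loader or validation: it only assumes the sections and keys shown in the example, parses them with the standard library, and reports whether each pipeline component is sourced or new, and frozen or trained. The function name describe_components is made up for this illustration.

```python
import configparser
import json

# The example config from the post, with the "# etc." placeholders left out.
CONFIG_TEXT = """
[nlp]
lang = "de"
pipeline = ["ner", "textcat"]

[training]
frozen_components = ["ner"]

[components]

[components.ner]
source = "/your/bert/ner/model"

[components.textcat]
factory = "textcat"
"""


def describe_components(config_text):
    """Return {component name: role} for each component in the pipeline.

    Hypothetical helper for illustration only; spaCy does its own parsing.
    """
    parser = configparser.ConfigParser()
    parser.read_string(config_text)
    # The values are JSON-compatible, so json.loads turns them into lists.
    pipeline = json.loads(parser["nlp"]["pipeline"])
    frozen = json.loads(parser["training"]["frozen_components"])
    roles = {}
    for name in pipeline:
        section = parser[f"components.{name}"]
        origin = "sourced" if "source" in section else "new"
        state = "frozen" if name in frozen else "trained"
        roles[name] = f"{origin}, {state}"
    return roles


print(describe_components(CONFIG_TEXT))
# {'ner': 'sourced, frozen', 'textcat': 'new, trained'}
```

Read this way, the config says: ner is copied from the existing pipeline and left untouched, while textcat is created fresh and is the only component that learns from the training data.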

kamiwa · December 20, 2020, 11:31pm

Hi Ines,
thank you so much for the fast reply.
I’ll give it a try and let you know how it went.

Happy holidays, and stay safe!

kamiwa · December 20, 2020, 11:46pm

Sorry Ines, forgot to ask:

When I converted my previous train-data to the new spaCy binary format, I received the following warning:

UserWarning: [W027] Found a large training file of nnnnnnn bytes.
Note that it may be more efficient to split your training data into multiple smaller JSON files instead.

If I split the JSON into several files, can I still convert the JSON files into a single spacy format file? If yes, how would I do that?

And if I can't, can I specify more than one train file on the command line?

Or would I use your instructions from your above post and do several consecutive trainings?

ines · December 21, 2020, 10:16pm

In this case, you would have multiple .spacy files. That's the idea and what the warning suggests you should do. And yes, instead of a single .spacy file, you can also provide the path to a directory of files.
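
As a rough sketch of the splitting step, the following standard-library Python writes one large JSON file out as several smaller ones. It assumes the exported JSON file holds a top-level list of documents. The function name split_json_file and the docs_per_file parameter are made up for this illustration and are not part of spaCy or Prodigy.

```python
import json
from pathlib import Path


def split_json_file(src, out_dir, docs_per_file=1000):
    """Split a JSON file holding a top-level list into smaller JSON files.

    Hypothetical helper: assumes the file content is a list of documents.
    Returns the paths of the files written, in order.
    """
    src = Path(src)
    docs = json.loads(src.read_text(encoding="utf8"))
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    paths = []
    for start in range(0, len(docs), docs_per_file):
        # Number the parts so they sort in their original order.
        part = out_dir / f"{src.stem}-{start // docs_per_file:04d}.json"
        chunk = docs[start:start + docs_per_file]
        part.write_text(json.dumps(chunk), encoding="utf8")
        paths.append(part)
    return paths
```

Each part can then be run through spacy convert, and the directory holding the resulting .spacy files can be given as the training path.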

shahinshirazi · February 3, 2023, 1:57pm

Just FYI, the link "spaCy 3 nightly: How to further train an existing model - #4 by kamiwa" is not working. But I think I can follow your note to update the model.
