I've successfully trained a German BERT-based NER model using spaCy 3's `spacy train` with the new configuration system. YEAH!
Prior to training, I used Prodigy's `data-to-spacy` recipe to export my existing NER training data to JSON, and then converted the JSON data to the new spaCy binary format using `spacy convert`.
Now I've done the same with my existing text-categorization data.
From the NER experiment I know how to specify the converted textcat data as train and eval data in the `spacy train` command. But what I can't figure out is how to specify the NER model from step 1 as the model to be trained with the textcat training data.
Any hints appreciated. TIA.
Hi! Glad to hear your training has been successful.
spaCy v3 moves away from the idea of a "base model", which could lead to very subtle, unintuitive behaviours and made it harder to update components selectively (e.g. keep the tagger and parser, update the existing NER, and add a new textcat).
Instead, you can now explicitly choose which components you want to keep from an existing trained pipeline, which components to update from your data, and which components to ignore. This lets you source components from different pipelines and makes it easier to only update some components (all while training from the CLI and a config). You can read more about this here: https://nightly.spacy.io/usage/training#config-components
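To illustrate the sourcing idea programmatically, here's a minimal sketch. It uses blank pipelines as stand-ins for a real trained model (the path in the comment is a placeholder); note that freezing is handled via `frozen_components` in the training config, so the sketch only shows sourcing:

```python
import spacy

# Stand-in for your trained pipeline; in practice you would use
# spacy.load("/your/bert/ner/model") instead (path is a placeholder).
source_nlp = spacy.blank("de")
source_nlp.add_pipe("ner")

# New pipeline: copy the existing ner component, add a fresh textcat.
nlp = spacy.blank("de")
nlp.add_pipe("ner", source=source_nlp)  # sourced from the other pipeline
nlp.add_pipe("textcat")                 # brand-new component to train

print(nlp.pipe_names)  # ['ner', 'textcat']
```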
So in your case, you could create a config with two components: `ner` and `textcat`. The `ner` component can then define `source = "/your/bert/ner/model"` so it's loaded from your existing model. If you add it to `frozen_components`, it won't be updated during training – so your data will only be used for the new text classifier. Here's an example:

```ini
[nlp]
lang = "de"
pipeline = ["ner", "textcat"]

[training]
frozen_components = ["ner"]

[components.ner]
source = "/your/bert/ner/model"

[components.textcat]
factory = "textcat"
```
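If you want to sanity-check such a fragment before training, you can parse it with thinc's `Config` class (which spaCy uses under the hood). A small sketch – the model path is still a placeholder, and a real training config would need more sections filled in:

```python
from thinc.api import Config

# Relevant fragment only; "/your/bert/ner/model" is a placeholder path.
cfg_str = """
[nlp]
lang = "de"
pipeline = ["ner", "textcat"]

[components]

[components.ner]
source = "/your/bert/ner/model"

[components.textcat]
factory = "textcat"

[training]
frozen_components = ["ner"]
"""

cfg = Config().from_str(cfg_str)
print(cfg["training"]["frozen_components"])  # ['ner']
```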
Thank you so much for the fast reply.
I’ll give it a try and let you know how it went.
Happy holidays, and stay safe!
Sorry Ines, forgot to ask:
When I converted my previous train-data to the new spaCy binary format, I received the following warning:
```
UserWarning: [W027] Found a large training file of nnnnnnn bytes.
Note that it may be more efficient to split your training data into multiple smaller JSON files instead.
```
If I split the JSON into several files, can I still convert them into a single `.spacy` file? If yes, how would I do that?
And if I can't, can I specify more than one train file on the command line?
Or would I use your instructions from your above post and do several consecutive trainings?
In this case, you would have multiple `.spacy` files – that's the idea, and what the warning suggests you should do. And yes, instead of a single `.spacy` file, you can also provide the path to a directory of files.
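For example, splitting converted docs into several `.spacy` files in one directory can be sketched with `DocBin` like this (the docs, chunk size, and `corpus/train` path are all placeholders for illustration):

```python
from pathlib import Path

import spacy
from spacy.tokens import DocBin

nlp = spacy.blank("de")
# Placeholder docs standing in for your converted training examples.
docs = [nlp.make_doc(f"Beispielsatz Nummer {i}") for i in range(100)]

out_dir = Path("corpus/train")  # placeholder directory
out_dir.mkdir(parents=True, exist_ok=True)

# Write the docs out in chunks, one .spacy file per chunk.
chunk_size = 25
for i in range(0, len(docs), chunk_size):
    db = DocBin(docs=docs[i:i + chunk_size])
    db.to_disk(out_dir / f"train_{i // chunk_size}.spacy")
```

In the config, `train` under `[paths]` can then point at the directory (`train = "corpus/train"`) instead of a single file.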