Hi!
`prodigy train` has been updated significantly in v1.11 to make use of the new and more powerful `spacy train` command that was released with spaCy v3. This new training command generates a config file that contains all the settings for training. With Prodigy v1.11, you have two main options to use this training command:
- You call `prodigy data_to_spacy` to convert the Prodigy datasets into a format suitable for training with spaCy. This command will generate an output directory with all the relevant data files AND a default configuration file. You can then edit that configuration file as you see fit.
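For example, assuming your annotations live in an NER dataset called `my_ner_data` (a hypothetical name, substitute your own), the call could look something like this:

```
python -m prodigy data_to_spacy ./output --ner my_ner_data
```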
You can find some more information on how the configuration file is structured in the spaCy docs; more specifically, there is a section on the `training` part of it. There, you'll find a `batcher` setting that typically looks something like this:
```
[training.batcher]
@batchers = "spacy.batch_by_words.v1"
discard_oversize = false
tolerance = 0.2
get_length = null

[training.batcher.size]
@schedules = "compounding.v1"
start = 100
stop = 1000
compound = 1.001
t = 0.0
```
What this means is that by default, a compounding batch size is used: the batch size starts at `start`, gets multiplied by `compound` after each batch, and is capped at `stop`. You can edit the `start` and `stop` values, or any other setting, according to your use case.
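For example, to begin with larger batches and cap them sooner, you could edit the schedule block like this (the values here are just for illustration):

```
[training.batcher.size]
@schedules = "compounding.v1"
start = 200
stop = 500
compound = 1.001
t = 0.0
```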
Once you're done editing the config, you can call `spacy train` with it:
```
python -m spacy train output/config.cfg --paths.train output/train.spacy --paths.dev output/dev.spacy
```
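If you also want the trained pipeline saved to disk, you can add spaCy's `--output` flag (the directory name here is just an example):

```
python -m spacy train output/config.cfg --output ./trained_model --paths.train output/train.spacy --paths.dev output/dev.spacy
```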
- As a second option, you can call `prodigy train` and the training will start immediately. Behind the scenes, there will still be a config file that gets generated with default values. To overwrite those values, you can pass them on the command line directly, like so:
```
prodigy train output ... --training.batcher.size.start=35 --training.batcher.size.stop=50
```
Once you start overwriting the config like this though, I'd personally advise working with `data_to_spacy` instead and editing the config file yourself. If you want a steady batch size, you could for instance remove the whole `compounding` block and just make it something like this:
```
[training.batcher]
@batchers = "spacy.batch_by_words.v1"
discard_oversize = false
tolerance = 0.2
get_length = null
size = 50
```
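And if you'd rather batch by number of examples instead of number of words, spaCy also ships a `spacy.batch_by_sequence.v1` batcher. A minimal sketch of that block could look like this:

```
[training.batcher]
@batchers = "spacy.batch_by_sequence.v1"
size = 50
get_length = null
```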
Hope that clarifies things!