Flag --batch-size not recognized by prodigy train

I am using the nightly version 1.11.0a8, and when I run prodigy train with the --batch-size flag, an exception is raised: "no such option". The current documentation lists many flags for training, e.g. --n-iter, --dropout and --batch-size.

Where does prodigy save a training config for each run? Does it use a default stored somewhere?

How can we change the default batch-size with the new version?

Thanks.

Hi!

prodigy train has been updated significantly in v1.11 to make use of the new and more powerful spacy train command that was released with spaCy v3. This new training command is driven by a config file that contains all the settings for training.

With Prodigy v1.11, you have two main options to use this training command:

  1. You call prodigy data-to-spacy to convert the Prodigy datasets into a format suitable for training with spaCy. This command will generate an output directory with all the relevant data files AND a default configuration file. You can then edit that configuration file as you see fit.

You can find some more information on how the configuration file is structured in the spaCy docs; more specifically, there is a section on the training part of it. There, you'll find a batcher setting that typically looks something like this:

[training.batcher]
@batchers = "spacy.batch_by_words.v1"
discard_oversize = false
tolerance = 0.2
get_length = null

[training.batcher.size]
@schedules = "compounding.v1"
start = 100
stop = 1000
compound = 1.001
t = 0.0

What this means is that, by default, a compounding batch size is used, and you can edit the start and stop values, or any other setting, according to your use case.
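If it helps to build intuition, here is a minimal pure-Python sketch of how such a compounding schedule behaves (a simplification for illustration, not thinc's actual implementation):

import itertools

def compounding(start: float, stop: float, compound: float):
    # Sketch of a compounding schedule: yield start, multiply by
    # compound after every step, and cap the result at stop.
    size = float(start)
    while True:
        yield min(size, stop)
        size *= compound

sizes = list(itertools.islice(compounding(100, 1000, 1.001), 3000))
print(round(sizes[0]), round(sizes[1000]), round(sizes[2999]))
# -> 100 272 1000: the batch size creeps up from 100 and plateaus at 1000

So with the defaults above, the batch size grows very gradually over the course of training.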

Once you're done editing the config, you can call spacy train with it:

python -m spacy train output/config.cfg --paths.train output/train.spacy --paths.dev output/dev.spacy

  2. As a second option, you can call prodigy train and the training will start immediately. Behind the scenes, there will still be a config file that gets generated with default values. To overwrite those values, you can pass them on the command line directly, like so:
prodigy train output ... --training.batcher.size.start=35 --training.batcher.size.stop=50

Once you're starting to overwrite the config like this though, I'd personally advise working with data-to-spacy instead and editing the config file yourself. If you want a steady batch size, you could for instance remove the whole compounding block and just make it something like this:

[training.batcher]
@batchers = "spacy.batch_by_words.v1"
discard_oversize = false
tolerance = 0.2
get_length = null
size = 50

Hope that clarifies things 🙂


Hi @SofieVL

Thank you so much for clarifying the way cfg files are generated behind the scenes or can be overwritten.

I am relatively new to spaCy as well and it is nice to see that many workflows can be flexibly accommodated.

When do you recommend using data-to-spacy for generating a config vs. using the init-config command from spaCy v3? Do you recommend using the project.yml on top, as suggested in the tutorials?

As for the training flags for prodigy train, how does the batch size work for the NER model? The trainer config you showed suggests the batching is done by words. If my input text is a paragraph, how does that work? Could the model get only a portion of the paragraph? Is there a way to pass it one paragraph per batch?

Finally, how does the --ner-missing flag translate if I were to run data-to-spacy with that flag? Which part of the config does it influence?

I understand this forum wants to focus on Prodigy. However, I am still exploring and wish to make a smooth transition to adopting spaCy projects and configuration. Thanks a lot for your help.

If you have more general spaCy questions, you're always welcome at the spaCy discussion forum, but a lot of your questions are on the intersection between spaCy and Prodigy, so I'm happy to answer them here.

When do you recommend using data-to-spacy for generating a config vs. using the init-config command from spaCy v3?

In fact, data-to-spacy uses spaCy's init-config under the hood, so it kind of boils down to the same thing. But some parameters might be easier to set through init-config directly. You can also use both: first data-to-spacy to generate your .spacy files, and then init-config to create a config that is tuned towards your use case.

Do you recommend using the project.yml on top, as suggested in the tutorials?

The project.yml file is needed when you want to make a "spaCy project". Conceptually, a spaCy project is a directory of scripts/code, and the project.yml defines all the different steps in your workflow. It's really convenient if you have a multi-step machine learning project where you want to easily reproduce results, re-run certain steps while freezing others, etc. I would definitely recommend trying it out, for instance with one of the example projects in the explosion/projects repo: https://github.com/explosion/projects

As for the training flags for prodigy train, how does the batch size work for the NER model? The trainer config you showed suggests the batching is done by words. If my input text is a paragraph, how does that work? Could the model get only a portion of the paragraph? Is there a way to pass it one paragraph per batch?

Perhaps the documentation isn't entirely clear on this point. Batching by words means that each batch contains roughly that many words (within a "tolerance" margin), but documents are never split up; they are kept whole. If a document exceeds the batch size, it is either returned in a batch of its own or discarded completely (if discard_oversize is set to true).
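To make that concrete, here is a rough Python sketch of the grouping logic (a simplification for illustration, not spaCy's actual implementation, which operates on Doc objects and real token counts):

def batch_by_words(docs, size, tolerance=0.2, discard_oversize=False):
    # Documents are never split: a batch is full once it holds roughly
    # `size` words, and an oversized document either becomes a batch of
    # its own or is dropped entirely when discard_oversize is True.
    limit = size * (1 + tolerance)
    batch, n_words = [], 0
    for doc in docs:
        length = len(doc.split())  # stand-in for spaCy's token count
        if length > limit:
            if not discard_oversize:
                yield [doc]  # an oversized paragraph gets its own batch
            continue
        if batch and n_words + length > limit:
            yield batch
            batch, n_words = [], 0
        batch.append(doc)
        n_words += length
    if batch:
        yield batch

So a paragraph is always passed to the model in one piece; the size setting only controls how many paragraphs end up in the same batch.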

Finally, how does the --ner-missing flag translate if I were to run data-to-spacy with that flag? Which part of the config does it influence?

The ner-missing flag isn't part of the spaCy config. It's used in Prodigy to transform the Prodigy dataset to the spaCy format. It basically toggles whether a token that is not annotated should be treated as a missing/incomplete annotation (in spaCy, denoted with '-') or as an explicit annotation that "this is not an entity" (in spaCy, denoted with 'O'). Both tags are used in the internal "BILUO" scheme that spaCy uses.

If you had the sentence "I like London", the BILUO notation would be ["O", "O", "U-LOC"] if you're certain that the first two tokens are not part of an entity. If you're not certain of that, you'd use ["-", "-", "U-LOC"] instead, to signal to the NER model that the annotation is incomplete/unknown for the first two tokens.
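If you want to see both variants in code, spaCy's offsets_to_biluo_tags helper takes a missing argument that controls which tag unannotated tokens get, which is exactly the distinction described above:

import spacy
from spacy.training import offsets_to_biluo_tags

nlp = spacy.blank("en")
doc = nlp("I like London")
entities = [(7, 13, "LOC")]  # character offsets of "London"

# Unannotated tokens treated as "definitely not an entity":
print(offsets_to_biluo_tags(doc, entities, missing="O"))  # ['O', 'O', 'U-LOC']
# Unannotated tokens treated as missing/unknown annotation:
print(offsets_to_biluo_tags(doc, entities, missing="-"))  # ['-', '-', 'U-LOC']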