If you have more general spaCy questions, you're always welcome at the spaCy discussion forum, but a lot of your questions are on the intersection between spaCy and Prodigy, so I'm happy to answer them here.
When do you recommend using data-to-spacy to generate the config vs. using the init-config command from spaCy v3?
In fact, data-to-spacy uses spaCy's init-config under the hood, so it kind of boils down to the same thing. But some parameters might be easier to set through init-config directly. You can use both as well: first data-to-spacy to generate your .spacy files, and then init-config to create a config that is tuned towards your use-case.
Do you recommend using the project.yml on top as suggested in tutorials?
The project.yml file is needed when you want to make a "spaCy project". Conceptually, a spaCy project is a directory of scripts/code, and the project.yml defines all the different steps in your workflow. It's really convenient if you have a multi-step Machine Learning project where you want to easily reproduce results, re-run certain steps while freezing others, etc. I would definitely recommend trying it out, for instance with one of the example projects in the repo here: GitHub - explosion/projects: 🪐 End-to-end NLP workflows from prototype to production
As for the training flags for prodigy train, how does the batch size work for an NER model? The trainer config you showed suggests the batching is done by words. If my input text is a paragraph, how does that work? Could the model get only a portion of the paragraph? Is there a way to pass it one batch of one paragraph at a time?
Perhaps the documentation isn't entirely clear on this point. Batching by words means that each batch has roughly that many words (with a "tolerance" margin), but documents are never split up: each document is kept whole. If a document exceeds the batch size, it is either returned in a batch of its own, or discarded completely (if discard_oversize is set to True).
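To make the batching behaviour concrete, here is a minimal pure-Python sketch. It is a simplified illustration, not spaCy's actual batcher: the function name batch_by_words, the way the tolerance is applied, and the order in which batches are emitted are all assumptions made for the example. What it does show is the key point above: documents are never split, and an oversized document either gets a batch of its own or is dropped.

```python
from typing import Iterable, Iterator, List

Doc = List[str]  # a "document" here is just a list of word strings


def batch_by_words(
    docs: Iterable[Doc],
    size: int,
    tolerance: float = 0.2,
    discard_oversize: bool = False,
) -> Iterator[List[Doc]]:
    """Group whole documents into batches of roughly `size` words.

    Hypothetical helper for illustration only; not spaCy's implementation.
    """
    # A batch may exceed `size` by the tolerance margin, but no further.
    max_words = size + int(size * tolerance)
    batch: List[Doc] = []
    batch_words = 0
    for doc in docs:
        n_words = len(doc)
        if n_words > max_words:
            # Oversized document: never split it. Either emit it as a
            # batch of its own, or drop it entirely.
            if not discard_oversize:
                yield [doc]
            continue
        if batch and batch_words + n_words > max_words:
            # Adding this document would overflow: close the current batch.
            yield batch
            batch, batch_words = [], 0
        batch.append(doc)
        batch_words += n_words
    if batch:
        yield batch


# Four "paragraphs" of 5, 6, 30 and 4 words; target batch size of 10 words.
docs = [["w"] * 5, ["w"] * 6, ["w"] * 30, ["w"] * 4]
kept = [[len(d) for d in b] for b in batch_by_words(docs, size=10)]
dropped = [
    [len(d) for d in b]
    for b in batch_by_words(docs, size=10, discard_oversize=True)
]
print(kept)     # word counts per batch, oversized paragraph kept on its own
print(dropped)  # word counts per batch, oversized paragraph discarded
```

In both runs every paragraph that survives appears in exactly one batch and is passed along in full; the only difference is whether the 30-word paragraph is present at all.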
Finally, how does the --ner-missing flag translate if I were to run data-to-spacy with that flag? Which part of the config does it influence?
The ner-missing flag isn't part of the spaCy config. It's used in Prodigy to transform the Prodigy dataset to the spaCy format. It basically toggles whether a token that is not annotated should be viewed as missing/incomplete annotation (in spaCy, that's denoted with '-') or specifically as an annotation for "this is not an entity" (in spaCy, that's denoted with 'O'). Both these tags are used in the internal "BILUO" scheme that spaCy uses.
If you have a sentence "I like London", the BILUO notation would be ["O", "O", "U-LOC"] if you're certain that the first two tokens are not part of an entity. If you're not certain of that, you'd use ["-", "-", "U-LOC"] instead to signal to the NER model that the annotation is incomplete/unknown for the first two tokens.
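The difference between the two readings can be sketched in a few lines of plain Python. This is only an illustration of the tagging logic described above, not Prodigy's actual conversion code: the helper biluo_tags and its ner_missing parameter are hypothetical names, and entities are given as token-index spans for simplicity.

```python
from typing import List, Tuple


def biluo_tags(
    n_tokens: int,
    entities: List[Tuple[int, int, str]],
    ner_missing: bool = False,
) -> List[str]:
    """Build BILUO tags for a tokenized sentence.

    `entities` holds (start_token, end_token_exclusive, label) spans.
    Hypothetical helper for illustration only.
    """
    # Unannotated tokens are either "unknown" ("-") or "not an entity" ("O").
    fill = "-" if ner_missing else "O"
    tags = [fill] * n_tokens
    for start, end, label in entities:
        if end - start == 1:
            tags[start] = f"U-{label}"  # single-token entity
        else:
            tags[start] = f"B-{label}"  # first token of the entity
            for i in range(start + 1, end - 1):
                tags[i] = f"I-{label}"  # tokens inside the entity
            tags[end - 1] = f"L-{label}"  # last token of the entity
    return tags


tokens = ["I", "like", "London"]
entities = [(2, 3, "LOC")]  # "London" is token 2

complete = biluo_tags(len(tokens), entities)
incomplete = biluo_tags(len(tokens), entities, ner_missing=True)
print(complete)    # ['O', 'O', 'U-LOC']
print(incomplete)  # ['-', '-', 'U-LOC']
```

The entity tag itself is identical in both cases; only the treatment of the unannotated tokens changes.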