Don't understand the label files from data-to-spacy

I have two datasets, e.g. `tags-a` and `tags-b`. Each has binary textcat data for label A and label B.

Then I run

```
prodigy data-to-spacy -tcm tags-a,tags-b spacy-data
```

which created the following files

```
|- labels
   |- textcat_multilabel.json
|- config.cfg
|- dev.spacy
|- train.spacy
```

Now what puzzles me is the content of `textcat_multilabel.json`:


I'd expect it to be


If I use `-tc` instead of `-tcm`, it also creates a `textcat.json` file with the expected content (note: in addition to, not instead of; it still creates the "wrong" `textcat_multilabel.json`).

Can someone explain that?

Bonus question

If I have a custom tokenizer added via callbacks, do I need to provide it via `--base-model` when using `data-to-spacy`?

Earlier today I used `prodigy train` with `--base-model` pointing to a config that had

```
@callbacks = "customize_tokenizer"
```

It trained fine, but when I used the model after training, the custom tokenizer wasn't applied. Isn't that a mistake? I shouldn't have to call `nlp.initialize()` on my trained `nlp`, right? I assume the same applies to `data-to-spacy`, so I'd love for someone to shed some light on this. Thank you.
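For reference, here's a minimal sketch of what a callback registered under that name can look like. The name matches the config line above; the special-case rule it adds is purely a hypothetical example of a tokenizer customisation:

```python
import spacy
from spacy.attrs import ORTH

# Registers the function name referenced in the config:
#   @callbacks = "customize_tokenizer"
@spacy.registry.callbacks("customize_tokenizer")
def make_customize_tokenizer():
    def customize_tokenizer(nlp):
        # Hypothetical customisation: split the made-up token
        # "foobarbaz" into three tokens.
        nlp.tokenizer.add_special_case(
            "foobarbaz", [{ORTH: "foo"}, {ORTH: "bar"}, {ORTH: "baz"}]
        )
    return customize_tokenizer
```

Note that a callback in `[initialize.before_init]` only runs when the pipeline is initialized (e.g. at the start of training), and the registering code needs to be importable whenever the config is resolved, e.g. via spaCy's `--code` flag.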

To answer my initial question: I was using a `--base-model` that only had label A. If I use the default `en` model, the labels are output as expected.

So now I just need to know whether I need to provide the custom tokenizer when using `data-to-spacy`.

The output of the labels files corresponds to what's produced by spaCy's `init labels` command: the idea is to pre-generate the label set so your training can start faster, because spaCy doesn't need to loop over the data first to collect the labels. This can speed things up by a lot. And it makes sense to do it out of the box in `data-to-spacy`, since we already have the data in memory there anyway.

It's definitely suspicious, though, if the labels in the output don't correspond to the data or just reflect the base model's label set – we need to look into this, maybe the labels aren't added/generated correctly :thinking:

If you have customisations like that, it's probably better to just provide a `--config` that exactly reflects how your pipeline should be set up. Then Prodigy doesn't have to try to be clever and figure out what to port over from the base model vs. what should be modified, and you know that the config will always match what's generated at the end and what you intend to train from.
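As a sketch, the relevant part of such a config could pin the callback explicitly (this assumes the `customize_tokenizer` callback from your config is registered in code that's loaded at train time; the section names follow spaCy's config schema):

```ini
[initialize]

[initialize.before_init]
@callbacks = "customize_tokenizer"
```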