Don't understand the label files from data-to-spacy

I have two datasets, e.g. `tags-a` and `tags-b`. Each has binary textcat data for label A and label B.

Then I run

```
prodigy data-to-spacy -tcm tags-a,tags-b spacy-data
```

which created the following files

```
|- labels
   |- textcat_multilabel.json
|- config.cfg
|- dev.spacy
|- train.spacy
```

Now what puzzles me is the content of `textcat_multilabel.json`:


I'd expect it to be


If I use `-tc` instead of `-tcm`, it also creates a `textcat.json` file with the expected content (note: in addition to, not instead of; it still creates the "wrong" `textcat_multilabel.json`).

Can someone explain that?

Bonus question

If I have a custom tokenizer added via callbacks, do I need to provide it via `--base-model` when using `data-to-spacy`?

Earlier today I used `prodigy train` with `--base-model` pointing to a config that had

```
@callbacks = "customize_tokenizer"
```

It trained fine, but when I used the model after training, the custom tokenizer wasn't applied. Isn't that a mistake? I shouldn't have to call `nlp.initialize()` on my trained `nlp`, right? I assume the same applies to `data-to-spacy`, so I'd love for someone to shed some light on this. Thank you.
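For reference, here's a minimal sketch of what a callback registered under that name can look like. The name matches the config line above; the special-case rule it adds is purely a hypothetical example of a tokenizer customisation:

```python
import spacy
from spacy.attrs import ORTH

# Registers the function name referenced in the config:
#   @callbacks = "customize_tokenizer"
@spacy.registry.callbacks("customize_tokenizer")
def make_customize_tokenizer():
    def customize_tokenizer(nlp):
        # Hypothetical customisation: split the made-up token
        # "foobarbaz" into three tokens.
        nlp.tokenizer.add_special_case(
            "foobarbaz", [{ORTH: "foo"}, {ORTH: "bar"}, {ORTH: "baz"}]
        )
    return customize_tokenizer
```

Note that a callback in `[initialize.before_init]` only runs when the pipeline is initialized (e.g. at the start of training), and the registering code needs to be importable whenever the config is resolved, e.g. via spaCy's `--code` flag.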

To answer my initial question: I was using a `--base-model` that only had label A. If I use the default `en` model, the labels are output as expected.

So now I just need to know whether I need to provide the custom tokenizer when using `data-to-spacy`.

The output of the labels files corresponds to what's produced by spaCy's `init labels` command: the idea is to pre-generate the label set so your training can start faster, because spaCy doesn't need to loop over the data first to collect the labels. This can speed things up by a lot. And it makes sense to do it out of the box in `data-to-spacy`, since we already have the data in memory there anyway.

It's definitely suspicious, though, if the labels in the output don't correspond to the data or just reflect the base model's label set – we need to look into this, maybe the labels aren't added/generated correctly :thinking:

If you have customisations like that, it's probably better to just provide a `--config` that exactly reflects how your pipeline should be set up. Then Prodigy doesn't have to try to be clever and figure out what to port over from the base model vs. what should be modified, and you know that the config will always match what's generated at the end and what you intend to train from.
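As a sketch, the relevant part of such a config could pin the callback explicitly (this assumes the `customize_tokenizer` callback from your config is registered in code that's loaded at train time; the section names follow spaCy's config schema):

```ini
[initialize]

[initialize.before_init]
@callbacks = "customize_tokenizer"
```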