Don't understand the label files from data-to-spacy

I have two datasets, e.g. `tags-a` and `tags-b`. Each has binary textcat data for label A and label B.

Then I run:

```
prodigy data-to-spacy -tcm tags-a,tags-b spacy-data
```

which created the following files:

```
spacy-data
|- labels
   |- textcat_multilabel.json
|- config.cfg
|- dev.spacy
|- train.spacy
```

Now what puzzles me is the content of `textcat_multilabel.json`:

```json
[
  "A"
]
```

I'd expect it to be:

```json
[
  "A",
  "B"
]
```

If I use `-tc` instead of `-tcm`, it also creates a `textcat.json` file with the expected content. Note that this is created in addition: it still creates the "wrong" `textcat_multilabel.json`.

Can someone explain that?

Bonus question

If I have a custom tokenizer that's applied via callbacks, do I need to pass it via `--base-model` when using data-to-spacy?

Earlier today I used prodigy train with `--base-model` pointing to a config that had:

```
[initialize.before_init]
@callbacks = "customize_tokenizer"
```

It trained fine, but when I used the model after training, the custom tokenizer wasn't applied. Isn't that a mistake? I shouldn't have to call `nlp.initialize()` on my trained pipeline, right? The same applies to data-to-spacy, I assume. So I'd love for someone to shed some light on this, thank you.
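
For context, my callback is registered in Python roughly like this (a simplified sketch following the pattern from the spaCy docs; the special-case rule here is just a stand-in for my actual customisation):

```python
import spacy
from spacy.symbols import ORTH

@spacy.registry.callbacks("customize_tokenizer")
def make_customize_tokenizer():
    def customize_tokenizer(nlp):
        # Stand-in tweak: add a special-case rule to the existing
        # tokenizer so "gimme" is split into two tokens
        nlp.tokenizer.add_special_case("gimme", [{ORTH: "gim"}, {ORTH: "me"}])
    return customize_tokenizer
```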

To answer my initial question: I was using a `--base-model` that only had label A. If I use the default `en` model instead, the output is as expected.
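
In case it's useful, this is how I checked which labels the base model actually had (assuming the pipeline contains a `textcat_multilabel` component; the model path is a placeholder):

```python
import spacy

nlp = spacy.load("./my-base-model")  # placeholder path to the base model
# For my base model this only contained "A", which explains the labels file
print(nlp.get_pipe("textcat_multilabel").labels)
```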

So now I just need to know whether the custom tokenizer needs to be in place when using data-to-spacy?

The output of the labels JSON corresponds to what's produced by spaCy's `init labels` command (see Command Line Interface · spaCy API Documentation). The idea here is to pre-generate the labels so your training can start quicker, because spaCy doesn't need to loop over the data to generate the label set. This can speed things up by a lot. And it makes sense to just do it out-of-the-box in data-to-spacy, since we already have the data in memory here anyway.
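
If you want to reproduce that step manually, it looks roughly like this in spaCy, using the files from the data-to-spacy output above (paths are just illustrative):

```
python -m spacy init labels spacy-data/config.cfg spacy-data/labels --paths.train spacy-data/train.spacy --paths.dev spacy-data/dev.spacy
```

The config can then read those labels back in via the `spacy.read_labels.v1` reader in its `[initialize.components]` block, instead of recomputing them from the corpus on each training run.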

It's definitely suspicious, though, if the labels in the output don't correspond to the data or just reflect the base model's label set. We need to look into this; maybe the labels aren't added/generated correctly :thinking:

If you have customisations like that, it's probably better to just provide a --config that exactly reflects how your pipeline should be set up. Then Prodigy doesn't have to try and be clever and figure out what to port over from the base model vs. what should be modified, and you know that the config will always match what's generated at the end and what you intend to train from.
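
For the tokenizer example above, that could be as simple as a config that carries the callback itself (a sketch; it assumes the `customize_tokenizer` function is registered and importable whenever the config is loaded):

```
[initialize.before_init]
@callbacks = "customize_tokenizer"
```

You'd then pass that file to data-to-spacy via `--config` instead of `--base-model`, so the generated config contains exactly the setup you intend to train from.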