When Example objects are not created - E930

Hi, I am planning to do some text classification. I have created two datasets, and run the following command to train a model:

python3 -m prodigy train baseline_model --textcat games_train_shuffled,eval:games_dev --base-model "nb_core_news_sm" 

The first strange thing to happen is that I get the following message in the terminal:

Components: textcat
Merging training and evaluation data for 1 components
 - [textcat] Training: 1671 | Evaluation: 238 (from datasets)
Training: 0 | Evaluation: 0
Labels: textcat (0)

The datasets appear to be loaded with the correct number of examples, but then it says there are no training examples and no evaluation examples? And why no labels?
I then get an E930 error: Received invalid get_examples callback in TextCategorizer.initialize. Expected function that returns an iterable of Example objects but got:

I believe the files I load into the datasets are correct, since there are no error messages when I use "db-in".
The examples come from a JSONL file where each line has the form {"text": "...", "accept": ["..."]}
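To double-check the format, I parse each line like this (a minimal sketch; the file contents here are made up to match the shape of my data):

```python
import json

# A couple of lines in the same shape as my data (contents are made up)
lines = [
    '{"text": "first game review", "accept": ["POSITIVE"]}',
    '{"text": "second game review", "accept": ["NEGATIVE"]}',
]

for i, line in enumerate(lines, start=1):
    record = json.loads(line)
    assert isinstance(record["text"], str), f"line {i}: text is not a string"
    assert isinstance(record["accept"], list), f"line {i}: accept is not a list"

print("all lines parsed")
```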

Any suggestions as to why no Example objects are created?

hi @sofiejb!

Thanks for your question.

So this error typically comes up when you aren't passing data in the right format (even though you may think you are).

Can you share your data (or even a sample of it)? Since the problem likely lies somewhere in your data, it's hard for me to diagnose without seeing it.

Where did the data for games_train_shuffled and games_dev come from?

Were these annotations ever created in Prodigy? Were all of them imported (i.e., from some different source)? Or were some created in Prodigy and then merged with annotations from a different source?

Just because db-in didn't raise an error doesn't mean the data is fine. While we have tried to add validation there, there are always manipulations we never thought to check for.

Also, are you trying to run binary classification or multi-class (and if multi-class, are the categories mutually exclusive or not)?

Running --textcat assumes you're doing binary classification. Typically, Prodigy creates annotations like this with textcat.manual, in a slightly different format:

{"text":"some text", ... ,"answer":"reject","label":"LABEL1",...} # Negative example
{"text":"some different text", ... ,"answer":"accept","label":"LABEL1",...} # Positive example
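If your data is in the {"text": ..., "accept": [...]} shape, one way to expand it into the binary answer/label shape above is a small conversion script. This is only a sketch: the input keys are taken from your description, and the full label set is an assumption you'd need to fill in yourself.

```python
import json

LABELS = ["LABEL1", "LABEL2"]  # assumed full label set -- replace with yours

def to_binary(record):
    """Expand one {"text", "accept"} record into one binary example
    per label, marking accepted labels as "accept" and all others
    as "reject"."""
    accepted = set(record.get("accept", []))
    return [
        {
            "text": record["text"],
            "label": label,
            "answer": "accept" if label in accepted else "reject",
        }
        for label in LABELS
    ]

record = {"text": "some text", "accept": ["LABEL1"]}
examples = to_binary(record)
for ex in examples:
    print(json.dumps(ex))
```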

If you're certain you're handling your data correctly for binary vs. multi-class, then run db-out and look at the data. If you don't spot anything obvious, can you create a sample of your data -- say, the first 10 records -- reload it with db-in, and see if training runs?

If it does run on some of your data, that's important information: it tells you not all of your data has issues. The question then becomes: which records are causing problems, and what's different about them compared to the rest?

If it doesn't run for the first 10 records, you can try again with 2-3 random records, but I suspect it may not run either.
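To narrow down which records are problematic, you could loop over the file and flag anything that doesn't parse or is missing the expected fields. A rough sketch (the sample lines stand in for reading your actual file; adjust the required keys to whatever your data should contain):

```python
import json

# Stand-in for reading the file line by line; in practice you'd use
# something like: lines = open("games_train.jsonl", encoding="utf8")
lines = [
    '{"text": "fine record", "accept": ["LABEL1"]}',
    '{"text": "missing accept key"}',
    'not even valid json',
]

bad = []
for i, line in enumerate(lines, start=1):
    try:
        record = json.loads(line)
    except json.JSONDecodeError:
        bad.append((i, "invalid JSON"))
        continue
    for key in ("text", "accept"):
        if key not in record:
            bad.append((i, f"missing {key!r}"))

for lineno, reason in bad:
    print(f"line {lineno}: {reason}")
```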

A related alternative is to run data-to-spacy to export your data as .spacy binary files along with a default spaCy config. You can then try training with spacy train (the export should print instructions on how to run it). Alternatively, you may want to run spacy debug data on your config file.

I tried something similar yesterday, and if you get a ValueError: [E913] Corpus path can't be None. error, add the explicit file paths to your train.spacy and dev.spacy files in the [paths] section of your config file:

[paths]
train = path/to/train.spacy
dev = path/to/dev.spacy