Can't create new dataset with Prodigy (error: [x] Can't find recipe or command 'dataset'.)

I'm following Ines' "Training an Insult Classifier" video. I realize it is from 2017, so some of the commands may differ from the current version of Prodigy, which I am using.

In the beginning, around the 1:55 mark in the video, Ines runs the command prodigy dataset insult_seeds "collect seed terms for insult classifier". When I try to run the command as !python -m prodigy dataset insult_seeds "Collect seed terms for insult classifier", I get the error:

[x] Can't find recipe or command 'dataset'. Similar recipes: print-dataset

I'm not sure why it's not recognizing the command, as I don't believe the 'dataset' argument has changed for the current version of Prodigy. I am also using Windows, hence my usage of !python -m before each command.

Is there something I am doing wrong, or something I need to do differently because of the Windows machine?

Hi! The dataset command has been deprecated since Prodigy v1.10 and was removed in Prodigy v1.11. See here for details: https://prodi.gy/docs/recipes#deprecated

If a dataset isn't available yet, Prodigy will create it automatically, so there's not really a good reason you need to create it explicitly with dataset. So you can simply skip this part now :slightly_smiling_face:

Ahh, I see! Thanks so much Ines! I thought I saw somewhere in the Prodigy 101 docs that dataset was being used, so I didn't know it was deprecated. Thanks again for your help!

Hi Ines! I was able to do as you said and just run the next command, and Prodigy automatically created the dataset for me. When I try to run your next command to load the Reddit comments, I get an error saying it doesn't recognize the dataset I am trying to pass in. Not sure if I'm using a deprecated command again?

When I run these commands, it seems to work and I am able to save the annotation:

!python -m prodigy terms.teach insult_seeds en_core_web_lg --seeds insults.txt

then to check:

!python -m prodigy db-out insult_seeds

which outputs:

{"text":"dick","meta":{"score":0.8098309199},"_input_hash":-690856415,"_task_hash":-1277424855,"_session_id":null,"_view_id":"text","answer":"accept"}
{"text":"fuck","meta":{"score":0.8087529978},"_input_hash":-192289499,"_task_hash":1698676519,"_session_id":null,"_view_id":"text","answer":"reject"}
{"text":"pussy","meta":{"score":0.8068196275},"_input_hash":436531074,"_task_hash":1495846518,"_session_id":null,"_view_id":"text","answer":"accept"}
{"text":"fucking","meta":{"score":0.8057290128},"_input_hash":481507976,"_task_hash":-1327770953,"_session_id":null,"_view_id":"text","answer":"reject"}
{"text":"fucker","meta":{"score":0.8030490282},"_input_hash":-1821531370,"_task_hash":2079464020,"_session_id":null,"_view_id":"text","answer":"accept"}
{"text":"cock","meta":{"score":0.7993727761},"_input_hash":1199091593,"_task_hash":-1521700463,"_session_id":null,"_view_id":"text","answer":"reject"}
{"text":"whore","meta":{"score":0.7981011512},"_input_hash":1483481241,"_task_hash":179717541,"_session_id":null,"_view_id":"text","answer":"accept"}.............

But then when I try to run the command to use this dataset with the Reddit comments, this is what I get:

!python -m prodigy textcat.teach insults en_core_web_sm RC_2015-01.bz2 --loader reddit --label INSULT --patterns insult_seeds

output:

Using 1 label(s): INSULT
Traceback (most recent call last):
  File "C:\Users\t724614\Anaconda3\lib\runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "C:\Users\t724614\Anaconda3\lib\runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "C:\Users\t724614\Anaconda3\lib\site-packages\prodigy\__main__.py", line 54, in <module>
    controller = recipe(*args, use_plac=True)
  File "cython_src\prodigy\core.pyx", line 329, in prodigy.core.recipe.recipe_decorator.recipe_proxy
  File "C:\Users\t724614\Anaconda3\lib\site-packages\plac_core.py", line 367, in call
    cmd, result = parser.consume(arglist)
  File "C:\Users\t724614\Anaconda3\lib\site-packages\plac_core.py", line 232, in consume
    return cmd, self.func(*(args + varargs + extraopts), **kwargs)
  File "C:\Users\t724614\Anaconda3\lib\site-packages\prodigy\recipes\textcat.py", line 71, in teach
    matcher = matcher.from_disk(patterns)
  File "cython_src\prodigy\models\matcher.pyx", line 260, in prodigy.models.matcher.PatternMatcher.from_disk
ValueError: Can't find patterns file: insult_seeds

The dataset insult_seeds should be in the same folder I am working in (I assume), so I haven't added a ./. That being said, I tested it with a ./ as well and got the same error.

Older versions of Prodigy just let you use the seeds dataset directly in the textcat recipe, but newer versions now all standardise on a patterns file, which gives you more flexibility. This is expected to be a JSONL file on disk and you can create it with terms.to-patterns. There's also a section in the video description that explains the differences in newer versions of Prodigy (since the video is already a couple of years old):

Since this video was recorded, the textcat.teach command has changed in one detail: instead of a --seeds argument, you can now pass in --patterns, which lets you describe single words but also more complex combinations of tokens based on their attributes. To convert a seed dataset to patterns, you can use the terms.to-patterns recipe. For more details, see here: Seeds not recognized by textcat.teach
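For intuition, here's a rough Python sketch of what that conversion amounts to: keep only the accepted seed terms and turn each one into a case-insensitive token pattern in the JSONL patterns format. This is just an illustration, not Prodigy's actual implementation. The function name seeds_to_patterns, the whitespace tokenization and the sample terms are all assumptions made up for this example, and the built-in terms.to-patterns recipe is still the way to create the real file.

```python
import json


def seeds_to_patterns(seed_lines, label):
    """Convert exported seed-term annotations into match patterns.

    seed_lines: iterable of JSON strings, one annotation per line,
                in the same shape that db-out prints.
    label:      the label to attach to every pattern, e.g. "INSULT".
    """
    patterns = []
    for line in seed_lines:
        line = line.strip()
        if not line:
            continue
        task = json.loads(line)
        # Only terms that were accepted during annotation become patterns.
        if task.get("answer") != "accept":
            continue
        # Assumption: split on whitespace. A real tokenizer may split differently.
        tokens = [{"lower": word.lower()} for word in task["text"].split()]
        patterns.append({"label": label, "pattern": tokens})
    return patterns


# Hypothetical annotations, trimmed down to the two fields the sketch needs.
sample = [
    '{"text": "idiot", "answer": "accept"}',
    '{"text": "hello", "answer": "reject"}',
    '{"text": "Total moron", "answer": "accept"}',
]

for pattern in seeds_to_patterns(sample, "INSULT"):
    # One JSON object per line is what makes the output JSONL.
    print(json.dumps(pattern))
```

Each printed line is one pattern entry. The rejected term is skipped, and the two-word term becomes a pattern with two token descriptions.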

Hi Ines,

Sorry to bother you again! I've gotten as far as trying to train the model but I keep getting an error about the --output argument... I'm assuming this must also have to do with deprecation but I haven't been able to find that info online.

I am running:

!python -m prodigy train insults --output insults-model --eval-split 0.2

and get the error:

click.exceptions.NoSuchOption: no such option: --output

I also tried to run it using en_core_web_sm like you did in the "Training an Insult Classifier" video, but then I get the error:

[x] Invalid config override 'en_core_web_sm': name should start with --

Is using en_core_web_sm, en_core_web_lg, etc. necessary for this step? If so, how do I get it to work when running the command? Adding -- before the argument does not seem to work.

On another note, I got to this point following your advice in this thread by using two labels (insult and non_insult) because I wasn't able to get the textcat_multilabel argument working. Would you be able to help with this as well?

I run the command:

!python -m prodigy textcat.teach insults en_core_web_sm RC_2010-01.bz2 --loader reddit --textcat_multilabel INSULT --patterns insult_seeds.jsonl 

and get the error:

prodigy textcat.teach: error: unrecognized arguments: --textcat_multilabel INSULT

Maybe I am using textcat_multilabel incorrectly?

Ah, the train recipe is another workflow that's changed a bit in the recent update to make it work seamlessly with spaCy v3. It might be easiest to start from the docs here and it'll show you the available arguments: https://prodi.gy/docs/recipes#training

So in your case, your training command would look something like this:

!python -m prodigy train insults-model --textcat-multilabel insults --eval-split 0.2

The --textcat-multilabel option is used during training and specifies the component you want to train from the data (because in theory, the same data could be used to train different components, e.g. different types of text classifiers). So when you train, you can set it to your dataset name to tell Prodigy to "train a multilabel textcat component from this dataset".

Thank you for your reply! I tried the training command you suggested (along with a few variations) but I still seem to be getting the error:

click.exceptions.NoSuchOption: no such option: --textcat-multilabel

I tried the same command with just --textcat, but then it tells me I should use --textcat-multilabel.
Any idea what could be causing this?

Which version of Prodigy are you using and could you share the exact training command you ran? The --textcat-multilabel argument was added to the train command in v1.11.0, so if you're using that (which sounds like it based on your previous outputs), you should have it available :thinking:

I'm using Prodigy v1.11.0a8 with Python v3.8.8, so it should be available. The command I'm running is:

!python -m prodigy train insults-model --textcat-multilabel insults --eval-split 0.2

I checked the insults dataset stats as well and it shows that there are annotations that I accepted, rejected, and ignored.
I also tried the abbreviated -tcm option, after which I got this error:

[x] Invalid config override 'insults': name should start with --

so it seems to recognize -tcm from what I can understand, but then it doesn't recognize insults as a dataset.

Ahh, if you're on v1.11.0a8, that's definitely the problem: that's an alpha pre-release from the nightly program, so it won't have all the features of the stable release. So you should upgrade to the latest stable v1.11!
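If it's useful, here's a small Python sketch for checking whether the installed version is a pre-release. It reads the version through the standard library's importlib.metadata (available in Python 3.8+) and looks for a trailing marker like a8, b1, rc2 or .dev3. The helper names is_prerelease and installed_version are made up for this example, and the regular expression is a simplification of the full version scheme, so treat it as a quick sanity check only.

```python
import re
from importlib.metadata import PackageNotFoundError, version

# Simplified check: a version ending in aN, bN, rcN or .devN is a pre-release.
PRERELEASE_SUFFIX = re.compile(r"(?:a|b|rc|\.dev)\d+$")


def is_prerelease(version_string):
    """Return True if the version string ends in a pre-release marker."""
    return PRERELEASE_SUFFIX.search(version_string) is not None


def installed_version(package="prodigy"):
    """Return the installed version of a package, or None if it is missing."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


current = installed_version()
if current is None:
    print("prodigy is not installed in this environment")
elif is_prerelease(current):
    print(f"prodigy {current} looks like a pre-release, consider upgrading")
else:
    print(f"prodigy {current} looks like a stable release")
```

With the version from this thread, is_prerelease("1.11.0a8") is true, while a stable version string such as "1.11.0" is not flagged.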

Thank you so much Ines! This worked!
