We are trying to change our annotations in the Prodigy DB programmatically using Python. We set record['config']['choice_style'] = 'single' and changed record['accept'] to a list with a single value. We were able to successfully import the JSONL file into the database using db-in. But when we run prodigy train --textcat, we are getting all 0.00 for the score. Is there anything else we need to do to update the DB?
So it seems like you want to take annotations that were labeled as multi-label (i.e., non-mutually exclusive, using the multiple option in the choice UI) and refactor them as mutually exclusive annotations (which would usually use the single option in the choice UI).
You were right to make "accept" a list with a single value. But instead of updating record['config']['choice_style'] = 'single', can you add an "options" key with a list of dictionaries, one for each of the categories:
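Something like this is what I have in mind. It's just a sketch: the label names and the "text" value are placeholders, so swap in your own categories and data.

```python
# A single-label choice record, roughly as the choice UI saves it.
# Label names and the text are placeholders for your own categories/data.
record = {
    "text": "example text to classify",
    "options": [
        {"id": "LABEL_A", "text": "LABEL_A"},
        {"id": "LABEL_B", "text": "LABEL_B"},
        {"id": "LABEL_C", "text": "LABEL_C"},
    ],
    "accept": ["LABEL_A"],   # still a list, but with exactly one accepted label
    "answer": "accept",      # usually already present on saved annotations
}
```

Then you can db-in the updated JSONL like you did before.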
Gotcha. That makes sense, as you said you were still able to run prodigy train. If there were a problem with the formatting for training, you'd usually get an error message. Since prodigy train ran, it's probably not a formatting issue.
Per the spaCy GitHub discussions, "flat zeros" typically indicate a problem with the data, and this is exactly what spacy debug data was written for. Could you try using data-to-spacy to create a config file along with the spaCy datasets (train and dev sets) and see if you can run spacy debug data on the output?
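Roughly something like this. It's only a sketch: the output folder (./corpus), the dataset name (my_textcat_dataset), and the eval split are placeholders, and the exact flags can vary by version, so check prodigy data-to-spacy --help and spacy debug data --help.

```python
import subprocess

# Export the Prodigy dataset to a spaCy config plus train/dev corpora.
# "./corpus" and "my_textcat_dataset" are placeholder names.
subprocess.run(
    ["python", "-m", "prodigy", "data-to-spacy", "./corpus",
     "--textcat", "my_textcat_dataset", "--eval-split", "0.2"],
    check=True,
)

# Run spaCy's data diagnostics on the exported config and corpora.
subprocess.run(
    ["python", "-m", "spacy", "debug", "data", "./corpus/config.cfg",
     "--paths.train", "./corpus/train.spacy",
     "--paths.dev", "./corpus/dev.spacy"],
    check=True,
)
```

You can of course run the same two commands directly on the command line; the subprocess wrapper is just to keep everything in one Python script.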
FYI, data-to-spacy is also a great way to introduce yourself to training with spacy train directly instead of prodigy train, which is just a wrapper around spacy train. As you get to more advanced modeling, getting comfortable with the config.cfg file and spacy train will help you a ton with customizing your models.
Try these out and let me know what your output is with spacy debug data. Hopefully we can figure something out.
============================ Data file validation ============================
Pipeline can be initialized with data
Corpus is loadable
=============================== Training stats ===============================
Language: en
Training pipeline: textcat
712 training docs
177 evaluation docs
No overlap between training and evaluation data
Low number of examples to train a new pipeline (712)
============================== Vocab & Vectors ==============================
712 total word(s) in the data (712 unique)
No word vectors present in the package
================== Text Classification (Exclusive Classes) ==================
Text Classification: 8 label(s)
I'm at a bit of a loss for ideas. Is there any chance there was a problem with your programmatic approach? For example, were you able to verify your approach worked on, say, 10 random examples?
Also, just curious: if you take your original or your modified dataset, what happens when you run prodigy train --textcat-multilabel?
I don't think there is a problem with the programmatic approach, since it was simple to modify those properties. And wouldn't Prodigy/spaCy output an error if the data were not correct?
Running --textcat-multilabel gives non-zero scores, so I don't understand why --textcat doesn't.
spaCy would give errors if you tried to run --textcat but had multiple values in the list, or if the data weren't formatted correctly.
I was wondering if there was some coding mistake (e.g., in the if/then logic) where you accidentally labeled something incorrectly. But I'm still skeptical, because I would expect the scores to be non-zero even if there were an incorrect value.
You may want to open a discussion on the spaCy GitHub. That's where the spaCy core team can help. I bet you could get a fast response if you provide a reproducible example with 5-10 records that you have verified you coded correctly, along with the config.cfg (from data-to-spacy). Also, try to show the spacy debug data output, and include --verbose too, as it'll list all the labels so you can verify what the values are. I'll keep thinking in case there's anything else I'm missing.
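If it helps while putting that together, here's a quick sanity check I'd run over a handful of exported records. It's only a sketch: the file name is a placeholder, and the checks are just the things I'd eyeball first.

```python
import json

# "singlelabel_annotations.jsonl" is a placeholder for your exported JSONL file.
with open("singlelabel_annotations.jsonl", encoding="utf8") as f:
    records = [json.loads(line) for line in f]

for i, record in enumerate(records[:10]):
    option_ids = {opt["id"] for opt in record.get("options", [])}
    accepted = record.get("accept", [])
    # The text should be a non-empty string.
    assert isinstance(record.get("text"), str) and record["text"].strip(), f"{i}: bad text"
    # Exactly one accepted label, and it should be one of the options.
    assert len(accepted) == 1, f"{i}: expected one accepted label, got {accepted}"
    assert accepted[0] in option_ids, f"{i}: {accepted[0]} not in {option_ids}"
```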
Found the issue with the --verbose option. It was the 'text' field/property. Since we are using a custom recipe, it wasn't set correctly; it was set to the name of the audio file instead. That's why we weren't getting any good results for weeks. Thank you very much for suggesting the --verbose option on spacy debug data.
We tried --textcat-multilabel (original datasets) and were getting 70% in the score column. But with --textcat (updated datasets), we were getting very low results, lower than 50%. I thought that single-label text classification would perform better.
Awesome! I had forgotten about that option too. I saw it in the docs and thought it might help (so I decided to mention it just in case).
Yes! They can. If you want to test, try en_core_web_lg first. You'll need the vectors, which are included in the md and lg models. You may not see a lot of improvement, but it likely won't add much in compute speed or memory either.
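For example, something like this; it's a sketch, the dataset and output names are placeholders, and I believe the flag is --base-model in recent versions, but double-check prodigy train --help.

```python
import subprocess

# Train on top of a pretrained pipeline so the textcat component can use
# its static word vectors. "./output" and "my_textcat_dataset" are placeholders.
subprocess.run(
    ["python", "-m", "prodigy", "train", "./output",
     "--textcat", "my_textcat_dataset",
     "--base-model", "en_core_web_lg"],
    check=True,
)
```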
There's sometimes a tendency to immediately go to transformers (en_core_web_trf), but they come with challenges (speed, memory, handling GPUs). The speed and simplicity of the smaller spaCy models early on can help you figure out problems with your annotation scheme, which can sometimes improve your model more than architecture changes (like vectors) or hyperparameters. In a 2018 talk, Matt called it the foundation of the "ML Hierarchy of Needs": essentially, "categories that will be easy to annotate consistently, and easy for the model to learn."
Once you get promising results with your annotation schemes and performance, then you can test en_core_web_trf. You could also experiment with different textcat architectures.
Here's a related discussion (it was on ner, but the same idea of the speed/accuracy trade-off for base models applies):
Last idea:
Also, I would recommend trying out the textcat.correct recipe. Don't worry so much about annotating more; it's about getting a feel for how your model performs and where its blind spots are. Even better, correct any mistakes it's making and retrain.
If your current annotations are in textcat_data and your model is my_textcat_model, you can load that dataset as your source by prefixing it with dataset:
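Something like this; it's a sketch, and the new dataset name and labels are placeholders (check prodigy textcat.correct --help for the exact arguments in your version).

```python
import subprocess

# Stream the existing annotations back in as the source via the dataset: prefix
# and save the corrected annotations to a new dataset ("textcat_corrected" is a
# placeholder). If you trained with prodigy train ./output, the pipeline to load
# is usually ./output/model-best.
subprocess.run(
    ["python", "-m", "prodigy", "textcat.correct", "textcat_corrected",
     "my_textcat_model", "dataset:textcat_data",
     "--label", "LABEL_A,LABEL_B,LABEL_C"],
    check=True,
)
```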