Awesome! I had forgotten about that option too. I saw it in the docs and thought it might help (so I decided to mention it just in case).
Yes! They can. If you want to test, try `en_core_web_lg` first. You'll need the vectors, which are in the `md` and `lg` models. You may not see a big improvement, but it also shouldn't add much in compute time or memory.
There's sometimes a tendency to jump straight to transformers (`en_core_web_trf`), but they come with challenges (speed, memory, handling the GPU). The speed and simplicity of the spaCy models early on can help you find problems in your annotation scheme, which can sometimes improve your model more than architecture choices (like vectors) or hyperparameters. In a 2018 talk, Matt called it the foundation of the "ML Hierarchy of Needs": essentially, "categories that will be easy to annotate consistently, and easy for the model to learn."
Once you get promising results with your annotation scheme and performance, then you can test `en_core_web_trf`. You could also experiment with different `textcat` architectures.
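If you do want to experiment with architectures, one route (a sketch, assuming spaCy v3 configs; the exact architecture each preset picks can vary by version) is to generate a starter config and pass it to training. `--optimize efficiency` typically gives you the fast bag-of-words text classifier, while `--optimize accuracy` gives the ensemble that uses vectors:

```
# Generate a starter config for an English textcat pipeline
python -m spacy init config config.cfg --lang en --pipeline textcat --optimize accuracy

# Train with that config (config.cfg and the output path are just example names)
python -m prodigy train ./textcat_output --textcat textcat_data --config config.cfg
```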
Here's a related discussion (it was on `ner`, but the same speed/accuracy trade-off for base models applies):
Last idea:
Also, I would recommend using the `textcat.correct` recipe. Don't worry so much about annotating a large volume; focus on getting a feel for how your model performs and where its blind spots are. Even better, correct any mistakes it's making and retrain.
If your current annotations are in `textcat_data` and your model is `my_textcat_model`, you can load that dataset as your source by prefixing `dataset:`:

```
python -m prodigy textcat.correct correct_data my_textcat_model dataset:textcat_data ...
```
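Once you've corrected a batch, you can retrain on both datasets; a quick sketch (Prodigy's `train` recipe accepts comma-separated dataset names, and `./textcat_output` is just an example path):

```
python -m prodigy train ./textcat_output --textcat textcat_data,correct_data
```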
I think you'll uncover some insights by correcting examples (and improve your model, too!).
Let me know if you make progress!