textcat.batch-train question

Hi, I am following this tutorial to explore textcat. When I run the command below, I got a "Can't find recipe or command 'textcat.batch-train'." error. I was wondering if this is the correct recipe to use? Thanks!

python -m prodigy textcat.batch-train textcat_insults en_vectors_web_lg --output insults-model --eval-split 0.2

hi @jiebei!

The batch-train recipes like ner.batch-train and textcat.batch-train were deprecated and removed as of v1.11. That video was created a long time ago, and that's the one issue with video tutorials: we can't fix them if we change our code later.

Instead, use the train recipe. This is a general-purpose command where you specify the type of component you want to train along with your training dataset, e.g. train --textcat [training_dataset], where [training_dataset] is the name of your training dataset.
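For example, a minimal call might look like the line below, assuming your annotations are stored in a dataset called textcat_insults and you want to save the trained pipeline to an ./insults-model folder (both names are just placeholders for your own):

python -m prodigy train ./insults-model --textcat textcat_insults --eval-split 0.2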

Let us know if you have any questions!

Thanks for pointing me to the updated docs. I used --textcat-multilabel for my two-label annotation data, and obtained the results attached below.

I am thinking of a few things to try next, e.g., revising my schema (adding more labels at the same level, or creating a nested schema with more labels), and then moving on to using the model on a large chat dataset for real categorization. Could you recommend a few steps to get me on track? Thanks again!

hi @jiebei!

I would recommend using textcat.correct to identify weak spots in your model.
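As a rough sketch, the call might look something like the line below; the dataset name, model path, and input file are placeholders, and the labels should be the ones you trained with:

python -m prodigy textcat.correct textcat_review ./insults-model/model-best ./new_chats.jsonl --label LABEL_ONE,LABEL_TWO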

The idea is to use your current model to look at examples and understand what problems the model is still having (i.e., whether there is a pattern in the incorrect examples).

Hopefully doing this will answer your question about the next step. For example, if you find the model struggling on examples that don't seem to fit either of your two labels, that may suggest the need to expand to three labels. Alternatively, if it looks like there are sub-categories within your existing categories, that's more indicative that you may want to create nested labels.

Hi Ryan, I have a follow-up question about using textcat. Let's say I have obtained a satisfactory model to categorize my chat data as Research and Nonsearch; how can I use the model to process my roughly 1 million chat records?
Also, if I move towards adding more categories, with two potential approaches (one level or two levels), how can I compare those two models down the road?

Thank you so much for your thoughts!

hi @jiebei!

So is your question: how to speed up your spaCy model to handle lots of text?

@koaning has a great video on using nlp.pipe:
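The short version is to load your trained pipeline and stream your texts through nlp.pipe, which processes them in batches instead of one at a time. A minimal sketch, where the model path, batch size, and example texts are all placeholders:

import spacy

# load the trained pipeline (path is a placeholder for your own output directory)
nlp = spacy.load("./insults-model/model-best")

# in practice, texts would be read from your chat dataset
texts = ["first chat message", "second chat message"]

# nlp.pipe batches the texts, which is much faster than calling nlp() on each one
for doc in nlp.pipe(texts, batch_size=1000):
    print(doc.cats)  # predicted category scores for this text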

As you probably realized, if you have different categorization schemes, it's a bit hard to compare accuracies across them. However, one resource is @koaning's spacy-report package, which lets you compare precision/recall tradeoffs per category. That makes it easier to evaluate each model's performance and to spot where either model may have weaknesses.


Alternatively, you could move towards more qualitative approaches by comparing examples. One idea could be to use spacy-streamlit to load both models and apply them to input text. It doesn't compare side-by-side (you'd need to select one model at a time) but it can still be informative.
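As a rough sketch (the model paths are placeholders, and you'd launch it with streamlit run app.py):

import spacy_streamlit

# paths to the two trained pipelines you want to inspect (placeholders)
models = ["./model_one_level/model-best", "./model_two_levels/model-best"]

# shows a dropdown to switch between the models and visualize their textcat predictions
spacy_streamlit.visualize(models, "Paste a chat message here", visualizers=["textcat"])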

Hope these help and let us know if you have further questions!

Thank you, it is so helpful!

With regards to implementing the second level of my schema, I am looking at this post. The explanation makes a lot of sense to me, but I am not sure how to implement it as suggested. I was wondering if there are any related code examples that I can refer to? Thank you!

Have you seen this post?

It outlines how to approach this with a custom recipe.