Best use of `textcat.teach`

I am experimenting with textcat.teach to see if it would be better to use this approach instead of textcat.manual.

I run the below:

prodigy textcat.teach my_test ja_core_news_lg text.jsonl --label Pos,Neg,Neut

text.jsonl contains 1060 lines, each with no preassigned classification.

I am expecting Prodigy to load each sentence in the interface and present the annotator with one of my labels, so the annotator can accept or reject each sentence/label pair.

However, after 60 sentences, I am presented with a "No tasks available." message.

Am I using this recipe in error, or is this behaviour expected in teach recipes?
Also, what is the termination criterion that results in "No tasks available." being displayed?

Looking at other support inquiries, it seems Prodigy does not load all sentences for annotation. I ran training on the 60 cases I annotated and got the results below.

They are disappointing, though given the limited number of training examples that's unsurprising. But why does Prodigy conclude that 60 annotations are sufficient?

$ prodigy train textcat my_test ja_core_news_lg
✔ Loaded model 'ja_core_news_lg'
Created and merged data for 60 total examples
Using 30 train / 30 eval (split 50%)
Component: textcat | Batch size: compounding | Dropout: 0.2 | Iterations: 10
ℹ Baseline accuracy: -1.000

=========================== ✨  Training the model ===========================

#    Loss       F-Score 
--   --------   --------
1    27.03      -1.000                                                                                                                        
...
10   4.34       -1.000                                                                                                                        
============================= ✨  Results summary =============================
Label    ROC AUC
------   -------
Other     0.528
Pos      -1.000
Neg      -1.000
Neut     -1.000

The idea of the textcat.teach recipe is that it uses the model in the loop to select the most relevant examples for annotation, based on the score (e.g. prioritising the examples with a score closest to 0.5, as those may be the most "uncertain" predictions). This also means that the recipe will skip examples with high and low scores, so you're not going to see all examples in your dataset. The recipe uses an exponential moving average to decide which scores to consider. This prevents Prodigy from getting stuck if the model ends up in a state where it produces mostly high or low scores.
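To make the selection logic concrete, here is a minimal sketch of uncertainty sampling with an exponential moving average. This is an illustration of the idea only, not Prodigy's actual implementation; all function names and the alpha value are made up:

```python
# Sketch of uncertainty sampling with an exponential moving average (EMA).
# NOT Prodigy's real code -- just an illustration of the selection idea.

def uncertainty(score):
    """Distance-based uncertainty: 1.0 when the score is exactly 0.5,
    approaching 0.0 as the score nears 0.0 or 1.0."""
    return 1.0 - abs(score - 0.5) * 2.0

def filter_uncertain(scored_examples, alpha=0.1, start_avg=0.5):
    """Yield examples whose uncertainty beats a moving average of recent
    uncertainties. Because the threshold adapts, the stream does not get
    stuck if the model starts producing mostly very high/low scores."""
    avg = start_avg
    for score, example in scored_examples:
        u = uncertainty(score)
        if u >= avg:
            yield example
        # Update the EMA whether or not the example was selected.
        avg = alpha * u + (1 - alpha) * avg

stream = [(0.95, "clearly positive"), (0.52, "ambiguous"),
          (0.05, "clearly negative"), (0.48, "also ambiguous")]
selected = list(filter_uncertain(stream))
# The confident predictions (0.95, 0.05) are skipped; the ambiguous
# ones (0.52, 0.48) are sent to the annotator.
```

Note how the two confidently-scored examples never reach the annotator, which is exactly why you see only a subset of your 1060 lines.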

If you're starting completely from scratch with a new model, and the labels you're annotating might not be equally distributed, this workflow can be less effective because the model knows nothing yet. It would take a very long time to collect enough examples of all labels to teach it something meaningful, so that it can actually "participate" properly.

So it might make sense to start with a manual workflow like textcat.manual and annotate a small sample from scratch. You can then pretrain your model on that to give it a head-start. It can also help to use --patterns on textcat.teach to make sure that pattern matches are always shown if they occur (e.g. to show examples that may be part of rarer classes).
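For the --patterns option, the recipe expects a JSONL file of match patterns, one per line, each with a "label" and a "pattern" (either a spaCy Matcher token list or an exact-phrase string). The snippet below writes a small hypothetical patterns file; the example words are placeholders, not suggestions for your Japanese data:

```python
# Write a small, hypothetical patterns.jsonl for use with
# `prodigy textcat.teach ... --patterns patterns.jsonl`.
# The labels match the question; the pattern words are made up.
import json

patterns = [
    # Token-based patterns (spaCy Matcher syntax, one dict per token)
    {"label": "Pos", "pattern": [{"lower": "great"}]},
    {"label": "Neg", "pattern": [{"lower": "terrible"}]},
    # String patterns match an exact phrase
    {"label": "Neut", "pattern": "as expected"},
]

with open("patterns.jsonl", "w", encoding="utf8") as f:
    for p in patterns:
        f.write(json.dumps(p) + "\n")
```

With patterns in place, any stream example containing a match is always shown, which helps surface rarer classes even while the model is still too weak to score them well.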