I have a news dataset of 70K headlines and I'm trying to categorize them by topic (10 topics in total). I store the headlines in a .txt file and also pass in a set of patterns in JSONL format.
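For context, this is roughly how I generate the patterns file (the labels and file names here are just illustrative):

```python
# Rough sketch of how I build the patterns file; labels are placeholders.
import srsly

patterns = [
    {"label": "SPORTS", "pattern": [{"lower": "football"}]},
    {"label": "POLITICS", "pattern": [{"lower": "election"}]},
    # ... a handful of patterns for each of the 10 topics
]
srsly.write_jsonl("topic_patterns.jsonl", patterns)
```

I then start annotating with something like `prodigy textcat.teach news_topics en_core_web_lg headlines.txt --label SPORTS,POLITICS --patterns topic_patterns.jsonl`, but the session keeps showing "No tasks available" after only a few annotations.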
Hi! If your .txt actually has 70K unique lines, it definitely sounds a bit suspicious that it exits so early. Skipping examples is expected (and kind of the point of using textcat.teach), but it should only happen once you've moved through all the batches. When you see "No tasks available" and reload the browser, do you get new examples? If so, that could indicate that something on the server is taking too long, so the queue runs out before it's refilled with new examples.
Otherwise, what you describe here sounds similar to the question in this thread, so I'd probably also give similar advice:
If you have a lot of patterns, I think it'd probably be a good idea in your case to start with some basic pattern matching with prodigy match, collect some initial data to give the model a head start, and get a feeling for how common/rare the individual labels are.
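Just to illustrate, here's a minimal sketch of that kind of pattern matching outside of Prodigy, using spaCy's Matcher directly (file names and labels are placeholders, and it assumes token-based patterns and the spaCy v3 Matcher API). It gives you a rough count of how often each label's patterns fire across your headlines:

```python
# Minimal sketch: count how often each topic's patterns match across the
# headlines, to get a feel for how common/rare the labels are.
# File names and labels are placeholders, not your actual setup.
from collections import Counter

import spacy
import srsly
from spacy.matcher import Matcher

nlp = spacy.load("en_core_web_lg")
matcher = Matcher(nlp.vocab)

# Patterns in the same JSONL format Prodigy expects, e.g.
# {"label": "SPORTS", "pattern": [{"lower": "football"}]}
for entry in srsly.read_jsonl("topic_patterns.jsonl"):
    matcher.add(entry["label"], [entry["pattern"]])

counts = Counter()
with open("headlines.txt", encoding="utf8") as f:
    texts = (line.strip() for line in f if line.strip())
    for doc in nlp.pipe(texts):
        for match_id, start, end in matcher(doc):
            counts[nlp.vocab.strings[match_id]] += 1

for label, n in counts.most_common():
    print(label, n)
```

If some labels barely fire at all, that's a good sign you'll want more or broader patterns for them before relying on textcat.teach to surface those topics.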
I do indeed get new examples after a reload. For now I've been annotating 20-30 examples and then reloading to get a new batch. Could it be because I'm using the large model, en_core_web_lg?
Ah, that's interesting! The model size shouldn't make a big difference, because it shouldn't be that much slower at inference; it just takes slightly longer on the initial load.
How long are your texts on average? If they're quite long, can you split them into smaller chunks? You could also experiment with a larger batch_size (e.g. in your prodigy.json), so more examples are queued up at once.
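For the splitting, something like this would do (a rough sketch with placeholder file names; it writes JSONL that Prodigy can load directly, and uses a small model just for sentence boundaries):

```python
# Rough sketch: split long texts into sentence-sized chunks and save as
# JSONL for Prodigy. File names here are placeholders.
import spacy
import srsly

nlp = spacy.load("en_core_web_sm")  # only used for sentence segmentation

def chunks(path):
    with open(path, encoding="utf8") as f:
        texts = (line.strip() for line in f if line.strip())
        for doc in nlp.pipe(texts):
            for sent in doc.sents:
                yield {"text": sent.text}

srsly.write_jsonl("headlines_chunked.jsonl", chunks("long_texts.txt"))
```

A larger batch_size gives the server more of a buffer, since more examples are queued up in the app while the next batch is still being scored.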
I've just labelled 1,000 examples (I love Prodigy!) and have noticed that after a while, the number of annotations I get through before "No tasks available" pops up varies much more widely. Sometimes it appears as little as 2 or 3 annotations in, other times as many as 100+. To be honest, it's not really that annoying at all - I just reload the page whenever it happens and get back to annotating.