No tasks available in v1.10 - textcat.teach

I have a news dataset of 70K headlines and I am trying to categorize them by topic (10 in total). I store the headlines in a txt file and also pass a set of patterns in jsonl format.

This is what I use to run prodigy:

prodigy textcat.teach my_dataset en_core_web_lg data/my-data.txt --label t1,t2,t3,... --patterns data/my-patterns.jsonl --loader txt

I associate around 20 pattern words with each topic.
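For reference, here is a minimal sketch of how a patterns file like this could be generated. The topic names and keywords are hypothetical placeholders (the real ones are not shown in this thread), and it assumes one token-based pattern per keyword, using the `label`/`pattern` JSONL layout that Prodigy's match patterns follow.

```python
import json

# Hypothetical topic -> keyword mapping, for illustration only.
TOPIC_KEYWORDS = {
    "t1": ["election", "senate"],
    "t2": ["football", "league"],
}


def build_patterns(topic_keywords):
    """Return one token-based match pattern per keyword."""
    patterns = []
    for label, words in topic_keywords.items():
        for word in words:
            # A multi-word keyword becomes one token dict per word.
            # Splitting on whitespace only approximates real tokenization.
            tokens = [{"lower": tok.lower()} for tok in word.split()]
            patterns.append({"label": label, "pattern": tokens})
    return patterns


def write_jsonl(path, records):
    """Write one JSON object per line."""
    with open(path, "w", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record) + "\n")


patterns = build_patterns(TOPIC_KEYWORDS)
```

Calling `write_jsonl("data/my-patterns.jsonl", patterns)` would then produce the file passed to `--patterns`.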

Irrespective of whether or not I pass the patterns JSONL, I keep getting a "No tasks available" prompt after around 20-30 annotations.

With 10 topics and 70K headlines, could it be that the active learning isn't good enough to disambiguate between all the topics?

Many thanks for any help! :blush:

Hi! If your .txt actually has 70k unique lines, it definitely sounds a bit suspicious that it exits so early. Skipping examples is expected (and kind of the point of using textcat.teach), but it should only happen once you've moved through all the batches. When you see "no tasks available" and reload the browser, do you get new examples? This could indicate that something on the server is taking too long and the queue runs out before it gets new examples :thinking:
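To rule out the first possibility, a quick check along these lines could confirm how many unique lines the file really has. This is just a sketch: the helper names are made up for illustration, and it treats lines as identical after stripping surrounding whitespace.

```python
from collections import Counter


def line_stats(lines):
    """Count total, non-empty, and unique non-empty lines."""
    cleaned = [line.strip() for line in lines]
    non_empty = [line for line in cleaned if line]
    counts = Counter(non_empty)
    return {
        "total": len(cleaned),
        "non_empty": len(non_empty),
        "unique": len(counts),
        # The most frequent lines, handy for spotting repeated headlines.
        "most_common": counts.most_common(3),
    }


def file_line_stats(path):
    """Same statistics, read from a text file with one headline per line."""
    with open(path, encoding="utf-8") as f:
        return line_stats(f)
```

For example, `file_line_stats("data/my-data.txt")` should report a `unique` count close to 70,000 if the headlines really are distinct.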

Otherwise, what you describe here sounds similar to the question in this thread so I'd probably also give similar advice:

If you have a lot of patterns, I think it'd probably be a good idea in your case to start with some basic pattern matching with prodigy match, collect some initial data to give the model a head-start and get a feeling for how common/rare the individual labels are.

I do indeed get new examples after a reload. For now I have been annotating 20-30 examples and then reloading to get a new batch. Could it be because I'm using the large model en_core_web_lg? :thinking:

Ah, that's interesting! The model size shouldn't make a big difference, because it shouldn't be that much slower at inference. It just takes slightly longer on initial load :thinking:

How long are your texts on average? If they are quite long, can you split them into smaller chunks? You could also experiment with a larger batch_size, so more examples are queued up at once.
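In case it helps, here's a rough sketch of setting batch_size from a script. It assumes Prodigy picks up a prodigy.json from the directory you launch it from; the helper name and the value 50 are only examples, not recommendations.

```python
import json
from pathlib import Path


def set_batch_size(config_path, batch_size):
    """Set "batch_size" in a JSON config file, keeping any other settings."""
    path = Path(config_path)
    # Start from the existing config if there is one, otherwise from scratch.
    config = json.loads(path.read_text(encoding="utf-8")) if path.exists() else {}
    config["batch_size"] = batch_size
    path.write_text(json.dumps(config, indent=2) + "\n", encoding="utf-8")
    return config
```

Running `set_batch_size("prodigy.json", 50)` and then restarting the server should queue up more examples per batch, if the assumption about the config location holds.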

Text sizes are small, ~5-50 tokens. Here is their distribution:

I've just labelled 1,000 examples (I love prodigy! :tada:) and have discovered that after a while the number of annotations I get through before "No tasks available" pops up begins to vary more widely. Sometimes it appears as little as 2 or 3 annotations in, other times as many as 100+. To be honest, it's not really that annoying at all - I just reload the page whenever it happens and get back to annotating.

I'll look into the batch_size parameter next.
