I have a news dataset of 70K headlines and I'm trying to categorize them by topic (10 topics in total). I store the headlines in a .txt file and also pass in a set of patterns in JSONL format.
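For context, this is roughly how I generate the patterns file (the labels and file names here are just illustrative):

```python
# Rough sketch of how I build the patterns file; labels are placeholders.
import srsly

patterns = [
    {"label": "SPORTS", "pattern": [{"lower": "football"}]},
    {"label": "POLITICS", "pattern": [{"lower": "election"}]},
    # ... a handful of patterns for each of the 10 topics
]
srsly.write_jsonl("topic_patterns.jsonl", patterns)
```

I then start annotating with something like `prodigy textcat.teach news_topics en_core_web_lg headlines.txt --label SPORTS,POLITICS --patterns topic_patterns.jsonl`, but the session keeps showing "No tasks available" after only a few annotations.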
Hi! If your .txt actually has 70K unique lines, it definitely sounds a bit suspicious that it exits so early. Skipping examples is expected (and kind of the point of using textcat.teach), but it should only happen once you've moved through all the batches. When you see "No tasks available" and reload the browser, do you get new examples? If so, that could indicate that something on the server is taking too long, so the queue runs out before it's refilled with new examples.
Otherwise, what you describe here sounds similar to the question in this thread, so I'd probably also give similar advice:
If you have a lot of patterns, I think it'd probably be a good idea in your case to start with some basic pattern matching with prodigy match, collect some initial data to give the model a head start, and get a feeling for how common/rare the individual labels are.
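Just to illustrate, here's a minimal sketch of that kind of pattern matching outside of Prodigy, using spaCy's Matcher directly (file names and labels are placeholders, and it assumes token-based patterns and the spaCy v3 Matcher API). It gives you a rough count of how often each label's patterns fire across your headlines:

```python
# Minimal sketch: count how often each topic's patterns match across the
# headlines, to get a feel for how common/rare the labels are.
# File names and labels are placeholders, not your actual setup.
from collections import Counter

import spacy
import srsly
from spacy.matcher import Matcher

nlp = spacy.load("en_core_web_lg")
matcher = Matcher(nlp.vocab)

# Patterns in the same JSONL format Prodigy expects, e.g.
# {"label": "SPORTS", "pattern": [{"lower": "football"}]}
for entry in srsly.read_jsonl("topic_patterns.jsonl"):
    matcher.add(entry["label"], [entry["pattern"]])

counts = Counter()
with open("headlines.txt", encoding="utf8") as f:
    texts = (line.strip() for line in f if line.strip())
    for doc in nlp.pipe(texts):
        for match_id, start, end in matcher(doc):
            counts[nlp.vocab.strings[match_id]] += 1

for label, n in counts.most_common():
    print(label, n)
```

If some labels barely fire at all, that's a good sign you'll want more or broader patterns for them before relying on textcat.teach to surface those topics.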
I do indeed get new examples after a reload. For now I've been annotating 20-30 examples and then reloading to get a new batch. Could it be because I'm using the large model, en_core_web_lg?
Ah, that's interesting! The model size shouldn't make a big difference, because it shouldn't be that much slower at inference; it just takes slightly longer on the initial load.
How long are your texts on average? If they're quite long, can you split them into smaller chunks? You could also experiment with a larger batch_size (e.g. in your prodigy.json), so more examples are queued up at once.
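For the splitting, something like this would do (a rough sketch with placeholder file names; it writes JSONL that Prodigy can load directly, and uses a small model just for sentence boundaries):

```python
# Rough sketch: split long texts into sentence-sized chunks and save as
# JSONL for Prodigy. File names here are placeholders.
import spacy
import srsly

nlp = spacy.load("en_core_web_sm")  # only used for sentence segmentation

def chunks(path):
    with open(path, encoding="utf8") as f:
        texts = (line.strip() for line in f if line.strip())
        for doc in nlp.pipe(texts):
            for sent in doc.sents:
                yield {"text": sent.text}

srsly.write_jsonl("headlines_chunked.jsonl", chunks("long_texts.txt"))
```

A larger batch_size gives the server more of a buffer, since more examples are queued up in the app while the next batch is still being scored.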
I've just labelled 1,000 examples (I love Prodigy!) and have noticed that after a while, the number of annotations I get through before "No tasks available" pops up varies much more widely. Sometimes it appears as little as 2 or 3 annotations in, other times as many as 100+. To be honest, it's not really that annoying at all - I just reload the page whenever it happens and get back to annotating.