I am using textcat.teach at the moment, with the end goal of training a model.
I have collected a whole bunch of patterns (2,000 in total, split across 10 mutually exclusive topics), which I generated using sense2vec with some seed terms.
Now I have a stream of 65,000 sentences showing up in Prodigy, where we are also using the patterns file created above.
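For reference, here's roughly what the patterns file looks like – the labels, phrases and file names below are just placeholders, not my real topics:

```python
import srsly

# Placeholder patterns: one JSON object per line, in the format that
# --patterns expects (the real ones were generated from sense2vec seed terms)
patterns = [
    # token-based pattern using spaCy Matcher syntax
    {"label": "SPORTS", "pattern": [{"lower": "premier"}, {"lower": "league"}]},
    # exact string match
    {"label": "FINANCE", "pattern": "interest rate"},
]

# write one JSON object per line to a JSONL file
srsly.write_jsonl("topic_patterns.jsonl", patterns)
```

The recipe is then started with something along these lines (dataset and model names are just examples): `prodigy textcat.teach my_dataset en_core_web_sm sentences.jsonl --label SPORTS,FINANCE --patterns topic_patterns.jsonl`.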
What's happening is that I annotate a few tasks, and then it says there are no tasks available after 10 or 20 or so. At that point I had only done 20 out of 65,000! So I refreshed the page and there were more tasks.
So now I am annotating until the tasks run out and then refreshing Prodigy for a new batch, saving along the way.
Is this normal or expected behaviour?
Am I doing something wrong?
If this is a bug, or something caused by the large number of patterns or the size of the input file, is it okay if I just keep doing this?
This is definitely not expected behaviour – I remember something similar coming up in a case where the stream ended up timing out and sending an empty batch because processing the next batch was taking very long.
The number of patterns shouldn't be a problem in this case, because it's not actually that many in the scheme of things. But I do wonder if spaCy's matcher uses a larger batch size somewhere and pre-processes more examples than needed for the next batch. Prodigy v1.10.8 introduces a batch size setting for the PatternMatcher that should prevent this, but if you're on a previous version, it could be related to that.
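To illustrate the general idea (this is a generic sketch, not Prodigy's actual implementation): if the matcher consumes its input in large fixed-size chunks, nothing comes back to the app until a whole chunk has been scored, which can look like an empty stream if that takes too long.

```python
from itertools import islice

def batched_matches(stream, matcher, batch_size=1000):
    # Generic sketch: score examples in fixed-size chunks. If batch_size is
    # large and scoring is slow, the first task of each chunk only becomes
    # available after the whole chunk has been processed, which the web app
    # can interpret as "no tasks available" even though examples remain.
    while True:
        chunk = list(islice(stream, batch_size))
        if not chunk:
            return
        yield from matcher(chunk)
```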
Alternatively: Which file format are you using to load your examples? Are you saving them to a JSONL that's read in line-by-line? The fact that the size of the input file makes a difference is definitely interesting...
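For context, a JSONL input can be streamed lazily, one line at a time, so the total file size alone shouldn't cause this – a minimal sketch of such a loader (the file name is just an example):

```python
import srsly

def stream_examples(path="sentences.jsonl"):
    # srsly.read_jsonl returns a generator, so lines are parsed one at a time
    # and the full 65k-line file is never held in memory at once
    for eg in srsly.read_jsonl(path):
        yield {"text": eg["text"]}
```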
Using JSONL for the pattern files. But you're right, the actual file size is not that significant.
Using v1.10.8
An update (sort of): with logging enabled I can see that it is getting the next batch. What I am doing now is, when it shows no tasks available, I just go back one answer (del/undo) and then accept or reject it again, and the next batch of the stream loads. I mean, it works! ¯\_(ツ)_/¯