I am utilizing textcat.teach at the moment, with the end goal of training a model.
I have collected a whole bunch of patterns (2,000 patterns, split across 10 mutually exclusive topics), which I created using sense2vec with some seed terms.
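For context, the pattern files are in Prodigy's JSONL patterns format, where each line is a JSON object with a `label` and a spaCy token `pattern`. A minimal sketch of what the lines look like (the labels and terms below are illustrative placeholders, not my actual topics):

```python
import json

# Two example lines in the JSONL patterns format.
# Each "pattern" is a list of spaCy token attribute dicts.
pattern_lines = [
    '{"label": "SPORTS", "pattern": [{"lower": "world"}, {"lower": "cup"}]}',
    '{"label": "FINANCE", "pattern": [{"lower": "interest"}, {"lower": "rates"}]}',
]

for line in pattern_lines:
    entry = json.loads(line)
    # Every line must have both keys for the matcher to use it.
    assert "label" in entry and "pattern" in entry
    print(entry["label"], entry["pattern"])
```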
Now I have a stream of sentences (65,000) showing up in Prodigy, where we are also using the patterns file created above.
What's happening is that I annotate a few tasks, and then it says "No tasks available" after 10 or 20 or so. But at that point I had only done 20 out of 60,000! So I refreshed the page and there were more tasks.
So now I am annotating until the tasks run out and then refreshing Prodigy for a new batch, saving along the way.
Is this normal or expected behaviour?
Am I doing something wrong?
If this is a bug, or an artifact of the large number of patterns or input examples, is it okay if I just keep doing this?
Just an update here: I had loaded the wrong dataset into Prodigy for annotation. The one I had used had 600,000 entries instead of 60,000.
Still the same thing, albeit a bit better.
Hi! Which version of Prodigy are you using?
This is definitely not expected behaviour – I remember something similar coming up in a case where the stream timed out and sent an empty batch because processing the next batch was taking very long.
The number of patterns shouldn't be a problem in this case, because it's not actually that much in the scheme of things. But I do wonder if spaCy's matcher uses a larger batch size somewhere and pre-processes more examples than needed for the next batch. Prodigy v1.10.8 introduces a batch size setting for the PatternMatcher that should prevent this, but if you're on an earlier version, maybe it could be related to that.
Alternatively: Which file format are you using to load your examples? Are you saving them to a JSONL that's read in line-by-line? The fact that the size of the input file makes a difference is definitely interesting...
Using JSONL for the pattern files. But you're right, the actual file size is not that significant.
An update (sort of): with logging enabled I can see that it is getting the next batch. What I'm doing now is, when it shows no tasks available, I just go back one (del/undo) and then accept or reject it again, and the next batch loads. I mean, it works! ¯\\_(ツ)_/¯