textcat.teach stream of data is linear?

hi @OtterMe!

Thanks for your question and welcome to the Prodigy community :wave:

Hmm... yes, that doesn't seem right. Here's a good post that describes how Prodigy's default active learning strategies work:

Can you provide the code and/or a reproducible example? Ideally I'd need the model as well, which I realize isn't possible to share, but we can still debug without it.

Just curious, how are you structuring your data?

If it were me, for very long documents, I'd start with something like this:

{"text": "This is my first sentence.", "meta": {"document": "A", "page": 0, "sentence": 0}}
{"text": "This is my second sentence.", "meta": {"document": "A", "page": 0, "sentence": 1}}
...
{"text": "This is a sentence later on.", "meta": {"document": "D", "page": 100, "sentence": 14}}
...
{"text": "This is my last sentence.", "meta": {"document": "Z", "page": 999, "sentence": 99}}

Each "text" value is a single sentence (e.g., perhaps you ran split_sentences before starting textcat.manual). By also storing the document/page/sentence in "meta", you can always trace a sentence back to where it came from, no matter what order the examples are shown in.
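In case it helps, here's a rough sketch of how you could build records in that shape with plain Python. The function names (split_sentences_naive, document_to_records, to_jsonl) are made up for this example, and the regex splitter is only a crude stand-in for a real sentence splitter such as spaCy's. I'm also assuming the sentence counter restarts on every page.

```python
import json
import re


def split_sentences_naive(text):
    """Crude splitter: break after ., ! or ? followed by whitespace.
    Hypothetical stand-in for a proper sentence segmenter."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def document_to_records(doc_id, pages):
    """Yield one task dict per sentence, tagged with where it came from."""
    for page_no, page_text in enumerate(pages):
        for sent_no, sentence in enumerate(split_sentences_naive(page_text)):
            yield {
                "text": sentence,
                "meta": {"document": doc_id, "page": page_no, "sentence": sent_no},
            }


def to_jsonl(records):
    """Serialize records as JSONL: one JSON object per line."""
    return "\n".join(json.dumps(r) for r in records)


pages = [
    "This is my first sentence. This is my second sentence.",
    "A sentence on the next page.",
]
records = list(document_to_records("A", pages))
print(to_jsonl(records))
```

Since json.dumps writes the output, every key (including "meta") is properly quoted, so the file should load as valid JSONL.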

If you're still having issues, you can look "under the hood" at what textcat.teach is doing to debug what's going wrong (or to customize it). You can find a simple version of textcat.teach on our projects repo. That may be a good one to try first, as it's easier to experiment with (e.g., trying different sorters).
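To give a feel for what a sorter does, here's a toy illustration in plain Python. It is not Prodigy's implementation (the real sorters are more involved): toy_prefer_uncertain and its threshold parameter are invented for this sketch. The only thing it borrows is the general idea of consuming (score, example) tuples and yielding the examples the model is least sure about.

```python
def toy_prefer_uncertain(scored_stream, threshold=0.5):
    """Yield examples whose score is close to 0.5 (i.e. most uncertain).

    Hypothetical simplification for illustration only.
    """
    for score, example in scored_stream:
        # 1.0 when score == 0.5, falling to 0.0 when score is 0 or 1
        uncertainty = 1.0 - abs(score - 0.5) * 2
        if uncertainty >= threshold:
            yield example


scored = [
    (0.95, {"text": "a"}),  # confident, skipped
    (0.55, {"text": "b"}),  # uncertain, kept
    (0.10, {"text": "c"}),  # confident, skipped
    (0.40, {"text": "d"}),  # uncertain, kept
]
picked = list(toy_prefer_uncertain(scored))
print([example["text"] for example in picked])
```

The point is that a sorter like this skips examples the model is already confident about, so what you see while annotating is usually not every example from your file.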

Alternatively, you can look at the actual recipe by finding the textcat.py file inside your installed Prodigy library. To locate the library, run python -m prodigy stats, find the Location: entry in the output, and then look in the /recipes folder under that path.
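If you'd rather find that path from Python, something like the sketch below should work using only the standard library. find_package_dir is a helper name I made up for this example. It assumes Prodigy is installed as a regular package in the environment you run it from.

```python
import importlib.util
from pathlib import Path


def find_package_dir(package_name):
    """Return the installed package's directory, or None if it isn't found.

    Hypothetical helper; it looks the package up without importing it.
    """
    try:
        spec = importlib.util.find_spec(package_name)
    except (ImportError, ValueError):
        return None
    if spec is None or not spec.submodule_search_locations:
        return None
    return Path(list(spec.submodule_search_locations)[0])


package_dir = find_package_dir("prodigy")
if package_dir is None:
    print("prodigy is not installed in this environment")
else:
    # The recipe file mentioned above lives in the recipes folder
    print(package_dir / "recipes" / "textcat.py")
```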