textcat.teach stream of data is linear?

Hello!
I'm currently working on a sentence classifier. My data pool is a dataset of sentences ("text") extracted from different documents/pages ("meta").
After using the manual recipe and then training my model, I move on to the teach recipe.
I would expect the sentences to be ambiguous ones, with scores around 0.5, drawn from different documents/pages. However, the stream of data is very linear (document A page 1 -> 2 -> 3... document B page 1 -> 2...) with heterogeneous probabilities.
Is there something I'm not understanding that stops Prodigy from streaming the data inside a teach job as document D page 100 -> document Z page 1 -> document A page 10, with all probabilities around 0.5 or so?
Thank you very much for your help!

hi @OtterMe!

Thanks for your question and welcome to the Prodigy community :wave:

Hmm... yes, that doesn't seem right. Here's a good post that describes how Prodigy's default active learning strategies work:
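
In short, the default sorter (prefer_uncertain) doesn't load your whole dataset and re-order it globally; it wraps the scored stream and decides example by example whether to show it, based on an exponential moving average of uncertainty. Here's a very simplified sketch of the idea (not the actual implementation):

# Toy sketch of the prefer_uncertain idea -- NOT the actual Prodigy code.
# It consumes a stream of (score, example) tuples and emits examples whose
# uncertainty beats a running average, so it filters the incoming stream
# rather than re-ordering it globally.
def toy_prefer_uncertain(scored_stream, smoothing=0.9):
    avg_uncertainty = 0.5  # exponential moving average of recent uncertainty
    for score, example in scored_stream:
        uncertainty = 1.0 - abs(score - 0.5) * 2  # 1.0 at score 0.5, 0.0 at 0 or 1
        avg_uncertainty = smoothing * avg_uncertainty + (1 - smoothing) * uncertainty
        if uncertainty >= avg_uncertainty:
            yield example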

Can you provide the code and/or a reproducible example? Obviously I'd also need the model, which isn't possible to share, but we can still debug.

Just curious, how are you structuring your data?

If it were me, for very long documents, I'd start with something like this:

{"text": "This is my first sentence.", meta: {"document": "A", "page": 0, "sentence": 0}}
{"text": "This is my second sentence.", meta: {"document": "A", "page": 0, "sentence": 1}}
...
{"text": "This is a sentence later on.", meta: {"document": "D", "page": 100, "sentence": 14}}
...
{"text": "This is my last sentence.", meta: {"document": "Z", "page": 999, "sentence": 99}}

All of the "text" keys are by sentence (e.g., perhaps you ran split_sentences before starting textcat.manual). Also by putting the document/page/sentence, hopefully you could always identify where the sentence is.

If you're still having issues, you can look "under the hood" at what textcat.teach is doing to debug what's going wrong (or customize it). You can find a simple version of textcat.teach on our projects repo. This may be a good one to try, as it's easier to play around with (e.g., trying different sorters).
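
For reference, a stripped-down recipe along those lines might look like this (a sketch only, loosely modelled on the simple recipe in the projects repo; the recipe name textcat.teach-simple and the argument names are made up here):

import prodigy
import spacy
from prodigy.components.loaders import JSONL
from prodigy.components.sorters import prefer_uncertain


@prodigy.recipe(
    "textcat.teach-simple",
    dataset=("Dataset to save annotations to", "positional", None, str),
    spacy_model=("Trained spaCy pipeline with a text classifier", "positional", None, str),
    source=("Path to the source JSONL", "positional", None, str),
    label=("Label to annotate", "option", "l", str),
)
def textcat_teach_simple(dataset, spacy_model, source, label):
    nlp = spacy.load(spacy_model)

    def score_stream(examples):
        # Attach the model's score for the label to each incoming example.
        for eg in examples:
            score = nlp(eg["text"]).cats.get(label, 0.0)
            eg["label"] = label
            yield (score, eg)

    stream = JSONL(source)
    return {
        "dataset": dataset,
        # Swap prefer_uncertain for another sorter (e.g. prefer_high_scores)
        # to experiment with what gets shown first.
        "stream": prefer_uncertain(score_stream(stream)),
        "view_id": "classification",
    }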

Alternatively, you can look at the actual recipe by finding the textcat.py recipe inside your installed Prodigy library. You can find the location of that library from the Location: line in python -m prodigy stats and then looking within its /recipes folder.

Thank you @ryanwesslen for such a quick reply !

My input data looks like this, with about 10^5 entries:

{"text":"Sentence extracted using split_sentences indeed.","meta":{"document":A,"page":"1"}}

First I ran:

 prodigy textcat.manual annotated_db ./assets/input_data.jsonl --label MY_LABEL

Then:

python -m prodigy train ./output_dir --eval-split 0.2 --textcat-multilabel annotated_db --base-model en_core_web_sm

Finally:

prodigy textcat.teach ./output_dir/model-best annotated_db ./assets/input_data.jsonl --label MY_LABEL

The model is actually quite good after the manual step (about 200 annotations and a 0.8 score), but as mentioned, the behaviour during the teach step is the same as in manual. I do get a score for each sentence, but it reads my jsonl from entry 1 to entry n.

I'll take a look under the hood, because I might have missed something; I hadn't come across the prefer_uncertain sorter until now.
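
As a first sanity check, I'll probably score a slice of the stream directly with the trained model to see how many sentences actually land near 0.5. Something like this (rough sketch, using the paths and label from my commands above):

import json
import spacy

# Load the model trained in the previous step and score the first 500
# sentences from the input file.
nlp = spacy.load("./output_dir/model-best")

scores = []
with open("./assets/input_data.jsonl", encoding="utf8") as f:
    for i, line in enumerate(f):
        if i >= 500:
            break
        eg = json.loads(line)
        scores.append(nlp(eg["text"]).cats.get("MY_LABEL", 0.0))

uncertain = [s for s in scores if 0.35 <= s <= 0.65]
print(f"{len(uncertain)}/{len(scores)} sentences score between 0.35 and 0.65")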