Hi @info2000!
For your three new questions, I wasn't sure whether you're asking only about `textcat.teach` or about Prodigy in general. The answers below should be the same for both.
I'm sorry, I don't understand the question. Can you rephrase?
If you're interested in the default deduplication behavior, you may find the docs for the `filter_duplicates` function helpful, along with the answers below.
Two texts whose only difference is one space would have different values for their `input_hash`. Since they have different `input_hash` values (and also different `task_hash` values), they would be treated as two different samples.
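If it helps to see this concretely, here's a minimal sketch using `set_hashes`, the helper Prodigy uses to assign `_input_hash` and `_task_hash` to each task (the example texts are made up):

```python
from prodigy import set_hashes

# Two tasks whose texts differ only by one extra space
eg1 = set_hashes({"text": "the quick brown fox"})
eg2 = set_hashes({"text": "the quick  brown fox"})

# The whitespace difference changes the hashes, so Prodigy
# treats these as two distinct examples
print(eg1["_input_hash"], eg2["_input_hash"])
assert eg1["_input_hash"] != eg2["_input_hash"]
assert eg1["_task_hash"] != eg2["_task_hash"]
```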
Let's assume that for both sessions, you save your annotations to the same Prodigy dataset.
For annotation, records are deduplicated by `task_hash` by default; see the `"exclude_by"` field in the Configuration docs. Therefore, if your second session runs the same task (e.g., the same `textcat` recipe) you used for the first 100 samples, only 20 samples would be shown (the 30 repeated tasks would be excluded). But if the second annotation session used a different task (e.g., a different recipe like `ner`), then all 50 samples would be shown.
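To make that concrete, here's a small sketch of the default `filter_duplicates` behavior (the hash values are made up; in practice Prodigy assigns them via `set_hashes`):

```python
from prodigy.components.filters import filter_duplicates

stream = [
    {"text": "sample A", "_input_hash": 1, "_task_hash": 10},
    {"text": "sample A", "_input_hash": 1, "_task_hash": 10},  # repeated task -> excluded
    {"text": "sample A", "_input_hash": 1, "_task_hash": 20},  # same input, different task -> kept
]

# Default: dedupe by task hash only, i.e. exclude_by = "task"
deduped = list(filter_duplicates(stream, by_input=False, by_task=True))
print(len(deduped))  # 2
```

If you'd rather exclude anything with the same input regardless of the task, you can set `"exclude_by": "input"` in your `prodigy.json` instead.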
Sorry we haven't responded to this yet. Are you still interested in an explanation? As Vincent recommended, a reproducible example would make this a lot easier for us to debug.
I've found some past posts that may shed some light. While uncommon, it's not impossible to see low scores when using the default `ema` algorithm:
> if you get a long sequence of low-scoring examples, the probability sorter will ask you fewer questions from them, while the exponential moving average sorter will get impatient, and start asking you questions even if the scores are low.
As an alternative, you may want to consider using `algorithm = "probability"`, as the post you originally cited mentioned:
> On the other hand, if you know the target class is rare, you want the sorter to “believe” the scores much more. In this case the probability sorter is better.
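If you're writing a custom recipe, switching sorters is a one-line change. Here's a minimal sketch, where `scored_stream` is a hypothetical stand-in for your model's `(score, example)` predictions:

```python
from prodigy.components.sorters import prefer_uncertain

def scored_stream():
    # Stand-in for model.predict(stream): yields (score, example) tuples
    for text, score in [("text one", 0.02), ("text two", 0.51), ("text three", 0.97)]:
        yield score, {"text": text}

# Default is algorithm="ema"; the probability sorter "believes"
# the scores more, which can help when the target class is rare
stream = prefer_uncertain(scored_stream(), algorithm="probability")
for eg in stream:
    print(eg["text"])
```

Note that the probability sorter filters stochastically based on how far each score is from 0.5, so on any given run it may emit fewer examples than the input stream contains.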
I also found the discussion here to be helpful: