Hi @info2000!
For your three new questions, I wasn't sure whether you're asking only about `textcat.teach` or about Prodigy in general. The answers below should be the same for both.
I'm sorry, I don't understand the question. Can you rephrase?
If you're interested in the default deduplication behavior, you may find the docs for the `filter_duplicates` function helpful, along with the answers below.
Two texts whose only difference is one space would have different values for their `input_hash`. Since they have different `input_hash` values (and also different `task_hash` values), they would be treated as two different samples.
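If it helps to see this concretely, here's a minimal sketch using `set_hashes`, the helper Prodigy uses to assign `_input_hash` and `_task_hash` to each task (the example texts are made up):

```python
from prodigy import set_hashes

# Two tasks whose texts differ only by one extra space
eg1 = set_hashes({"text": "the quick brown fox"})
eg2 = set_hashes({"text": "the quick  brown fox"})

# The whitespace difference changes the hashes, so Prodigy
# treats these as two distinct examples
print(eg1["_input_hash"], eg2["_input_hash"])
assert eg1["_input_hash"] != eg2["_input_hash"]
assert eg1["_task_hash"] != eg2["_task_hash"]
```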
Let's assume that for both sessions, you save your annotations to the same Prodigy dataset.
For annotation, records are deduplicated by `task_hash` by default; see the `"exclude_by"` field in the Configuration docs. Therefore, if your second session runs the same task (e.g., the same `textcat` recipe) you used for the first 100 samples, only 20 samples would be shown (the 30 repeated tasks would be excluded). But if the second annotation session used a different task (e.g., a different recipe like `ner`), then all 50 samples would be shown.
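To make that concrete, here's a small sketch of the default `filter_duplicates` behavior (the hash values are made up; in practice Prodigy assigns them via `set_hashes`):

```python
from prodigy.components.filters import filter_duplicates

stream = [
    {"text": "sample A", "_input_hash": 1, "_task_hash": 10},
    {"text": "sample A", "_input_hash": 1, "_task_hash": 10},  # repeated task -> excluded
    {"text": "sample A", "_input_hash": 1, "_task_hash": 20},  # same input, different task -> kept
]

# Default: dedupe by task hash only, i.e. exclude_by = "task"
deduped = list(filter_duplicates(stream, by_input=False, by_task=True))
print(len(deduped))  # 2
```

If you'd rather exclude anything with the same input regardless of the task, you can set `"exclude_by": "input"` in your `prodigy.json` instead.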
Sorry we haven't responded to this yet. Are you still interested in an explanation? As Vincent recommended, a reproducible example would make this a lot easier for us to debug.
I've found some past posts that may shed some light. While uncommon, it's not impossible to see low scores when using the default `ema` algorithm:
> if you get a long sequence of low-scoring examples, the probability sorter will ask you fewer questions from them, while the exponential moving average sorter will get impatient, and start asking you questions even if the scores are low.
As an alternative, you may want to consider using `algorithm = "probability"`, as the post you originally cited mentioned:
> On the other hand, if you know the target class is rare, you want the sorter to “believe” the scores much more. In this case the probability sorter is better.
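If you're writing a custom recipe, switching sorters is a one-line change. Here's a minimal sketch, where `scored_stream` is a hypothetical stand-in for your model's `(score, example)` predictions:

```python
from prodigy.components.sorters import prefer_uncertain

def scored_stream():
    # Stand-in for model.predict(stream): yields (score, example) tuples
    for text, score in [("text one", 0.02), ("text two", 0.51), ("text three", 0.97)]:
        yield score, {"text": text}

# Default is algorithm="ema"; the probability sorter "believes"
# the scores more, which can help when the target class is rare
stream = prefer_uncertain(scored_stream(), algorithm="probability")
for eg in stream:
    print(eg["text"])
```

Note that the probability sorter filters stochastically based on how far each score is from 0.5, so on any given run it may emit fewer examples than the input stream contains.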
I also found the discussion here to be helpful: