Textcat - same data keeps appearing

Hi,

I am currently training a text-classification model and have already annotated about 300 texts with Prodigy. My concern is that when I resume annotating the next day, I see the same texts that I already accepted or rejected the day before.

So essentially, am I actually resuming my annotation process, or starting from scratch again?

Thanks!

Are you using the same dataset to store the annotations? Prodigy should automatically exclude the examples in the current dataset, so you shouldn't have to define --exclude name_of_dataset anymore.

The only thing to note here is that this will really only compare the exact questions – so if you're annotating different labels, you'd still see the same text but with a different label. Or if you're using patterns, you might see a different suggestion on the same text.
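To make the distinction between the two hashes concrete, here is a toy sketch in Python. It is not Prodigy's actual hashing code: the toy_hash and toy_hashes helpers, the md5-over-JSON scheme, and the example text and labels are all assumptions made up for illustration. It only mirrors the idea that one hash depends on the input (the text) while the other also depends on what is being asked about it (for example the label).

```python
import hashlib
import json


def toy_hash(payload):
    """Hash a JSON-serialisable payload (toy scheme, not Prodigy's)."""
    encoded = json.dumps(payload, sort_keys=True).encode("utf8")
    return hashlib.md5(encoded).hexdigest()


def toy_hashes(task):
    """Return (input_hash, task_hash) for a task dict under the toy scheme."""
    # The input hash only looks at the raw input, here the text.
    input_hash = toy_hash({"text": task["text"]})
    # The task hash also covers the question asked about that input.
    task_hash = toy_hash({"input": input_hash, "label": task.get("label")})
    return input_hash, task_hash


# Hypothetical tasks: same text, different label.
task_a = {"text": "Great product", "label": "POSITIVE"}
task_b = {"text": "Great product", "label": "NEGATIVE"}

input_a, question_a = toy_hashes(task_a)
input_b, question_b = toy_hashes(task_b)

print("same input hash:", input_a == input_b)
print("same task hash:", question_a == question_b)
```

Under this toy scheme the two tasks share an input hash but not a task hash, which is why the same text can legitimately come up again as a different question.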

If you're starting with a blank model, the same input data and the same dataset, you're not fully resuming: you're starting again, but you're skipping examples that are already in the dataset. If you want to start again with the same model state, you usually want to run textcat.batch-train and then load the updated model into textcat.teach. This should give you a slightly better version of the model you had previously updated in the loop.

Hi @ines, I am indeed using the same dataset to store the annotations. I have 679 annotations so far, and I believe almost 100-200 of them are duplicates, because I remember seeing the exact same text twice or even thrice.

I exported the dataset to a .jsonl file to verify. Just by picking one text, I found that the exact same chunk of text appears on line 308 and line 636 (I rejected it both times). Could there be any reason why?

The _input_hash and _task_hash are as follows:
Annotation 308:
_input_hash: 1962885252, _task_hash: 1962885252

Annotation 636:
_input_hash: 1962885252, _task_hash: 340824734

The _task_hash seems to be different, although I have made sure the label is the same in both.

Any idea? Thanks!
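For anyone wanting to run the same check over a whole export rather than by eye, a minimal sketch in Python follows. It assumes only that each line of the exported .jsonl file is a JSON object with _input_hash and _task_hash keys, as in the annotations above. The find_conflicting_duplicates helper and the sample records are made up for illustration.

```python
import json
from collections import defaultdict


def find_conflicting_duplicates(lines):
    """Return inputs that occur with more than one distinct _task_hash.

    Maps each such _input_hash to a list of (line_number, _task_hash).
    """
    groups = defaultdict(list)
    for line_number, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue  # tolerate blank lines in the export
        record = json.loads(line)
        groups[record["_input_hash"]].append((line_number, record["_task_hash"]))
    return {
        input_hash: entries
        for input_hash, entries in groups.items()
        if len({task_hash for _, task_hash in entries}) > 1
    }


# Hypothetical stand-in for the lines of an exported file.
sample_export = [
    '{"text": "a", "_input_hash": 1, "_task_hash": 10, "answer": "reject"}',
    '{"text": "b", "_input_hash": 2, "_task_hash": 20, "answer": "accept"}',
    '{"text": "a", "_input_hash": 1, "_task_hash": 11, "answer": "reject"}',
]

conflicts = find_conflicting_duplicates(sample_export)
for input_hash, entries in conflicts.items():
    print(input_hash, entries)
```

To use it on a real export, pass an open file handle instead of the sample list, since iterating over a file yields its lines.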

Thanks for looking into this! So the underlying reason Prodigy shows you the example is most likely that the _task_hashes are different. Those end up being used to decide whether two questions are the same or not. Now the main thing to investigate is why two identical examples would have received different hashes.

Can you share an example of those two annotations? You can leave out the "text" value if you want to (if you’ve confirmed that they’re both identical).
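One way to narrow this down before sharing anything is to diff the two records key by key while ignoring the text. The sketch below is plain Python and makes no use of Prodigy. The diff_records helper is hypothetical, and apart from the two hash values quoted earlier in the thread, the field values ("label", "answer") are placeholders.

```python
def diff_records(first, second, ignore=("text",)):
    """Return {key: (first_value, second_value)} for keys whose values differ."""
    keys = (set(first) | set(second)) - set(ignore)
    return {
        key: (first.get(key), second.get(key))
        for key in sorted(keys)
        if first.get(key) != second.get(key)
    }


# Hashes are the ones quoted in the thread; the other values are placeholders.
annotation_308 = {
    "text": "...",
    "label": "SOME_LABEL",
    "answer": "reject",
    "_input_hash": 1962885252,
    "_task_hash": 1962885252,
}
annotation_636 = {
    "text": "...",
    "label": "SOME_LABEL",
    "answer": "reject",
    "_input_hash": 1962885252,
    "_task_hash": 340824734,
}

differences = diff_records(annotation_308, annotation_636)
for key, (left, right) in differences.items():
    print(f"{key}: {left!r} != {right!r}")
```

Any key that shows up besides _task_hash is a candidate for what made the second example count as a different question.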