Textcat - same data keeps appearing

jsnleong · July 19, 2019, 8:32am

Hi,

I am currently training a text-classification model and have already annotated about 300 texts from my data with Prodigy. When I resumed annotating the next day, I saw the same texts that I had already accepted or rejected the day before.

So essentially, am I actually resuming my annotation process, or starting from scratch again?

Thanks!

ines · July 22, 2019, 2:58pm

Are you using the same dataset to store the annotations? Prodigy should automatically exclude the examples in the current dataset, so you shouldn't have to define --exclude name_of_dataset anymore.

The only thing to note here is that this will really only compare the exact questions – so if you're annotating different labels, you'd still see the same text but with a different label. Or if you're using patterns, you might see a different suggestion on the same text.
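To make the "exact questions" point concrete, here is a minimal sketch of excluding by task-level hash. It assumes a stand-in hash function (`toy_hash`, hypothetical and not Prodigy's actual hashing) where the input hash covers only the text and the task hash covers the text plus the label. The helper names and the example texts and labels are made up for illustration.

```python
import hashlib
import json


def toy_hash(values):
    """Stable stand-in hash. This is NOT Prodigy's actual hashing function."""
    payload = json.dumps(values, sort_keys=True).encode("utf-8")
    return int(hashlib.md5(payload).hexdigest()[:8], 16)


def add_toy_hashes(task):
    """Attach an input-level hash (text only) and a task-level hash (text + label)."""
    hashed = dict(task)
    hashed["_input_hash"] = toy_hash({"text": task["text"]})
    hashed["_task_hash"] = toy_hash({"text": task["text"], "label": task.get("label")})
    return hashed


def skip_answered(stream, answered_task_hashes):
    """Yield only the questions whose task hash is not already in the dataset."""
    for task in stream:
        task = add_toy_hashes(task)
        if task["_task_hash"] not in answered_task_hashes:
            yield task


# One question has already been answered and stored in the dataset.
answered = {add_toy_hashes({"text": "Great product", "label": "POSITIVE"})["_task_hash"]}

stream = [
    {"text": "Great product", "label": "POSITIVE"},  # same question: skipped
    {"text": "Great product", "label": "SPAM"},      # same text, new label: shown
]
remaining = list(skip_answered(stream, answered))
print([task["label"] for task in remaining])
```

Under these assumptions, both stream items share an input hash, but only the first matches a stored task hash, so the second is still shown.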

If you're starting with a blank model, the same input data and the same dataset, you're not fully resuming: you're starting again, but you're skipping examples that are already in the dataset. If you want to start again with the same model state, you usually want to run textcat.batch-train and then load the updated model into textcat.teach. This should give you a slightly better version of the model you had previously updated in the loop.

jsnleong · July 23, 2019, 1:11am

Hi @ines, I am indeed using the same dataset to store the annotations. I have 679 annotations so far, and I believe almost 100-200 of them are duplicates, because I remember seeing the exact same text twice or even thrice.

I exported the dataset to a .jsonl file to verify. Picking one text to check, I found that the exact same chunk of text appears on line 308 and line 636 (I rejected it both times). Could there be any reason why?

The _input_hash and _task_hash are as follows:
Annotation 308:
_input_hash: 1962885252, _task_hash: 1962885252

Annotation 636:
_input_hash: 1962885252, _task_hash: 340824734

The _task_hash values are different, even though I have made sure the label is the same in both.

Any idea? Thanks!
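A check like the one described above can be scripted rather than done by eye. The following is a rough sketch that groups exported JSONL records by `_input_hash` and reports inputs that occur more than once. It assumes each record carries `_input_hash`, `_task_hash` and an `answer` field; the sample records are placeholders, and only the two hash pairs are taken from the post.

```python
import json
from collections import defaultdict


def group_by_input_hash(lines):
    """Map each _input_hash to (line number, _task_hash, answer) for its records."""
    groups = defaultdict(list)
    for line_no, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # ignore blank lines in the export
        record = json.loads(line)
        groups[record["_input_hash"]].append(
            (line_no, record["_task_hash"], record.get("answer"))
        )
    return groups


def repeated_inputs(groups):
    """Keep only the inputs that were annotated more than once."""
    return {h: rows for h, rows in groups.items() if len(rows) > 1}


# Placeholder records: the text, label and second record are invented.
sample = [
    '{"text": "A", "label": "X", "_input_hash": 1962885252, "_task_hash": 1962885252, "answer": "reject"}',
    '{"text": "B", "label": "X", "_input_hash": 7, "_task_hash": 8, "answer": "accept"}',
    '{"text": "A", "label": "X", "_input_hash": 1962885252, "_task_hash": 340824734, "answer": "reject"}',
]

duplicates = repeated_inputs(group_by_input_hash(sample))
for input_hash, rows in duplicates.items():
    print(input_hash, rows)
```

For a real export, an open file handle can be passed in place of `sample`, since iterating over a file yields its lines.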

ines · July 23, 2019, 10:12am

Thanks for looking into this! So the underlying reason Prodigy shows you the example is most likely that the _task_hashes are different. Those end up being used to decide whether two questions are the same or not. Now the main thing to investigate is why two identical examples would have received different hashes.

Can you share an example of those two annotations? You can leave out the "text" value if you want to (if you've confirmed that they're both identical).
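One way to narrow down why the two records hashed differently is to compare them key by key while ignoring the text. This is only a sketch: `diff_records` is a hypothetical helper, the records below are placeholders apart from the hash values quoted earlier in the thread, and the extra `score` key is invented purely to show what a difference would look like.

```python
MISSING = "<missing>"


def diff_records(first, second, ignore=("text",)):
    """Return {key: (value_in_first, value_in_second)} for every key that differs."""
    keys = (set(first) | set(second)) - set(ignore)
    return {
        key: (first.get(key, MISSING), second.get(key, MISSING))
        for key in sorted(keys)
        if first.get(key, MISSING) != second.get(key, MISSING)
    }


# Placeholder records; "score" is a made-up difference for illustration.
annotation_308 = {
    "text": "...", "label": "X", "answer": "reject",
    "_input_hash": 1962885252, "_task_hash": 1962885252,
}
annotation_636 = {
    "text": "...", "label": "X", "answer": "reject", "score": 0.5,
    "_input_hash": 1962885252, "_task_hash": 340824734,
}

differences = diff_records(annotation_308, annotation_636)
print(differences)
```

Any key that shows up here besides `_task_hash` is a candidate for what changed between the two sessions.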
