Same text appearing twice (with matches and without)

Yudi · August 26, 2020, 11:20am

Hi,

I'm using textcat.teach to annotate texts for a binary classification problem. I have defined the patterns to highlight with "--patterns ./patterns.jsonl". However, while annotating, I noticed that some texts appeared twice in the Prodigy interface: the first time with the defined patterns NOT highlighted, and the second time with them highlighted. This led me to annotate the same text twice.

Out of my 81 texts, this happened 3 times. Texts with multiple matched patterns are all fine, but 3 texts with only one matched pattern have this issue.

Can you help me with this? Thank you in advance!

Kind regards,
Yudi

ines · August 26, 2020, 5:21pm

Hi! Which version of Prodigy are you using? And if you're looking at the _input_hash and _task_hash generated for the two tasks that are identical (except for the match), are they the same or different?

It sounds like what might be happening is that for some reason, those tasks receive different hashes and Prodigy thinks they're different – when they should receive the same hashes, because their content should be treated as identical.
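To make the hash idea concrete, here is a minimal sketch of how two tasks with the same text can end up with the same input hash but different task hashes. This is only an illustration and not Prodigy's actual hashing code: `toy_hash`, the key lists, and the example tasks are all assumptions made up for this sketch.

```python
import hashlib
import json

# Assumed key sets for this sketch only (not Prodigy's real defaults).
INPUT_KEYS = ("text",)
TASK_KEYS = ("spans", "label")


def toy_hash(task, keys):
    """Hash only the selected keys of a task dict (illustrative helper)."""
    subset = {key: task[key] for key in keys if key in task}
    payload = json.dumps(subset, sort_keys=True).encode("utf8")
    return hashlib.md5(payload).hexdigest()


# Same text twice: once with a highlighted pattern match, once without.
pattern_task = {
    "text": "I love this",
    "label": "POSITIVE",
    "spans": [{"start": 2, "end": 6}],
}
model_task = {"text": "I love this", "label": "POSITIVE"}

same_input = toy_hash(pattern_task, INPUT_KEYS) == toy_hash(model_task, INPUT_KEYS)
same_task = toy_hash(pattern_task, INPUT_KEYS + TASK_KEYS) == toy_hash(
    model_task, INPUT_KEYS + TASK_KEYS
)
print(same_input, same_task)  # True False
```

In this toy version, the highlighted span is part of what gets hashed for the task, so the two versions count as different tasks even though the input text is identical.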

claycwardell · December 11, 2022, 9:56pm

I have noticed this too, although usually the first time with matched patterns and the second time without the highlights. I'm using the Reddit comments corpus. It's possible that in the underlying data these comments appear twice. Still, that wouldn't explain why the interface doesn't highlight the patterns the second time I get them. That said, it would be kind of odd if the data were consistently duplicated but only for things that matched patterns.

Is this maybe a bug, with Prodigy not properly marking a task as finished when it happens to have a matched pattern in it?
What should I do when I encounter this? Skip the second one?

claycwardell · December 11, 2022, 9:59pm

Evidence: Got this

Then a few tasks later, same session, this:

ryanwesslen · December 12, 2022, 4:43pm

hi @claycwardell!

Thanks for the background.

Ines' point is consistent with this: these tasks are likely getting different task hashes because they're different "tasks". The first is a pattern match (see PATTERN:40 in the bottom right), while the second is a model-based prediction (see SCORE:0.00 in the bottom right).

Can you look at these examples in your database and double-check that the task_hash values are different? You can either use db-out or pull the data directly from the database using get_dataset.
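If it helps, here is a rough sketch of that check, assuming you have exported the dataset to JSONL with db-out and that each record carries `_input_hash` and `_task_hash`. The helper name `find_repeated_inputs` and the sample records are made up for illustration.

```python
import json
from collections import defaultdict


def find_repeated_inputs(jsonl_lines):
    """Return input hashes that were saved with more than one task hash."""
    task_hashes = defaultdict(set)
    for line in jsonl_lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines in the export
        example = json.loads(line)
        task_hashes[example["_input_hash"]].add(example["_task_hash"])
    return {
        input_hash: sorted(hashes)
        for input_hash, hashes in task_hashes.items()
        if len(hashes) > 1
    }


# Made-up records standing in for an exported dataset.
sample = [
    '{"text": "a", "_input_hash": 1, "_task_hash": 10, "answer": "accept"}',
    '{"text": "a", "_input_hash": 1, "_task_hash": 11, "answer": "accept"}',
    '{"text": "b", "_input_hash": 2, "_task_hash": 20, "answer": "reject"}',
]
repeated = find_repeated_inputs(sample)
print(repeated)  # {1: [10, 11]}
```

For a real export you could pass an open file handle instead of the sample list. Any input hash that shows up in the result was annotated as more than one distinct task, which would match the behavior described above.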

Try changing exclude_by in your configuration to input. By default, Prodigy uses task_hash for deduplication, but if you change it to input, it'll dedupe by input_hash instead. You can do this either by modifying your prodigy.json or by adding config overrides.
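If you'd rather script the prodigy.json change than edit the file by hand, something like the following could work. It's a small sketch that merges the setting into an existing JSON file and keeps the other keys. The helper `set_exclude_by` is hypothetical, and the demo writes to a temporary file rather than your real config, so adjust the path for your setup.

```python
import json
import tempfile
from pathlib import Path


def set_exclude_by(config_path, value="input"):
    """Merge an "exclude_by" setting into a JSON config file, keeping other keys."""
    path = Path(config_path)
    config = json.loads(path.read_text(encoding="utf8")) if path.exists() else {}
    config["exclude_by"] = value
    path.write_text(json.dumps(config, indent=2), encoding="utf8")
    return config


# Demo on a throwaway file; "batch_size" is just a placeholder existing setting.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "prodigy.json"
    demo_path.write_text('{"batch_size": 10}', encoding="utf8")
    updated = set_exclude_by(demo_path)
    on_disk = json.loads(demo_path.read_text(encoding="utf8"))

print(updated)
```

The resulting file then contains "exclude_by": "input" alongside whatever settings were already there.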

claycwardell · December 13, 2022, 6:49am

Thanks @ryanwesslen!
