Same text appearing twice (with matches and without)

Hi,

I'm using textcat.teach to annotate texts for a binary classification problem. I have defined the patterns to highlight via "--patterns ./patterns.jsonl". However, while annotating, I noticed that some texts appeared twice in the Prodigy interface: the first time with the defined patterns NOT highlighted, and the second time with them highlighted. This led me to annotate the same text twice.

Out of my 81 texts this happened 3 times. Texts with multiple matched patterns are all alright, but 3 texts with only one matched pattern have this issue.

Can you help me with this? Thank you in advance!

Kind regards,
Yudi

Hi! Which version of Prodigy are you using? And if you're looking at the _input_hash and _task_hash generated for the two tasks that are identical (except for the match), are they the same or different?

It sounds like what might be happening is that for some reason, those tasks receive different hashes and Prodigy thinks they're different – when they should receive the same hashes, because their content should be treated as identical.

I have noticed this too, although in my case the text usually appears first with matched patterns and the second time without the highlights. I'm using the Reddit comments corpus. It's possible that these comments are in the underlying data twice. Still, that wouldn't explain why the interface doesn't highlight the patterns the second time I get them. It would also be odd for the data to be consistently duplicated, but only for texts that matched patterns.

  1. Is this maybe a bug, with prodigy not properly marking a task as finished when it happens to have a matched pattern in it?
  2. What should I do when I encounter this? Skip the second one?

Evidence: Got this

Then a few tasks later, same session, this:

hi @claycwardell!

Thanks for the background.

This is consistent with Ines' point: these are likely getting different task hashes because they're different "tasks". The first is a pattern match (see PATTERN:40 in the bottom right), while the second is a model-based prediction (see SCORE:0.00 in the bottom right).
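
To illustrate the idea, here is a toy sketch of how one text can share an input hash but end up with two task hashes. It is not Prodigy's actual hashing algorithm: `toy_hash`, the key lists, and the example text are all assumptions made up for illustration.

```python
import hashlib
import json


def toy_hash(task, keys):
    """Hash only the selected keys of a task (illustrative, not Prodigy's algorithm)."""
    subset = {k: task[k] for k in keys if k in task}
    payload = json.dumps(subset, sort_keys=True).encode("utf-8")
    return hashlib.sha1(payload).hexdigest()[:8]


# Assumed key sets: the input hash covers the raw text only,
# the task hash also covers what is being asked about that text.
INPUT_KEYS = ("text",)
TASK_KEYS = ("text", "label", "spans")

# Hypothetical tasks: same text, once from a pattern match (with a span),
# once from the model (no span).
pattern_task = {
    "text": "I love this phone",
    "label": "POSITIVE",
    "spans": [{"start": 2, "end": 6, "label": "POSITIVE"}],
}
model_task = {"text": "I love this phone", "label": "POSITIVE"}

same_input = toy_hash(pattern_task, INPUT_KEYS) == toy_hash(model_task, INPUT_KEYS)
same_task = toy_hash(pattern_task, TASK_KEYS) == toy_hash(model_task, TASK_KEYS)
print(same_input, same_task)  # True False
```

The highlighted span is part of the pattern-matched task but not of the model-scored one, so anything that hashes the full task sees two different questions about the same text.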

Can you look at these examples in your database and double-check that the task_hash values are different? You can either use db-out or pull the data directly from the database using get_dataset.
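
As a rough sketch of that check, assuming the dataset has been exported to a JSONL file with db-out and that each record carries `_input_hash` and `_task_hash`: the helper names and the sample records below are made up for illustration.

```python
import json
from collections import defaultdict


def load_jsonl(path):
    """Read one JSON object per line, e.g. from a db-out export."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


def find_split_duplicates(examples):
    """Map each _input_hash to its task hashes if it has more than one distinct _task_hash."""
    by_input = defaultdict(set)
    for eg in examples:
        by_input[eg["_input_hash"]].add(eg["_task_hash"])
    return {ih: sorted(ths) for ih, ths in by_input.items() if len(ths) > 1}


# Hypothetical records standing in for an exported dataset.
# With a real export: examples = load_jsonl("annotations.jsonl")
examples = [
    {"text": "comment A", "_input_hash": 111, "_task_hash": 900},
    {"text": "comment A", "_input_hash": 111, "_task_hash": 901},
    {"text": "comment B", "_input_hash": 222, "_task_hash": 902},
]

split = find_split_duplicates(examples)
print(split)  # {111: [900, 901]}
```

Any input hash that shows up in the result was annotated under more than one task hash, which is the situation described above.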

Try changing the exclude_by setting in your configuration to input. By default, Prodigy uses task_hash for deduplication. If you change it to input, it will deduplicate by input_hash instead of task_hash. You can do this either by modifying your prodigy.json or by adding config overrides.
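
Conceptually, the difference between the two settings looks something like the sketch below. This is a simplified stand-in, not Prodigy's internal code: `filter_seen` and the sample data are hypothetical.

```python
def filter_seen(stream, annotated, exclude_by="task"):
    """Yield only tasks whose chosen hash has not been annotated yet.

    exclude_by="task" compares _task_hash, exclude_by="input" compares _input_hash.
    """
    key = "_input_hash" if exclude_by == "input" else "_task_hash"
    seen = {eg[key] for eg in annotated}
    for task in stream:
        if task[key] not in seen:
            yield task


# Hypothetical data: the text was already annotated as a pattern match (task hash 900)
# and now comes back as a model prediction (task hash 901).
annotated = [{"text": "comment A", "_input_hash": 111, "_task_hash": 900}]
incoming = [{"text": "comment A", "_input_hash": 111, "_task_hash": 901}]

by_task = list(filter_seen(incoming, annotated, exclude_by="task"))
by_input = list(filter_seen(incoming, annotated, exclude_by="input"))
print(len(by_task), len(by_input))  # 1 0
```

Comparing by task hash lets the second version of the text through because it counts as a new question. Comparing by input hash filters it out because the text itself has already been seen.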

Thanks @ryanwesslen!