duplicate example

Hi, the Prodigy app is showing the same example multiple times. How do I stop this? Note that I am using PRODIGY_ALLOWED_SESSIONS, but only one user/session is currently active, and "feed_overlap": false is set in prodigy.json.

Thanks so much.

Hi Aisha,

there can be a few reasons why you might see duplicates. In Prodigy, the task and the example (the input) are hashed separately, and the combination of the two determines uniqueness. So it might be related to how your tasks are defined in your examples. It might also be that you're running a custom recipe that doesn't call filter_duplicates, or that calls get_stream without dedup.
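To make the two-hash idea concrete, here is a minimal standard-library sketch. It is not Prodigy's actual implementation: the helper names `stable_hash` and `add_hashes`, the choice of MD5, and the default key lists are assumptions for illustration only. It shows how two examples with the same text but different annotations share an input hash while getting different task hashes.

```python
import hashlib
import json


def stable_hash(obj):
    """Deterministic hash of a JSON-serialisable object (hypothetical helper)."""
    payload = json.dumps(obj, sort_keys=True, ensure_ascii=False)
    return hashlib.md5(payload.encode("utf-8")).hexdigest()


def add_hashes(eg, input_keys=("text",), task_keys=("spans", "relations", "label")):
    """Return a copy of the example with an input hash and a task hash.

    The key lists are illustrative assumptions, not Prodigy's defaults.
    """
    eg = dict(eg)
    # The input hash depends only on the raw input (here: the text).
    input_part = {k: eg[k] for k in input_keys if k in eg}
    eg["_input_hash"] = stable_hash(input_part)
    # The task hash combines the input hash with the annotation-related keys.
    task_part = {k: eg[k] for k in task_keys if k in eg}
    eg["_task_hash"] = stable_hash([eg["_input_hash"], task_part])
    return eg


a = add_hashes({"text": "She likes dogs.", "label": "POSS"})
b = add_hashes({"text": "She likes dogs.", "label": "COREF"})
c = add_hashes({"text": "She likes dogs.", "label": "POSS"})

print(a["_input_hash"] == b["_input_hash"])  # same text
print(a["_task_hash"] == b["_task_hash"])    # different label
print(a["_task_hash"] == c["_task_hash"])    # identical example
```

Under this scheme, `a` and `b` count as the same input but different tasks, while `a` and `c` are full duplicates.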

Could you share the Prodigy command that you're trying to run? Perhaps with a small example dataset so that I might reproduce locally?

Thanks for replying!

I tried using

stream = prodigy.get_stream(source, dedup = True)

and I got the following error:

  for eg in stream:
  File "/home/ubuntu/relation_extraction/custom_relACE05.py", line 121, in add_tokens
    for eg in stream:
  File "cython_src/prodigy/components/loaders.pyx", line 29, in _add_attrs
  File "cython_src/prodigy/components/filters.pyx", line 48, in filter_duplicates
KeyError: '_input_hash'

You can use prodigy.set_hashes to specify the hashes that will be used for deduplication. Did you apply that beforehand?

Got an example of an item in your source?
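For context on why the order matters, here is a small stand-in written with the standard library only. It is a sketch, not Prodigy code: `with_input_hash` and `dedup_by_input` are hypothetical helpers that mimic the behaviour suggested by the traceback, where the duplicate filter looks up `_input_hash` on every example and fails if the key was never set.

```python
import hashlib


def with_input_hash(stream):
    """Attach an '_input_hash' derived from the text (hypothetical helper)."""
    for eg in stream:
        eg = dict(eg)
        eg["_input_hash"] = hashlib.md5(eg["text"].encode("utf-8")).hexdigest()
        yield eg


def dedup_by_input(stream):
    """Yield only the first example seen for each input hash (hypothetical helper)."""
    seen = set()
    for eg in stream:
        key = eg["_input_hash"]  # raises KeyError if hashes were never set
        if key not in seen:
            seen.add(key)
            yield eg


examples = [{"text": "a"}, {"text": "b"}, {"text": "a"}]

# Filtering before hashing fails in the same way as the traceback above.
try:
    list(dedup_by_input(examples))
    failed = False
except KeyError:
    failed = True

# Hashing first, then filtering, drops the repeated example.
unique = list(dedup_by_input(with_input_hash(examples)))
print(failed, [eg["text"] for eg in unique])
```

In a real recipe the hashing step would be prodigy.set_hashes applied to each example before the stream reaches the duplicate filter.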

It consists of three keys: tokens, text, and relations_x. Each entry in relations_x contains arg1, arg2 and the relation's type (in addition to each argument's text, type and span). The custom recipe adds entity spans and labels (NER labels taken from the arg1 and arg2 types) and then adds the relations from relations_x.
The output fed to the UI is similar to this:

{
  "text": "My mother’s name is Sasha Smith. She likes dogs and pedigree cats.",
  "tokens": [
    {"text": "My", "start": 0, "end": 2, "id": 0, "label": "O"},
    {"text": "mother", "start": 3, "end": 9, "id": 1, "label": "LOC"},
    ....
    {"text": "pedigree", "start": 52, "end": 60, "id": 12, "ws": true}
  ],
  "relations": [
    {
      "head": 0,
      "child": 1,
      "label": "POSS",
      "head_span": {"start": 0, "end": 2, "token_start": 0, "token_end": 0, "label": null},
      "child_span": {"start": 3, "end": 9, "token_start": 1, "token_end": 1, "label": null}
    },
    {
      "head": 1,
      "child": 8,
      "label": "COREF",
      "head_span": {"start": 3, "end": 9, "token_start": 1, "token_end": 1, "label": null},
      "child_span": {"start": 33, "end": 36, "token_start": 8, "token_end": 8, "label": null}
    },
    {
      "head": 9,
      "child": 13,
      "label": "OBJECT",
      "head_span": {"start": 37, "end": 42, "token_start": 9, "token_end": 9, "label": null},
      "child_span": {"start": 52, "end": 65, "token_start": 12, "token_end": 13, "label": "NP"}
    }
  ]
}

Here's an example of the source input: it's in Arabic.

`{"tokens": ["\u0623\u0639\u0644\u0646", "\u0648\u0632\u064a\u0631", "\u062f\u0627\u062e\u0644\u064a\u0629", "\u0643\u0648\u062f\u064a\u0641\u0648\u0627\u0631", "\u0627\u0633\u062a\u0639\u0627\u062f.....],

"text": "\u0623\u0639\u0644\u0646 \u0648\u0632\u064a\u0631 \u062f\u0627\u062e\u0644\u064a\u0629 \u0643\u0648\u062f\u064a\u0641\u0648\u0627\u0631 \u0627\u0633\u062a\u0639\u0627\u062f\u0629 .....",

"relations_x": [
{"label": "PART-WHOLE:Geographical",
"arg1-span": [11, 12],
"arg1-text": "\u0623\u0631\u062c\u0627\u0621",
"arg1-type": "LOC:Region-General",
"arg2-span": [12, 13],
"arg2-text": "\u0627\u0644\u0628\u0644\u0627\u062f",
"arg2-type": "GPE:Nation"},
{"label": "ORG-AFF:Employment",
"arg1-span": [1, 2],
"arg1-text": "\u0648\u0632\u064a\u0631",
"arg1-type": "PER:Individual", "arg2-span": [2, 3],
"arg2-text": "\u062f\u0627\u062e\u0644\u064a\u0629",
"arg2-type": "ORG:Government"}, ...... }]}`

@koaning never mind, I think my dataset might already contain duplicate examples :confused:
Thanks!

Ah! Yeah, that's also a possibility :slight_smile: !
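A quick way to check for that before loading the data is to count repeated texts in the source file. This is a minimal sketch that assumes a JSONL source with one object per line and a "text" key, as in the example above; `duplicate_texts` and the sample lines are made up for illustration.

```python
import json
from collections import Counter


def duplicate_texts(lines, key="text"):
    """Map each repeated value of `key` to how often it occurs (hypothetical helper)."""
    counts = Counter(json.loads(line)[key] for line in lines if line.strip())
    return {text: n for text, n in counts.items() if n > 1}


# Stand-in for the lines of a JSONL source file.
sample_lines = [
    '{"text": "first sentence"}',
    '{"text": "second sentence"}',
    '{"text": "first sentence"}',
]

dupes = duplicate_texts(sample_lines)
print(dupes)
```

For a real file, the same function can be given an open file handle instead of the list, since it only iterates over lines.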