duplicate example

aisha_harbi · September 8, 2022, 10:04am

Hi, the Prodigy app is showing the same example multiple times. How do I stop this? Note that I am using PRODIGY_ALLOWED_SESSIONS, but only one user/session is currently active, and "feed_overlap": false is set in prodigy.json.

Thanks so much.

koaning · September 8, 2022, 12:37pm

Hi Aisha,

there can be a few reasons why you might see duplicates. In Prodigy, the input and the task of each example are hashed separately, and the combination of the two determines uniqueness. So it might be related to how the tasks are defined in your examples. It might also be that you're running a custom recipe without calling filter_duplicates, or calling get_stream without dedup.
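
To make the hashing idea concrete, here is a minimal pure-Python sketch. It is not Prodigy's actual implementation: the helper `toy_hash` and the choice of keys are assumptions made purely for illustration. It shows how two examples can share an input hash (same text) while having different task hashes (different relations), so a filter that looks at the task hash treats them as distinct.

```python
import hashlib
import json

def toy_hash(example, keys):
    """Hash the selected keys of an example (illustrative, not Prodigy's algorithm)."""
    payload = json.dumps({k: example.get(k) for k in keys}, sort_keys=True)
    return hashlib.md5(payload.encode("utf8")).hexdigest()

INPUT_KEYS = ("text",)               # assumption: the raw input shown to the annotator
TASK_KEYS = ("text", "relations")    # assumption: the input plus what is being asked

a = {"text": "She likes dogs.", "relations": [{"head": 0, "child": 2, "label": "OBJECT"}]}
b = {"text": "She likes dogs.", "relations": []}

same_input = toy_hash(a, INPUT_KEYS) == toy_hash(b, INPUT_KEYS)
same_task = toy_hash(a, TASK_KEYS) == toy_hash(b, TASK_KEYS)
print(same_input, same_task)
```

If the same text shows up with slightly different pre-filled annotations, this is one way it could pass a task-level duplicate check.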

Could you share the Prodigy command that you're trying to run? Perhaps with a small example dataset so that I might reproduce locally?

aisha_harbi · September 8, 2022, 1:21pm

Thanks for replying!

I tried using

stream = prodigy.get_stream(source, dedup = True)

and I got the following error:

  for eg in stream:
  File "/home/ubuntu/relation_extraction/custom_relACE05.py", line 121, in add_tokens
    for eg in stream:
  File "cython_src/prodigy/components/loaders.pyx", line 29, in _add_attrs
  File "cython_src/prodigy/components/filters.pyx", line 48, in filter_duplicates
KeyError: '_input_hash'

koaning · September 8, 2022, 1:24pm

You can use prodigy.set_hashes to specify the hashes that will be used for deduplication. Did you apply that beforehand?
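
The KeyError above suggests that the examples reach the duplicate filter before they carry an `_input_hash`. As a rough sketch of the ordering (hash first, then filter), here is a pure-Python stand-in. `add_hashes` and `drop_duplicates` are hypothetical helpers, not Prodigy functions; in a real recipe you would call prodigy.set_hashes on each example and then use Prodigy's own filter_duplicates instead.

```python
import hashlib
import json

def add_hashes(eg, input_keys=("text",), task_keys=("text", "relations")):
    """Hypothetical stand-in for hashing: attach _input_hash and _task_hash."""
    def _h(keys):
        payload = json.dumps({k: eg.get(k) for k in keys}, sort_keys=True)
        return hashlib.md5(payload.encode("utf8")).hexdigest()
    eg = dict(eg)  # copy so the caller's dict is not modified
    eg["_input_hash"] = _h(input_keys)
    eg["_task_hash"] = _h(task_keys)
    return eg

def drop_duplicates(stream, key="_task_hash"):
    """Yield the first example per hash; raises KeyError if the hash is missing."""
    seen = set()
    for eg in stream:
        h = eg[key]
        if h not in seen:
            seen.add(h)
            yield eg

raw = [{"text": "A"}, {"text": "B"}, {"text": "A"}]
hashed = (add_hashes(eg) for eg in raw)          # step 1: hash every example
unique = list(drop_duplicates(hashed, key="_input_hash"))  # step 2: filter
print([eg["text"] for eg in unique])
```

Skipping step 1 and passing `raw` straight to `drop_duplicates` reproduces the same kind of KeyError as in the traceback.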

Got an example of an item in your source?

aisha_harbi · September 8, 2022, 1:44pm

It consists of 3 keys: tokens, text, and relations_x. Each item in relations_x holds arg1, arg2 and the relation's type (in addition to each arg's text, type and span). The custom recipe adds entity spans and labels (NER labels taken from the arg1 and arg2 types) and then adds the relations from relations_x.
The output fed to the UI is similar to this:

{
  "text": "My mother’s name is Sasha Smith. She likes dogs and pedigree cats.",
  "tokens": [
    {"text": "My", "start": 0, "end": 2, "id": 0, "label": "O"},
    {"text": "mother", "start": 3, "end": 9, "id": 1, "label": "LOC"},
    ....
    {"text": "pedigree", "start": 52, "end": 60, "id": 12, "ws": true},
    ....
  ],
  "relations": [
    {
      "head": 0,
      "child": 1,
      "label": "POSS",
      "head_span": {"start": 0, "end": 2, "token_start": 0, "token_end": 0, "label": null},
      "child_span": {"start": 3, "end": 9, "token_start": 1, "token_end": 1, "label": null}
    },
    {
      "head": 1,
      "child": 8,
      "label": "COREF",
      "head_span": {"start": 3, "end": 9, "token_start": 1, "token_end": 1, "label": null},
      "child_span": {"start": 33, "end": 36, "token_start": 8, "token_end": 8, "label": null}
    },
    {
      "head": 9,
      "child": 13,
      "label": "OBJECT",
      "head_span": {"start": 37, "end": 42, "token_start": 9, "token_end": 9, "label": null},
      "child_span": {"start": 52, "end": 65, "token_start": 12, "token_end": 13, "label": "NP"}
    }
  ]
}

Here's an example of the source input: it's in Arabic.

`{"tokens": ["\u0623\u0639\u0644\u0646", "\u0648\u0632\u064a\u0631", "\u062f\u0627\u062e\u0644\u064a\u0629", "\u0643\u0648\u062f\u064a\u0641\u0648\u0627\u0631", "\u0627\u0633\u062a\u0639\u0627\u062f.....],

"text": "\u0623\u0639\u0644\u0646 \u0648\u0632\u064a\u0631 \u062f\u0627\u062e\u0644\u064a\u0629 \u0643\u0648\u062f\u064a\u0641\u0648\u0627\u0631 \u0627\u0633\u062a\u0639\u0627\u062f\u0629 .....",

"relations_x": [
{"label": "PART-WHOLE:Geographical",
"arg1-span": [11, 12],
"arg1-text": "\u0623\u0631\u062c\u0627\u0621",
"arg1-type": "LOC:Region-General",
"arg2-span": [12, 13],
"arg2-text": "\u0627\u0644\u0628\u0644\u0627\u062f",
"arg2-type": "GPE:Nation"},
{"label": "ORG-AFF:Employment",
"arg1-span": [1, 2],
"arg1-text": "\u0648\u0632\u064a\u0631",
"arg1-type": "PER:Individual", "arg2-span": [2, 3],
"arg2-text": "\u062f\u0627\u062e\u0644\u064a\u0629",
"arg2-type": "ORG:Government"}, ...... }]}`

aisha_harbi · September 8, 2022, 2:12pm

@koaning never mind, I think my dataset might already contain duplicate examples.
Thanks!

koaning · September 8, 2022, 2:13pm

Ah! Yeah, that's also a possibility!
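
A quick way to check for that before starting the server could look like the sketch below. It assumes the source is JSONL with a "text" key, as in the example above; the helper name `count_duplicate_texts` is made up for illustration and is not part of Prodigy.

```python
import json
from collections import Counter

def count_duplicate_texts(lines, key="text"):
    """Return {text: count} for every value of `key` that appears more than once."""
    counts = Counter(json.loads(line)[key] for line in lines if line.strip())
    return {text: n for text, n in counts.items() if n > 1}

# Small in-memory sample; for a real file, pass the open file handle instead.
sample = ['{"text": "A"}', '{"text": "B"}', '{"text": "A"}']
dupes = count_duplicate_texts(sample)
print(dupes)
```

An empty result would mean no two lines share the same text, which points back to the hashing or recipe setup instead.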
