Duplicates in ner.manual


We are using ner.manual. However, we have a lot of duplicates.

Roughly the first 1000+x annotations were made twice: the examples appeared twice in Prodigy and at first no one noticed. What stands out is that the first "set" (the first half of the duplicates) has the _view_id 'NaN' (None/NULL), while the second half has "ner_manual".

However, we later annotated into the same dataset using ner.make-gold. This only produced duplicates, because it started again from the beginning of the input file.

After we dropped the duplicates and re-inserted the data into an empty dataset, ner.manual started anew from the beginning of the input CSV file.

How can we handle duplicates and prevent them? They cost us a lot of time.

(The annotations have been done since April and we have constantly upgraded Prodigy; we currently run version 1.8.3.)


Thanks for the report! I think this might explain what's going on: the _view_id property was added more recently and it seems like it's taken into account when generating the hashes to decide whether two examples are the same or not. I think it probably makes sense to exclude the _view_id from the hashing, since I can't think of many use cases where you would want to treat the same task displayed in a different interface as different questions.

In the meantime, you can export the data, make sure all of the examples have "_view_id": "ner_manual" set, call prodigy.set_hashes(example, overwrite=True) on each example to make sure the hashes are updated, and then re-import it. (For the next release, we'll also add a --rehash flag to db-in that takes care of this automatically.)
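The workaround above could be sketched roughly as follows. This is only a minimal sketch: `normalize_view_id` and its `rehash` parameter are hypothetical names for illustration, and the examples are assumed to be plain dicts loaded from an exported JSONL file.

```python
def normalize_view_id(examples, view_id="ner_manual", rehash=None):
    """Yield copies of exported examples that all share the same _view_id.

    `rehash` is an optional callable (hypothetical parameter) applied to
    each example after the _view_id has been set, e.g. to recompute hashes.
    """
    for eg in examples:
        fixed = dict(eg)  # shallow copy, the exported data stays untouched
        fixed["_view_id"] = view_id
        if rehash is not None:
            fixed = rehash(fixed)
        yield fixed


# Small in-memory demo; real data would come from the exported file.
exported = [
    {"text": "first example", "_view_id": None},
    {"text": "first example", "_view_id": "ner_manual"},
]
fixed_examples = list(normalize_view_id(exported))
```

With Prodigy installed, you would pass `rehash=lambda eg: prodigy.set_hashes(eg, overwrite=True)` so the hashes are recomputed after the `_view_id` has been unified, and then re-import the result into a fresh dataset.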

The duplicates between ner.manual and ner.make-gold are currently expected, because Prodigy filters based on the task hashes, i.e. the question. So if you annotate an example once with ner.manual (no pre-highlights) and then with ner.make-gold (pre-highlighted spans), both examples will have the same input hash but different task hashes, so they're treated as different questions on the same text.
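To make the distinction concrete, here is a toy illustration of the idea. It is not Prodigy's actual hashing algorithm: `toy_hash` and `toy_hashes` are hypothetical helpers that only mimic the principle that the input hash depends on the text alone, while the task hash also depends on the suggested spans.

```python
import hashlib
import json


def toy_hash(obj):
    """Stable short hash of a JSON-serializable object (illustration only)."""
    data = json.dumps(obj, sort_keys=True).encode("utf8")
    return hashlib.sha1(data).hexdigest()[:8]


def toy_hashes(eg):
    """Return (input_hash, task_hash) for an example dict."""
    # The input hash only looks at the raw text.
    input_hash = toy_hash({"text": eg["text"]})
    # The task hash also takes the pre-highlighted spans into account.
    task_hash = toy_hash({"input": input_hash, "spans": eg.get("spans", [])})
    return input_hash, task_hash


# Same text, once without and once with a pre-highlighted span.
manual = {"text": "Apple hires in Berlin"}
gold = {
    "text": "Apple hires in Berlin",
    "spans": [{"start": 0, "end": 5, "label": "ORG"}],
}

manual_input, manual_task = toy_hashes(manual)
gold_input, gold_task = toy_hashes(gold)
same_input = manual_input == gold_input  # same text
same_task = manual_task == gold_task     # different question
```

Here `same_input` is true while `same_task` is false, which is why a filter based on task hashes lets the second example through.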

For the next release, we want to add a "filter_by" config setting that lets recipes specify whether to filter by task hashes (e.g. in binary recipes where you want to answer many accept/reject questions about the same text) or by input hashes (where you only want to annotate an example once).

In the meantime, you could add your own filter like this:

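def filter_stream(stream):
    # Input hashes of all examples that have already been sent out
    seen = set()
    for eg in stream:
        # Get the hash identifying the original input, e.g. the text
        input_hash = eg["_input_hash"]
        if input_hash not in seen:
            # Remember the hash so later examples with the same input are skipped
            seen.add(input_hash)
            yield eg

stream = filter_stream(stream)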

Okay. Thank you, Ines, for the explanation!


@blume Now shipped in v1.9 :smiley: ner.manual now sets "exclude_by": "input" by default, meaning that Prodigy will use only the raw text to decide whether two questions are identical. So if you've annotated a text before (with suggestions or without), you shouldn't be seeing it again (even if the suggestions are different).
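Conceptually, excluding by input means something like the sketch below. This only illustrates the behaviour described above and is not Prodigy's internal implementation; `exclude_annotated` is a hypothetical helper, and the set of already annotated input hashes is assumed to be available from your dataset.

```python
def exclude_annotated(stream, annotated_input_hashes):
    """Skip examples whose input (raw text) has already been annotated."""
    seen = set(annotated_input_hashes)
    for eg in stream:
        input_hash = eg["_input_hash"]
        if input_hash in seen:
            # Already annotated, with or without suggestions
            continue
        seen.add(input_hash)
        yield eg


# Demo with made-up hash values
already_done = {101, 102}
incoming = [
    {"text": "a", "_input_hash": 101},
    {"text": "b", "_input_hash": 103},
    {"text": "b", "_input_hash": 103},
]
remaining = list(exclude_annotated(incoming, already_done))
```

In this demo only one example with the input hash 103 remains, regardless of which suggestions were attached to the earlier annotations.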