Duplicates in ner.manual

ines · December 4, 2019, 8:40pm

Thanks for the report! I think this might explain what's going on: the _view_id property was added more recently and it seems like it's taken into account when generating the hashes to decide whether two examples are the same or not. I think it probably makes sense to exclude the _view_id from the hashing, since I can't think of many use cases where you would want to treat the same task displayed in a different interface as different questions.

In the meantime, you can export the data, make sure all of the examples have "_view_id": "ner_manual" set, call prodigy.set_hashes(example, overwrite=True) on each example to make sure the hashes are updated, and then re-import it. (For the next release, we'll also add a --rehash flag to db-in that takes care of this automatically.)
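As a rough sketch of that workaround (not an official script): the helper below sets the view ID on each exported example and then passes it through a rehash callable. The name `normalize_examples` and the `rehash` parameter are made up for illustration. In practice you'd pass a function that wraps `prodigy.set_hashes(example, overwrite=True)`, as shown in the trailing comment.

```python
def normalize_examples(examples, view_id="ner_manual", rehash=None):
    """Set the same _view_id on every example, then optionally rehash it.

    `rehash` is a hypothetical hook: any callable that takes an example
    dict and returns it with updated hashes.
    """
    for eg in examples:
        eg = dict(eg)  # shallow copy so the caller's dicts stay untouched
        eg["_view_id"] = view_id
        if rehash is not None:
            eg = rehash(eg)
        yield eg


# With Prodigy installed, the hook would look something like this:
#
#   from prodigy import set_hashes
#   fixed = normalize_examples(
#       exported_examples,  # the examples you loaded from your export
#       rehash=lambda eg: set_hashes(eg, overwrite=True),
#   )
```

You'd then write the resulting examples back out and re-import them into a dataset.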

This is currently expected, because Prodigy will filter based on the task hashes, i.e. the question. So if you annotate an example once with ner.manual (no pre-highlights) and then with ner.make-gold (pre-highlighted spans), both examples will have the same input hash but different task hashes, so they're treated as different questions on the same text.
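To make the distinction concrete, here's a toy illustration. This is not Prodigy's actual hashing logic, and `toy_hash` and `toy_hashes` are invented names. It only shows the idea: the input hash depends on the text alone, while the task hash also depends on what's being asked about it, here the pre-highlighted spans.

```python
import hashlib
import json


def toy_hash(data):
    # Stand-in hash function, NOT the one Prodigy uses internally
    payload = json.dumps(data, sort_keys=True).encode("utf8")
    return hashlib.sha256(payload).hexdigest()


def toy_hashes(eg):
    # Input hash: based on the raw input only
    input_hash = toy_hash({"text": eg["text"]})
    # Task hash: based on the input plus the question (the spans)
    task_hash = toy_hash({"input": input_hash, "spans": eg.get("spans", [])})
    return input_hash, task_hash


plain = {"text": "Apple is a company"}
highlighted = {
    "text": "Apple is a company",
    "spans": [{"start": 0, "end": 5, "label": "ORG"}],
}

plain_input, plain_task = toy_hashes(plain)
hl_input, hl_task = toy_hashes(highlighted)

same_input = plain_input == hl_input  # same text, so same input hash
same_task = plain_task == hl_task     # different spans, so different task hash
```

A filter based on task hashes would let both examples through, while a filter based on input hashes would keep only the first.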

For the next release, we want to add a "filter_by" config setting that lets recipes specify whether to filter by task hashes (e.g. in binary recipes where you want to answer many accept/reject questions about the same text) or by input hashes (where you only want to annotate an example once).

In the meantime, you could add your own filter like this:

def filter_stream(stream):
    seen = set()
    for eg in stream:
        # Get the hash identifying the original input, e.g. the text
        input_hash = eg["_input_hash"]
        if input_hash not in seen:
            yield eg
        seen.add(input_hash)

stream = filter_stream(stream)
