Duplicates in ner.manual

Thanks for the report! I think this might explain what's going on: the _view_id property was added more recently and it seems like it's taken into account when generating the hashes to decide whether two examples are the same or not. I think it probably makes sense to exclude the _view_id from the hashing, since I can't think of many use cases where you would want to treat the same task displayed in a different interface as different questions.

In the meantime, you can export the data, make sure all of the examples have "_view_id": "ner_manual" set, call prodigy.set_hashes(example, overwrite=True) on each example to make sure the hashes are updated, and then re-import it. (For the next release, we'll also add a --rehash flag to db-in that takes care of this automatically.)
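The workaround above could be sketched roughly like this. It's only a minimal sketch: `fix_view_ids` and its `rehash` argument are hypothetical names made up for illustration, not part of Prodigy, and it assumes you've exported the dataset to JSONL first.

```python
import json


def fix_view_ids(lines, rehash, view_id="ner_manual"):
    """Set the same "_view_id" on every exported example, then rehash it.

    lines:  iterable of JSONL strings (one exported example per line)
    rehash: callable that takes an example dict and returns it with
            updated hashes (hypothetical hook, see the note below)
    """
    fixed = []
    for line in lines:
        line = line.strip()
        if not line:
            # Skip blank lines in the export
            continue
        eg = json.loads(line)
        eg["_view_id"] = view_id
        fixed.append(rehash(eg))
    return fixed
```

With Prodigy installed, you'd pass something like `lambda eg: prodigy.set_hashes(eg, overwrite=True)` as `rehash`, write the results back out as JSONL and re-import them into a fresh dataset.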

This is currently expected, because Prodigy will filter based on the task hashes, i.e. the question. So if you annotate an example once with ner.manual (no pre-highlights) and then with ner.make-gold (pre-highlighted spans), both examples will have the same input hash but different task hashes, so they're treated as different questions on the same text.
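To make the difference between the two hashes a bit more concrete, here's a toy illustration. Note that `toy_hash` and `toy_hashes` are made-up stand-ins and not how Prodigy actually computes its hashes. The only assumption is the general idea: the input hash depends on the text alone, while the task hash also depends on what's being asked, e.g. the pre-highlighted spans.

```python
import hashlib
import json


def toy_hash(data):
    # Stable hash of a JSON-serializable value (toy stand-in, NOT Prodigy's algorithm)
    payload = json.dumps(data, sort_keys=True).encode("utf8")
    return hashlib.sha256(payload).hexdigest()


def toy_hashes(eg):
    # Input hash: based on the raw input only, here the text
    input_hash = toy_hash({"text": eg["text"]})
    # Task hash: based on the input plus the question, here the spans
    task_hash = toy_hash({"input": input_hash, "spans": eg.get("spans", [])})
    return input_hash, task_hash


# Same text, once without and once with pre-highlighted spans
manual = {"text": "Apple hires in Berlin"}
gold = {"text": "Apple hires in Berlin",
        "spans": [{"start": 0, "end": 5, "label": "ORG"}]}

manual_input, manual_task = toy_hashes(manual)
gold_input, gold_task = toy_hashes(gold)

print(manual_input == gold_input)  # same text, same input hash
print(manual_task == gold_task)    # different spans, different task hash
```

So a filter that only looks at task hashes will let both examples through, even though they share the same text.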

For the next release, we want to add a "filter_by" config setting that lets recipes specify whether to filter by task hashes (e.g. in binary recipes where you want to answer many accept/reject questions about the same text) or by input hashes (where you only want to annotate each example once).

In the meantime, you could add your own filter like this:

def filter_stream(stream):
    # Keep track of the input hashes we've already sent out
    seen = set()
    for eg in stream:
        # Get the hash identifying the original input, e.g. the text
        input_hash = eg["_input_hash"]
        if input_hash not in seen:
            seen.add(input_hash)
            yield eg

stream = filter_stream(stream)