ner silver-to-gold resulted in annotating the same objects multiple times

Hi everyone,

I am writing because of an issue regarding a NER silver-to-gold workflow.
What we aim to achieve is basically to merge standard and binary annotation into a gold dataset.

The people annotating the sentences have been running a command like this:

!python -m prodigy ner.silver-to-gold NER_gold_set silver_set ./tmp_model

(Note that all the silver sets had previously been merged manually into a single JSONL file, which was then imported into the Prodigy database under the name "silver_set".)

Given the high number of sentences, the annotators have approached the task in multiple sessions.
However, they reported that every time they restarted the process, Prodigy presented them with many sentences they thought they had already annotated.

I checked the exported JSONL of their work on the gold dataset so far, and there are indeed lots of duplicates (on average, the same sentence has been annotated 3 to 4 times).
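
For reference, a check of this kind can be sketched roughly as follows. It assumes the export has one JSON task per line with a "text" field; the function name and the file name in the comment are only placeholders, not part of Prodigy.

```python
import json
from collections import Counter


def count_duplicate_texts(jsonl_lines):
    """Return {text: count} for every text that appears more than once."""
    counts = Counter()
    for line in jsonl_lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines in the export
        counts[json.loads(line)["text"]] += 1
    return {text: n for text, n in counts.items() if n > 1}


# In practice the lines would come from the exported file, e.g.
# open("gold_export.jsonl", encoding="utf8") (hypothetical file name).
sample_lines = [
    '{"text": "Apple is a company", "answer": "accept"}',
    '{"text": "Berlin is a city", "answer": "accept"}',
    '{"text": "Apple is a company", "answer": "accept"}',
]
duplicates = count_duplicate_texts(sample_lines)
```

Grouping by text rather than by hash is a deliberate simplification here: it counts how often the same sentence was shown, regardless of how its spans were edited.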

How can we avoid this kind of duplication?

Thanks in advance!
atb
giovanni

Hi I am Valerio, the other person who is in charge of the implementation of the model in this project.

After talking with the annotators, we realized that the problem is that the ner.silver-to-gold recipe has no option similar to --exclude in ner.manual, which lets you resume the annotation process from where you left off.

Ideally, we would not like to waste the work of the annotators and force them to re-annotate the sentences that they have already annotated. Is there a way we can sort this thing out?

Best wishes and many thanks from my side as well

Best

VV

Hi! In theory, the examples in the current dataset should be excluded by default, so unless you're changing datasets in between, it should work as expected. I wonder if there might be something else going on then – we had a recent report about problems with exclusion logic in recent versions, so we're currently investigating that.

In the meantime, you could easily add the exclude logic to the recipe yourself. If you run prodigy stats, you can find the location of your Prodigy installation. You can then edit the recipe in recipes/ner.py and add a function like this to the very end:

def exclude_examples(stream):
    # Hashes of all tasks already saved in the current (gold) dataset
    task_hashes = DB.get_task_hashes(dataset)
    for eg in stream:
        eg = set_hashes(eg)
        if eg["_task_hash"] not in task_hashes:
            yield eg

stream = exclude_examples(stream)
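
To illustrate the idea independently of Prodigy: the filter collects the hashes of tasks that were already annotated and skips any incoming task whose hash is in that set. The sketch below is only an approximation under stated assumptions. The task_hash helper is a hypothetical stand-in for Prodigy's own hashing (here it hashes the text plus the spans), and a plain list stands in for the database.

```python
import hashlib
import json


def task_hash(eg):
    # Hypothetical stand-in for Prodigy's hashing: hash the text plus spans
    payload = json.dumps(
        {"text": eg["text"], "spans": eg.get("spans", [])}, sort_keys=True
    )
    return hashlib.sha1(payload.encode("utf8")).hexdigest()


def exclude_seen(stream, annotated):
    # Build the set of known hashes once, then filter the stream lazily
    seen = {task_hash(eg) for eg in annotated}
    for eg in stream:
        if task_hash(eg) not in seen:
            yield eg


# A plain list stands in for the examples already saved in the gold dataset
annotated = [{"text": "Apple is a company"}]
stream = [{"text": "Apple is a company"}, {"text": "Berlin is a city"}]
remaining = list(exclude_seen(stream, annotated))
```

Because exclude_seen is a generator, nothing is filtered until the stream is consumed, which matches how a recipe passes its stream along.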

Hi Ines,

thanks for your reply.

I checked with our collaborators, and they told me they are running a less recent version of prodigy (1.10.ish).

I had my colleague send over a copy of his ner.py file (I pasted it below).
They tried to include the function you suggested, but it doesn't appear to work. They tried pasting it at the very end of the silver-to-gold recipe, and also before the 'return' as in the example posted below, but neither of these seems to work.

I am sorry if this seems trivial, but could you please point us to the right way to implement your code?

thanks a lot

Valerio

@recipe(
    "ner.silver-to-gold",
    # fmt: off
    dataset=("Dataset to save annotations to", "positional", None, str),
    silver_sets=("Comma-separated datasets to convert", "positional", None, split_string),
    spacy_model=("Loadable spaCy model with an entity recognizer", "positional", None, str),
    label=("Comma-separated label(s) to annotate or text file with one label per line", "option", "l", get_labels),
    # fmt: on
)
def silver_to_gold(
    dataset: str,
    silver_sets: List[str],
    spacy_model: str,
    label: Optional[List[str]] = None,
) -> Dict[str, Any]:
    """
    Take existing "silver" datasets with binary accept/reject annotations,
    merge the annotations to find the best possible analysis given the
    constraints defined in the annotations, and manually edit it to create
    a perfect and complete "gold" dataset.
    """

    def filter_stream(stream: Iterable[dict]) -> Iterable[dict]:
        # make_best uses all labels in the model, so we need to filter by label here
        for eg in stream:
            eg["spans"] = [s for s in eg.get("spans", []) if s["label"] in labels]
            yield eg

    log("RECIPE: Starting recipe ner.silver-to-gold", locals())
    DB = connect()
    data = []
    for set_id in silver_sets:
        if set_id not in DB:
            msg.fail(f"Can't find input dataset '{set_id}' in database", exits=1)
        examples = DB.get_dataset(set_id)
        data += examples
    log(f"RECIPE: Loaded {len(data)} examples from {len(silver_sets)} dataset(s)")
    nlp = spacy.load(spacy_model)
    labels = label
    if not labels:
        labels = get_labels_from_ner(nlp)
        if not labels:
            msg.fail("No --label argument set and no labels found in model", exits=1)
        msg.text(f"Using {len(labels)} labels from model: {', '.join(labels)}")
    # Initialize Prodigy's entity recognizer model, which uses beam search to
    # find all possible analyses and outputs (score, example) tuples,
    # then merge all annotations and find the best possible analyses
    model = EntityRecognizer(nlp, label=labels)
    stream = model.make_best(data)
    stream = filter_stream(stream)
    stream = add_tokens(nlp, stream)  # add "tokens" for faster annotation

    def exclude_examples(stream):
        # Skip tasks whose hash is already saved in the gold dataset
        task_hashes = DB.get_task_hashes(dataset)
        for eg in stream:
            eg = set_hashes(eg)
            if eg["_task_hash"] not in task_hashes:
                yield eg

    stream = exclude_examples(stream)

    return {
        "dataset": dataset,
        "view_id": "ner_manual",
        "stream": stream,
        "config": {"lang": nlp.lang, "labels": labels},
    }