Duplicated annotations when changing versions

Dear Prodigy team,

I have a question regarding duplicated annotations in multi-session mode.

Our setup is as follows: we have several annotators and run sessions on different days on the same input data, saving all annotations to the same dataset (we stop the server each evening and restart it the next morning).
We want each annotation task to be labeled only once: if an example was labeled on day 1 by one of the annotators, it should not be presented to any other annotator, neither on that day nor on any day in the future.

Until recently we had been running our own annotation recipe on Prodigy version 1.10.7.
To ensure that each example is annotated only once, we set the following in our prodigy.json:

"feed_overlap": false

In the past we did not get many duplicated annotations, but since we upgraded to Prodigy version 1.11.8 last week we are seeing a lot of duplicates. Especially after we stop and restart the server, the stream seems to start from the beginning again.

We used to create our stream with

stream = JSONL(source)
stream = add_tokens(nlp, stream)

which did not seem to generate duplicates in the old version.

Do we instead need to use

stream = get_stream(
source, loader=loader, rehash=True, dedup=True, input_key="text"
)
stream = add_tokens(nlp, stream)

to ensure that we don't get the same examples presented for annotation multiple times?

Or do we need to use the

"exclude"

option in the recipe and pass the dataset name in order to exclude annotations from the previous days?

I am a bit confused about which setting to use for what and would really appreciate your help.
Thanks a lot.

PS: Here is the code for our custom ner.manual recipe:

import logging
from collections import Counter
from typing import List

import prodigy
import spacy
from prodigy.components.loaders import JSONL
from prodigy.components.preprocess import add_tokens
from prodigy.util import split_string


@prodigy.recipe(
   "ner.manual_stats",
   dataset=("The dataset to use", "positional", None, str),
   spacy_model=("The base model", "positional", None, str),
   source=("The source data as a JSONL file", "positional", None, str),
   label=("One or more comma-separated labels", "option", "l", split_string)
)
def ner_manual_stats(
   dataset: str,
   spacy_model: str,
   source: str,
   label: List[str]
):
   """
   Mark spans manually by token. Requires only a tokenizer and no entity
   recognizer, and doesn't do any active learning.
   """
   # Load the spaCy model for tokenization
   nlp = spacy.load(spacy_model)
   # Counters for annotation stats already in the DB and for the current session
   stats_session = Counter()
   stats_dataset_db = Counter()

   # Load the stream from a JSONL file and return a generator that yields a dictionary for each example in the data.
   stream = JSONL(source)

   # Tokenize the incoming examples and add a "tokens" property to each example.
   stream = add_tokens(nlp, stream)

   def on_load(controller):
       # Check if current dataset is available in database. The on_load callback receives the controller as an
       # argument, which exposes the database via controller.db
       if dataset in controller.db:
           examples = controller.db.get_dataset(dataset)
           for eg in examples:
               stats_dataset_db[eg["answer"]] += 1
               if "spans" in eg.keys():
                   for span in eg["spans"]:
                       stats_dataset_db[span["label"]] += 1

   def update(answers):
       nonlocal stats_session
       for eg in answers:
           stats_session[eg["answer"]] += 1
           if "spans" in eg.keys():
               for span in eg["spans"]:
                   stats_session[span["label"]] += 1

   def on_exit(controller):
       logger = logging.getLogger()
       logger.setLevel("INFO")
       sh = logging.StreamHandler()
       sh.setLevel(logging.DEBUG)
       logger.addHandler(sh)

       logger.info("Annotations previously stored in DB for dataset {}:".format(dataset))
       logger.info("Total:\t "+str(stats_dataset_db["accept"]+stats_dataset_db["reject"]+stats_dataset_db["ignore"]))

       if len(stats_dataset_db.keys()) != 0:
           logger.info("annotated entities in DB: ")
           for key, value in stats_dataset_db.items():
               if key not in ("accept", "reject", "ignore"):
                   logger.info("{}:\t {}".format(key, value))

       logger.info("Annotations for this session:")
       logger.info("Total:\t "+str(stats_session["accept"]+stats_session["reject"]+stats_session["ignore"]))

       if len(stats_session.keys()) != 0:
           logger.info("annotated entities for this session: ")
           for key, value in stats_session.items():
               if key not in ("accept", "reject", "ignore"):
                   logger.info("{}:\t {}".format(key, value))

   return {
       "view_id": "ner_manual",  # Annotation interface to use
       "dataset": dataset,  # Name of dataset to save annotations
       "stream": stream,  # Incoming stream of examples
       "update": update,  # Called whenever new answers are received
       "on_load": on_load,  # Called on first load
       "on_exit": on_exit,  # Called when Prodigy server is stopped
       "config": {  # Additional config settings, mostly for app UI
           "lang": nlp.lang,
           "labels": label,  # Selectable label options
       },
   }

hi @JulieSarah!

Thanks for your posts.

If it were me, I would start by changing the exclude_by setting to "input" (its default is "task"). This will exclude any annotations that have the same input. Since you want only one annotation per text (input), this may do the trick.
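That is, in your prodigy.json:

"exclude_by": "input"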

Your first option (get_stream with dedup=True) makes sense, but you may find it's still deduplicating simply by task. I haven't used rehash much, but I don't think it will help here, as it would just treat every incoming example like a new example (i.e., rehash it).

The exclude option works great if you have known examples you want to exclude. The problem is that if new examples are being annotated at the same time, your exclude list may not be up to date in real time.
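For example, in your custom recipe, that would be a small change to the components you return; a sketch (excluding the current dataset, so annotations already saved to it are filtered out):

return {
    "view_id": "ner_manual",
    "dataset": dataset,
    "stream": stream,
    "update": update,
    "on_load": on_load,
    "on_exit": on_exit,
    "exclude": [dataset],  # list of dataset names whose annotations are excluded from the stream
    "config": {"lang": nlp.lang, "labels": label},
}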

Related: how often are your annotators answering simultaneously? Even if you have annotators who answer very quickly or at the same time, I don't think that alone would be the sole reason you're having issues.

See this post:

We've been working this year on an experimental version that overhauls the database. The problem is that there are many possible causes of duplication; it's a much harder problem than many realize until they get into it. One issue was corrected in v1.11.7, but several others still existed. It's interesting that you're reporting the issue after upgrading. I'll make a note of it.

We're preparing to release v1.12 soon (in a few weeks), which will incorporate this new database. This is in addition to a v2 release early next year. We decided to push out v1.12 so teams like yours can test the new database before we ship the larger overhaul in v2. One part of this change moves the ORM from peewee to SQLAlchemy, so any existing annotations will need to be migrated, which is why we provide a migration script.

For now, I would suggest trying the changes mentioned above, like "exclude_by": "input", knowing that you may still get duplicates. If you do, then either try out the experimental version or wait a few weeks for the new v1.12 to come out and test that.

Sorry there isn't a simple solution, but hopefully you can see that there are many changes on the way to address these issues for the long term :slight_smile:

Hi Ryan!

Thanks a lot for your quick reply.

We set "exclude_by": "input" in our prodigy.json as you suggested, but it does not solve the problem.

It turns out that the problem also arises in single-session mode, not only for multi-user sessions.
In Prodigy version 1.11.8, I start the custom recipe locally with

prodigy ner.manual_stats my_dataset blank:en dummy_data.jsonl --label A,B,C -F recipe_manual_stats.py

and I annotate examples 1 to 10 at http://localhost:8080/
Then I stop the server and close the tab.
If I restart a bit later with the same command and open http://localhost:8080/ I have to annotate examples 1 to 10 again.

However, if I do the exact same steps with Prodigy version 1.10.7, then the second annotation session starts with example 11 and I do not get to annotate examples 1 to 10 again.

It seems that in version 1.11.8, I need to change

stream = JSONL(source)
to
stream = get_stream(source, loader='jsonl', rehash=True, dedup=True, input_key="text")

in order to filter out examples of the current dataset that have already been annotated in the past, even though I have set

"auto_exclude_current": true

in the config.
This behaviour seems different from version 1.10.7.
Is this intended?

Thanks again.

Great, this makes it easier to isolate the problem at this level.

Do you click the save button after annotating but before closing out the browser and stopping the server?

If not, that could be the issue. However, I suspect you did and just didn't mention it.

Related: for testing, you can set instant_submit to instantly submit answers to the database (the downside is that answers are not kept in the app's history).
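That is, in your prodigy.json:

"instant_submit": true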

I'll need to check into the version changes since 1.10.7. I'll get back to you if we can find any specific changes that are causing this.

FYI, "auto_exclude_current": true is the default. This has been the case since v1.6, so it would be true for both versions (1.10.7 and 1.11.8).

Have you worked with Prodigy's logging tools? You can enable logging with either PRODIGY_LOGGING=basic or PRODIGY_LOGGING=verbose. This lets you see all the steps the server takes and can be very helpful for debugging.
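For example, reusing your command from above:

PRODIGY_LOGGING=basic prodigy ner.manual_stats my_dataset blank:en dummy_data.jsonl --label A,B,C -F recipe_manual_stats.py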

Also, I'm a bit surprised "exclude_by": "input" didn't work. Can you check in the logging that it was actually applied? Sometimes users who modify config settings accidentally end up with multiple config files. I don't suspect that's the case here, but I've certainly done it myself several times when debugging. The Prodigy logging helps you be 100% sure the config settings you're using are the ones you intend.

Let me know if you can confirm you were saving annotations or if you can find out more from using Prodigy's logging.

Hi Ryan,
I will reply more thoroughly tomorrow, but just to orient you:

I had no bug with the previous version.
I have this bug with the new version.

I am just running a dummy session, alone, without quitting, with the new version and the custom recipe, and I get duplicates.

I am telling you this because I think you are going in the wrong direction.

Do you have any thoughts about the code in my custom recipe? We think the issue is due to the stream; could you help us with it?

Thank you

hi @JulieSarah!

Yes! You are correct. I had some time to run your custom recipe and was able to reproduce the same issue with v1.11.8 when using the JSONL loader. It was all about get_stream.

I spoke with one of our core developers, who confirmed that around v1.11 (over a year ago) Prodigy switched to get_stream to give more flexibility beyond the JSONL loader. Using get_stream will also work with the new v1.12 that we will be releasing soon.
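For reference, a minimal sketch of what that change could look like in your recipe (assuming v1.11.x, where I believe get_stream can be imported from prodigy.components.loaders):

from prodigy.components.loaders import get_stream
from prodigy.components.preprocess import add_tokens

# Instead of: stream = JSONL(source)
stream = get_stream(source, loader="jsonl", rehash=True, dedup=True, input_key="text")
stream = add_tokens(nlp, stream)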

The move to v2 in a few months will also include a small change to get_stream to make it more type-safe, since v2 uses pydantic. It only adds one new argument, so it won't change a lot, but it could still be significant, so definitely be on the lookout when we release v2.

Given these changes, and to prevent breaking changes like this in upcoming releases, we're starting to prepare communications for both v1.12 and v2, which should include social media threads, updated docs, and a dedicated FAQ post on the support forum. So keep a lookout for details.

Hope this answers your question. Let us know if you have other questions!

Thanks @ryanwesslen for pointing us to the right correction for our code :slight_smile:

We will test your modification then. And yes, it would indeed be helpful for others to see this new implementation propagated through your documentation and logs.

Wishing you a great day

Julie