Help! I have duplicates or missing data: Best practices on accounting for annotations

Scaling annotations can be really hard. More scale means more complexity, which can make accounting for all of those annotations challenging, time-consuming, overwhelming, error-prone, or even impractical.

Unfortunately, many ML engineers and developers don't realize this until they scale their annotations, risking wasted time and effort after labeling hundreds or thousands of annotations. As you dig deeper, you'll also find trade-offs ranging from memory to network latency to human-centered issues with annotators. In Prodigy, we've set smart defaults to minimize problems with duplicates and missing data, but sometimes there are edge cases that require you to make choices for your workflow.

We're developing this doc to help guide you through the challenging world of annotation accounting. The good news: we've been working on this and want to develop more detailed documentation to help you avoid issues.

If you think you have duplicate or missing records, please review this checklist before posting a new thread, as you may find an answer to your problem. If not, we'll be able to help you much faster once these common hiccups have been ruled out.

1. Check your Prodigy version

You can get this by running prodigy stats. If you submit a ticket for anything, please provide this output.

$ prodigy stats

============================== ✨  Prodigy Stats ==============================

Version          1.12.4                        
Location         /Users/ryan/Documents/prodigy/venv/lib/python3.9/site-packages/prodigy
Prodigy Home     /Users/ryan/.prodigy          
Platform         macOS-13.4.1-arm64-arm-64bit  
Python Version   3.9.17                        
Spacy Version    3.6.0                         
Database Name    SQLite                        
Database Id      sqlite                        
Total Datasets   3                            
Total Sessions   6      

Tip: Did you know that you can view all of your Prodigy built-in recipes locally? Since Prodigy is just a Python library, the Location: path in your prodigy stats will show you where your Prodigy library is installed and, thus, where your built-in recipes are located. Look for the prodigy/recipes folder. :rocket:

We've fixed many bugs, added new features, and improved our logging to help you avoid missing or duplicate records. If you're using an earlier version of Prodigy, your issue may have already been fixed or may be affected by more recent changes. Here's a list of recent fixes and changes (as of July 2023; see the changelog for more recent updates):

v1.12.3 2023-07-17:

  • Fix an issue where the “Save” button could be clicked twice and save a duplicate answer to the database.
  • Add logging from the frontend to the backend if the frontend ever receives a batch with duplicate tasks.
  • Fix a front-end issue that could cause duplicate examples to be shown to annotators, specifically in higher-latency production scenarios.

v1.12.0 2023-07-05

  • Added a new Controller to facilitate annotation workflow customization.
  • Added support for task routing, allowing you to customise who annotates each example.
  • Added annotations_per_task setting to easily configure a task router for partial annotator overlap.
  • Added allow_work_stealing setting in prodigy.json that allows you to turn off work stealing.

v1.11.9 2023-01-23:

  • Fix an issue where some unsaved examples could be lost during a browser refresh.

v1.11.8 2022-07-20:

  • Automatically prevent duplicates from appearing in training and evaluation set in train and data-to-spacy.

v1.11.4 2021-09-13:

  • Fix issue that could cause stream to repeat batches of questions in some scenarios.

v1.9.0 2019-12-18:

  • The force_stream_order config setting is now deprecated and its behavior is now the default for feeds. Batches are now always sent and re-sent in the same order wherever possible.
  • Fix issue that could cause next batch to be blocked when using "instant_submit": true.

v1.8.5 2019-10-19:

  • Warn after exhausting streams with many duplicates.

2. If you're missing data, do you have duplicates in your input (source)?

A common problem for new users is not realizing that, by default, Prodigy's built-in recipes dedupe your source file. To do this, Prodigy's built-in recipes automatically hash your source file.

When a new example comes in, Prodigy assigns it two hashes: the input hash and the task hash. Both hashes are integers, so they can be stored as JSON with each task. Based on those hashes, Prodigy can determine whether two examples are entirely different, different questions about the same input (e.g., the same text), or the same question about the same input.

See the docs for more details, or these posts on how hashing works:

3. If you're missing annotations or have duplicates, did your annotators remember to save at the end of their session?

When you load a source file into Prodigy for annotation, Prodigy doesn't serve all the records at once. Instead, Prodigy requests batches of examples to annotate (by default, batch_size is set to 10) and sends back batches of answers in the background (or whenever you hit save). As you annotate, batches of examples are cleared from your outbox and sent back to the Prodigy server to be saved in the database. You can also hit the save button in the top-left corner or press COMMAND+S (on Mac) or CTRL+S to clear your outbox and manually save.

What's challenging is that sometimes annotators are confident they completed some annotations, yet those batches aren't in your database. It's very common for new annotators to forget to click the Save button when they've finished their last batch, so work they're sure they did never reaches the database. This is where logging can help, as you can verify whether records were actually sent to the database.

If you're worried that your annotators may not remember to save their last batch, one alternative is to modify your configuration and set "instant_submit": true to instantly submit each answer to the database. However, the downside is that answers aren't kept in the annotator's history, so they can't go back and change them. Like almost everything, there are trade-offs.

Alternatively, duplicates can happen if annotators forget to save their last batch and leave their browser for some time. Work stealing can then re-assign those apparently abandoned annotations to another annotator. Then the original annotator returns to their browser, realizes they never saved their annotations, saves them, and now you have duplicates.

This is where logging can help immensely to keep track of exactly when annotations were made. For example, you can see in the logs when work stealing occurs, e.g. if one session (let's call it steve) steals the item with hash 13687435343 from another session, frank:

07:34:18: POST: /get_session_questions
07:34:18: CONTROLLER: Getting batch of questions for session: steve
07:34:18: SESSION: steve has stolen item with hash 13687435343 from frank

4. If you have duplicates and are using a custom recipe: are you using get_stream, not JSONL?

If you want to dedupe your source, avoid the JSONL loader and opt for get_stream when loading .jsonl files.

As of Prodigy v1.12, the recommended way to load a source is via the get_stream utility from the refactored Stream component. The recommended way to preprocess the stream, e.g. to add tokens or split sentences, is via Stream's apply method.

The JSONL loader will load files without hashing and deduplication. get_stream is used in built-in recipes and is how Prodigy's default hashing and deduplication is implemented.

It's important to note that older recipe examples (e.g., v1.11 or before 2021) may use JSONL, so be especially aware of this if you're using older versions of Prodigy or older recipes.

5. Modifying Prodigy's configuration

Have you modified your prodigy.json or any configuration overrides?

If so, make sure it's doing what you intended and that you're using the configuration you actually mean to (e.g., global in your Prodigy Home, local in your working folder, or overrides).

If you're iterating on multiple prodigy.json files or config overrides, it can be easy to accidentally run an incorrect prodigy.json. Prodigy's logging can help, as it will provide details like which prodigy.json file you're using (12:58:16: CONFIG: Using config from global prodigy.json). To use it, simply add PRODIGY_LOGGING=basic for basic logging or PRODIGY_LOGGING=verbose for detailed logging. For example, basic will show you whether you're loading a global or local prodigy.json, while verbose will show the exact path:

13:01:05: CONFIG: Using config from global prodigy.json
13:01:05: /Users/ryan/.prodigy/prodigy.json

Logging can also show you when duplicates are filtered out, which overrides are applied, and many other helpful details. If you report any issues, you'll get a much faster resolution by providing your logs.

Example:

As a global config: ~/.prodigy/prodigy.json

{
    "feed_overlap": true
}

Or as an override

export PRODIGY_CONFIG_OVERRIDES='{"feed_overlap": true}'

First, check your feed_overlap setting, which configures how examples should be sent out across multiple sessions. If true, each example in the dataset will be sent out once for each session, so you'll end up with overlapping annotations (e.g., one per example per annotator). Setting "feed_overlap" to false will send out each example in the data once to whoever is available. As a result, your data will have each example labelled only once in total.

Here's a post that shows an example of "feed_overlap": true vs. "feed_overlap": false (aka what "overlapping annotations" looks like):

Tip: When working with your Prodigy config in VS Code, Vincent has a helpful trick to create a schema that enables autocomplete when modifying your prodigy.json. This can help reduce the chance of typos (and headaches) in your config :slight_smile: .

If you're using Prodigy v1.12.0 or newer, we've introduced Task Routing, which lets you set up partial overlap and customize how the annotation workload is distributed among multiple annotators. It's very important to read the sections on why you should specify PRODIGY_ALLOWED_SESSIONS up front, as well as the use case showing how you can still get duplicates and a solution to prevent them by checking the state in the database. Task Routing is considered a more advanced feature, and we strongly recommend carefully reading the docs or watching Vincent's Task Routing tutorial video before using it in production or at a large annotation scale.

There are other important concepts to be aware of that can modify your annotation allocations, like work stealing or the annotations_per_task setting shown above, so be sure to read through the Task Routing docs carefully. Here are a few helpful posts:

I still have missing or duplicate data: what should I do?

Create a reproducible example for us. The more we can reproduce on our end, the faster we can diagnose bugs and provide you workarounds and fixes.

This would include:

  • Prodigy version: Show output for prodigy stats
  • The full prodigy command you're using, like prodigy ner.manual my_ner blank:en ./news_headlines.jsonl --label ORG,LOCATION
  • If using a custom recipe, please provide it.
  • prodigy.json or any overrides. Remember, your prodigy.json location can be found by running prodigy stats.
  • Prodigy logs. Verbose logs are ideal, but be wary that they may include sensitive data, so please scrub them (or provide the important snippets).
  • Sample data -- ideally in .jsonl format. We recommend this file (which we use internally): nyt_text_dedup.jsonl (18.5 KB)

This file is nice for debugging missing/duplicates for two reasons. First, it's a deduped version of our news_headlines.jsonl file. There are exactly 176 records, each with a unique news headline.

Second, each record includes under its meta key an index, starting at 0 (we're Pythonistas, of course :snake: ) and running up to 175. You can do this for any file -- which we highly recommend when debugging missing/duplicates -- just add a meta tag with the index like:

{"text": "The first sentence.", "meta": {"i": 0}}
{"text": "The second sentence.", "meta": {"i": 1}}
{"text": "The third sentence.", "meta": {"i": 2}}
...

By doing this, not only will you have a human-readable index for each record, but annotators will also see this index in their annotations. They may then notice when a number is skipped or duplicated and can better report when, and under what circumstances, this occurs.


Hi Ryan - thanks for the walkthrough here.

Despite my best efforts, I'm still getting duplicates. I have a jsonl file with 88 rows, and no manner of trying to set hashes or deduplicate keeps the task pile to 88. Here's what I'm seeing: (Loom video)

Here's sample data, the custom recipe, and prodigy.config

Please let me know if I should move this to another thread for troubleshooting my particular case or if you need any other information. Thank you!

hi @kevinw,

First off - big thank you for the detailed info. You're saving days of going back and forth so this is tremendously helpful.

Just curious - how are the hashed IDs ("id": "12de6241-41d7-4d9c-b39b-e6cefa1753b1") being created? As a function of the image_gcs_url with expiring parts? I think your intuition is right that this could be the culprit.

I'm wondering if you're duplicating Prodigy's hashing from set_hashes. Maybe it would make more sense to use the "text" and the "gcs_uri" (aka the document name) for task hashing, if that's what makes each .jsonl record unique. This way you wouldn't have to worry about expiring URL parts while ensuring you have 88 unique hashes.

Also - one other suggestion, simply for debugging: I'd recommend adding a small meta tag counter to your 88 examples, like in this file:

{"text":"This is the first example.","meta":{"i":0}}
{"text":"This is the second example.","meta":{"i":1}}

This way, the meta counter will automatically show in your UI. I'm curious, after you finish (around 3:14 in your video), what records come up (e.g., does it start back at zero, or repeat the same batch?) and what records show up again after refreshing.

A few other questions:

  • what do your logs show when you exhaust your stream (as in 3:14) and when you refresh your browser?
  • can you export out (db-out) your annotated dataset like your example video? I'm curious to see the task hashes after the fact.
  • can you provide your 3_start_prodigy_UI.py file to kick off the server?

I'll try to share with teammates too to see if they have any suggestions.

@kevinw I couldn't help but notice a few small things.

The main thing that might be relevant is your set_hashes call on line 14. If you check the docs, you'll notice the overwrite parameter. It seems you didn't set it to True. This might help explain any hashing hiccups, because without it set_hashes will just keep the existing hashes. You set rehash=True in get_stream, so those hashes will just be re-used. I'm not 100% sure, but this could be it.

I also couldn't help but notice that you use Text instead of text in your JSON blob. There's no real harm in doing this, but it might be safer to use text in the future, since that's what the other Prodigy recipes assume.

Thanks, Vincent and Ryan - appreciate you taking a look!

Unfortunately, still seeing the same behavior: (Loom video)

Regarding the strange double run of the stats recipe in the logs, here are all the start-up files.

Regarding the ids: I was originally using the text as the hash, but I was getting duplicates. That's what started me down this path of trying to set my own hashes and trying to use the duplication filters. The current ids are from the uuid package, which I was hoping would be sufficiently unique :sweat_smile:

For good measure, I also just tried to run the task with text as the input key, but the duplicate behavior still exists.
It feels redundant to load the stream in three steps like this, but none of the functions/arguments/recipes have solved the problem, so they've piled up. I'm sure there's a better way to do it than this, but I'm a product manager, so I'm a little out of my depth writing this much code :stuck_out_tongue:

stream = get_stream(source, dedup=True, rehash=True)
stream = (set_hashes(eg, input_keys=("text",), task_keys=("label",), ignore=("span", "input", "html", "image"), overwrite=True) for eg in stream)
stream = filter_duplicates(stream, by_input=True, by_task=True)

Thanks again for your help, and please let me know if there's anything else that I could provide or try on my end.

Just a debugging idea. What if you manually try to fetch the duplicates? Via something like:

mapper = {}
for ex in stream:
    input_hash = ex['_input_hash']
    if input_hash not in mapper:
        mapper[input_hash] = [] 
    mapper[input_hash].append(ex)

# Once the mapper is full, we can check input hashes with multiple objects
for input_hash, objects in mapper.items():
    if len(objects) > 1:
        print(objects)

This should allow you to spot examples that get the same _input_hash. When you run that, are you able to confirm there's no duplicate input hashes?

In the label results, the rows are identical except for the labeling timestamp. It's a little easier to see here instead of the gist: prodigy custom task duplicates - Google Sheets

Would the behavior be different during the recipe as part of the stream like in your example?