Scaling annotation projects is hard. More scale means more complexity, which can make accounting for all of your annotations challenging, time-consuming, error-prone, or even impractical.
Unfortunately, many ML engineers and developers don't realize this until they scale up, risking wasted time and effort after labeling hundreds or thousands of annotations. As you dig deeper, you'll also find trade-offs ranging from memory to network latency to human-centered issues with annotators. In Prodigy, we've set smart defaults to minimize problems with duplicates and missing data, but there are edge cases that require you to make choices for your workflow.
We're developing this doc to guide you through the challenging world of annotation accounting, and we plan to add more detailed documentation to help you avoid these issues.
If you think you have duplicate or missing records, please review this checklist before posting a new thread, as you may find an answer to your problem. If not, we'll be able to help you much faster once these common hiccups have been ruled out.
1. Check your Prodigy version
You can get this by running `prodigy stats`. If you submit a ticket for anything, please provide this output.
```
$ prodigy stats

============================== ✨ Prodigy Stats ==============================

Version          1.12.4
Location         /Users/ryan/Documents/prodigy/venv/lib/python3.9/site-packages/prodigy
Prodigy Home     /Users/ryan/.prodigy
Platform         macOS-13.4.1-arm64-arm-64bit
Python Version   3.9.17
Spacy Version    3.6.0
Database Name    SQLite
Database Id      sqlite
Total Datasets   3
Total Sessions   6
```
Tip: Did you know that you can view all of Prodigy's built-in recipes locally? Since Prodigy is just a Python library, the `Location:` path in your `prodigy stats` output will show you where your Prodigy library is installed and, thus, where your built-in recipes are located. Look for the `prodigy/recipes` folder.
We've fixed many bugs, added new features, and improved our logging around missing or duplicate records. If you're using an earlier version of Prodigy, your issue may already have been fixed or affected by more recent changes. Here's a list of recent fixes and changes (as of July 2023; see the changelog for more recent updates):
- Fix an issue where the "Save" button could be clicked twice and save a duplicate answer to the database.
- Add logging from the frontend to the backend if the frontend ever receives a batch with duplicate tasks.
- Fix a front-end issue to prevent duplicate examples from being shown to annotators, specifically in higher-latency, production scenarios.
- Added a new Controller to facilitate annotation workflow customization.
- Added support for task routing, allowing you to customise who annotates each example.
- Added `annotations_per_task` setting to easily configure a task router for partial annotator overlap.
- Added `allow_work_stealing` setting in `prodigy.json` that allows you to turn off work stealing.
- Fix an issue where some unsaved examples could be lost during a browser refresh.
- Automatically prevent duplicates from appearing in training and evaluation sets in `train` and `data-to-spacy`.
- Fix issue that could cause stream to repeat batches of questions in some scenarios.
- The `force_stream_order` config setting is now deprecated and is the default behavior of the feeds. Batches are now always sent and re-sent in the same order wherever possible.
- Fix issue that could cause the next batch to be blocked when using `"instant_submit": true`.
- Warn after exhausting streams with many duplicates.
2. If you're missing data, do you have duplicates in your input (source)?
A common problem for new users is not realizing that, by default, Prodigy's built-in recipes dedupe your source file. To do this, Prodigy's built-in recipes automatically hash your source file.
When a new example comes in, Prodigy assigns it two hashes: the input hash and the task hash. Both hashes are integers, so they can be stored as JSON with each task. Based on those hashes, Prodigy can determine whether two examples are entirely different, different questions about the same input (e.g. the same text), or the same question about the same input.
See the docs for more details, or these posts on how hashing works:
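To make the two-hash idea concrete, here's a simplified, hypothetical sketch using only the standard library. Note this is an illustration of the concept, not Prodigy's actual `set_hashes` implementation, which uses its own keys and algorithm:

```python
import hashlib
import json

def stable_hash(obj) -> int:
    # Hash the JSON-serialized object with sorted keys, so equal
    # content always produces the same integer.
    payload = json.dumps(obj, sort_keys=True).encode("utf-8")
    return int.from_bytes(hashlib.sha1(payload).digest()[:8], "big")

def add_hashes(task: dict, label: str) -> dict:
    # Input hash: based only on the raw input (here, the text).
    task["_input_hash"] = stable_hash({"text": task["text"]})
    # Task hash: the input *plus* the question being asked about it.
    task["_task_hash"] = stable_hash({"text": task["text"], "label": label})
    return task

a = add_hashes({"text": "Apple buys U.K. startup"}, label="ORG")
b = add_hashes({"text": "Apple buys U.K. startup"}, label="PERSON")
assert a["_input_hash"] == b["_input_hash"]  # same input
assert a["_task_hash"] != b["_task_hash"]    # different question
```

Two tasks with the same input but different hashes are "different questions about the same text"; two tasks with equal task hashes are duplicates.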
3. If you're missing annotations or have duplicates, did your annotators remember to save at the end of their session?
When you load a source file into Prodigy for annotation, Prodigy doesn't serve all the records at the same time. Instead, Prodigy will request batches of examples to annotate (by default, `batch_size` is set to 10) and send back batches of answers in the background (or whenever you hit save). As you annotate, batches of examples will be cleared from your outbox and sent back to the Prodigy server to be saved in the database. You can also hit the save button in the top-left corner or press COMMAND+S (on Mac) or CTRL+S to clear your outbox and manually save.
What's challenging is that sometimes annotators are confident they completed some annotations, yet those batches are not in your database. It's very common for new annotators to forget to click the Save button when they've finished their last batch, so they're sure they did the work even though you don't see it in the database. This is where logging can help, as you can verify whether records were actually sent to the database or not.
If you're worried that your annotators may not remember to save their last batch, one alternative is to modify your configuration and set `"instant_submit": true` to submit each answer to the database instantly. The downside is that nothing is kept in the history, so annotators can't go back to previous answers. Like almost everything, there are trade-offs.
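For example, in your `prodigy.json`:

```json
{
  "instant_submit": true
}
```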
Alternatively, duplicates can happen if annotators forget to save their last batch and leave their browser open for some time. Work stealing can then re-assign those apparently abandoned annotations to another annotator. When the original annotator returns to their browser, realizes they never saved their annotations, and saves them, you now have duplicates.
This is where logging can help immensely to keep track of exactly when annotations were made. For example, you can see when work stealing occurs in the logs if one session (let's call it `steve`) steals the item with hash 13687435343 from another session, `frank`:
```
07:34:18: POST: /get_session_questions
07:34:18: CONTROLLER: Getting batch of questions for session: steve
07:34:18: SESSION: steve has stolen item with hash 13687435343 from frank
```
4. If you have duplicates and are using a custom recipe: are you using `get_stream`, not `JSONL`?
If you want to dedupe your source, avoid using `JSONL` and opt for `get_stream` when loading `.jsonl` files.
As of Prodigy 1.12, the recommended way to load a source is via the `get_stream` utility from the refactored `Stream` component. The recommended way to preprocess the stream, e.g. in order to add tokens or split sentences, is via Stream's `apply` method.
`JSONL` will load files without hashing and deduplication. `get_stream` is used in built-in recipes and is how Prodigy's default hashing and deduplication is implemented.
It's important to note that older recipe examples (e.g., v1.11 or before 2021) may use `JSONL`, so be especially aware of this if you're using an older version of Prodigy or older recipes.
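To see why this matters, here's a hedged, stdlib-only sketch contrasting the two behaviors. The plain line-by-line load mimics `JSONL` (no hashing, duplicates pass through), while the second function deduplicates in the spirit of `get_stream`; Prodigy's real implementation differs:

```python
import json
from io import StringIO

# A small in-memory stand-in for a .jsonl source file with a duplicate.
RAW = StringIO("\n".join(
    json.dumps({"text": t}) for t in ["Hello", "World", "Hello"]
))

def load_jsonl(fh):
    # Plain line-by-line load: no hashing, duplicates pass through.
    return [json.loads(line) for line in fh if line.strip()]

def load_deduped(tasks):
    # Hash-based dedup, conceptually what get_stream adds on top.
    seen, out = set(), []
    for task in tasks:
        key = json.dumps(task, sort_keys=True)
        if key not in seen:
            seen.add(key)
            out.append(task)
    return out

tasks = load_jsonl(RAW)
assert len(tasks) == 3                # raw load keeps the duplicate
assert len(load_deduped(tasks)) == 2  # deduped stream drops it
```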
5. Modifying Prodigy's configuration
Have you modified your `prodigy.json` or any configuration overrides? If so, make sure it's doing what you intended and that you're using the configuration you actually mean to (e.g., global in your Prodigy Home, local in your working folder, or overrides).
If you're iterating on multiple `prodigy.json` files or config overrides, it can be easy to accidentally run an incorrect `prodigy.json`. Prodigy's logging can help, as it will provide details like which `prodigy.json` file you're using (`12:58:16: CONFIG: Using config from global prodigy.json`). To use it, simply add `PRODIGY_LOGGING=basic` for basic logging or `PRODIGY_LOGGING=verbose` for detailed logging. For example, `basic` will show you whether you're loading a global or local `prodigy.json`, while `verbose` will show the exact path:
```
13:01:05: CONFIG: Using config from global prodigy.json
13:01:05: /Users/ryan/.prodigy/prodigy.json
```
Logging can also show you whether duplicates were filtered, which overrides are active, and many other helpful details. If you report any issues, you'll get a much faster resolution by providing your logs.
Example, as a global config (`~/.prodigy/prodigy.json`):

```json
{
  "feed_overlap": true
}
```

Or as an override:

```
export PRODIGY_CONFIG_OVERRIDES='{"feed_overlap": true}'
```
First check your `feed_overlap`, which configures how examples should be sent out across multiple sessions. If `true`, each example in the dataset will be sent out once for each session, so you'll end up with overlapping annotations (e.g. one per example per annotator). Setting `"feed_overlap"` to `false` will send out each example in the data once to whoever is available. As a result, your data will have each example labelled only once in total.
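Conceptually, the two settings distribute work like this. This is an illustrative sketch only, not Prodigy's routing code, and the round-robin split is a stand-in for "whoever is available asks first":

```python
def route(examples, sessions, feed_overlap):
    # Illustrative only: Prodigy actually serves batches on demand.
    if feed_overlap:
        # Every session annotates every example: full overlap.
        return {s: list(examples) for s in sessions}
    # Each example goes out exactly once; round-robin approximates
    # "sent to whoever is available".
    out = {s: [] for s in sessions}
    for i, ex in enumerate(examples):
        out[sessions[i % len(sessions)]].append(ex)
    return out

examples = ["ex1", "ex2", "ex3", "ex4"]

overlap = route(examples, ["steve", "frank"], feed_overlap=True)
assert overlap["steve"] == examples  # everyone sees everything
assert overlap["frank"] == examples

split = route(examples, ["steve", "frank"], feed_overlap=False)
assert split["steve"] == ["ex1", "ex3"]  # each example labelled once
assert split["frank"] == ["ex2", "ex4"]
```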
Here's a post that shows an example of how `"feed_overlap": true` vs. `"feed_overlap": false` behaves (aka what "overlapping annotations" looks like):
Tip: When working with Prodigy config in VS Code, Vincent has a helpful trick to create a schema that enables autocomplete when modifying your `prodigy.json`. This can help reduce the chance of typos (and headaches) in your config.
If you're using Prodigy v1.12.0 or newer, we introduced Task Routing, which lets you implement partial overlap and customize how the annotation workload is distributed among multiple annotators. It's very important to read the sections on why you should specify `PRODIGY_ALLOWED_SESSIONS` up front, as well as the use case showing how you can still get duplicates and a solution that prevents them by checking the state in the database. Task Routing is considered a more advanced feature, and we strongly recommend carefully reading the docs or watching Vincent's Task Routing tutorial video before using it in production or for a large-scale annotation project.
There are other important concepts to be aware of that can modify your annotation allocations, like work stealing or the `annotations_per_task` setting, so be sure to read through the Task Routing docs carefully. Here are a few helpful posts:
I still have missing or duplicate data: what should I do?
Create a reproducible example for us. The more we can reproduce on our end, the faster we can diagnose bugs and provide workarounds and fixes.
This would include:
- Prodigy version: show the output of `prodigy stats`.
- The full Prodigy command you're using, like `prodigy ner.manual my_ner blank:en ./news_headlines.jsonl --label ORG,LOCATION`.
- If using a custom recipe, please provide it.
- Your `prodigy.json` or any overrides. Remember, your `prodigy.json` location can be found by running `prodigy stats`.
- Prodigy logs. `verbose` logs are ideal, but be aware they may include your sensitive data, so please scrub them (or provide the important snippets).
- Sample data, ideally in `.jsonl` format. We recommend this file (which we use internally): nyt_text_dedup.jsonl (18.5 KB)
This file is nice for debugging missing or duplicate records for two reasons. First, it's a deduped version of our `news_headlines.jsonl` file: there are exactly 176 records, each with a unique news headline. Second, each record includes as a meta key an index, starting at 0 (we're Pythonistas, of course) up to 175. You can do this for any file (which we highly recommend when debugging missing/duplicate records): just add a meta tag with the index like:
{"text": "The first sentence.", "meta": {"i": 0}}
{"text": "The second sentence.", "meta": {"i": 1}}
{"text": "The third sentence.", "meta": {"i": 2}}
...
By doing this, not only will you have a human-readable index for each record, but annotators will also see this index in their annotations. That means annotators may notice when a number is skipped or duplicated and can better report when, and under what circumstances, this is occurring.
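If your source file doesn't have these indices yet, a few lines of standard-library Python can add them. The `add_meta_index` helper below is hypothetical, not part of Prodigy:

```python
import json

def add_meta_index(lines):
    # Attach a zero-based index under "meta" so annotators (and you)
    # can spot skipped or repeated records at a glance.
    out = []
    for i, line in enumerate(lines):
        task = json.loads(line)
        task.setdefault("meta", {})["i"] = i
        out.append(json.dumps(task))
    return out

src = [
    '{"text": "The first sentence."}',
    '{"text": "The second sentence."}',
]
indexed = add_meta_index(src)
assert json.loads(indexed[0])["meta"] == {"i": 0}
assert json.loads(indexed[1])["meta"] == {"i": 1}
```

Run it over the lines of your `.jsonl` source and write the result back out before starting your annotation server.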