So let's extend the example.
I'll be annotating this on-theme example:
```json
{"text": "a wood chuck could chuck a lot of wood if a wood chuck could chuck wood"}
```
Again, I'll run:
```
PRODIGY_CONFIG_OVERRIDES='{"feed_overlap": true}' python -m prodigy textcat.manual issue-6044 examples.jsonl --label truthy
```
And now I'll annotate this with the guybrush user. This user did not appear before. And for good measure, I'll show the annotation from `db-out`:
```
> python -m prodigy db-out issue-6044 | grep guybrush
{"text":"a wood chuck could chuck a lot of wood if a wood chuck could chuck wood","_input_hash":-1690856185,"_task_hash":1885086500,"label":"truthy","_view_id":"classification","answer":"accept","_timestamp":1666878830,"_annotator_id":"issue-6044-guybrush","_session_id":"issue-6044-guybrush"}
```
Let's now see what happens when we review this item.
Without auto-accept
```
prodigy review issue-6044-reviewed issue-6044
```
I don't make an annotation, but the interface shows the single annotator just fine. Note that `db-out`, as expected, doesn't have anything from Guybrush.
```
python -m prodigy db-out issue-6044-reviewed | grep guybrush
# EMPTY!
```
With auto-accept
```
prodigy review issue-6044-reviewed issue-6044 --auto-accept
```
It doesn't show the annotation now!
But! Does it appear in the reviewed dataset automatically, like before?
```
python -m prodigy db-out issue-6044-reviewed | grep guybrush
# STILL EMPTY!
```
The example with "wood chucks" doesn't appear in `db-out` because it's never been annotated by more than one person.
Back to Your Issue
It could be that there are hard duplicates in your data because the data got merged the wrong way earlier. If that's the case, you might be able to alleviate the pain by trying the `--rehash` flag in the `db-merge` recipe and re-running.
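For instance, something like this (the dataset names here are placeholders; swap in your own):

```
# merge into a fresh dataset, recomputing the hashes along the way
python -m prodigy db-merge dataset_a,dataset_b merged-rehashed --rehash
```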
Another thing you can consider is to do some analysis in a Jupyter notebook. If you're savvy with pandas, you should be able to load the JSONL file via:
```python
import pandas as pd

pd.read_json("path.jsonl", lines=True)
```
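From there, a quick way to check for hard duplicates is to count how often each hash pair occurs. A minimal sketch, assuming the export has the `_input_hash` and `_task_hash` columns shown in the `db-out` output above:

```python
import pandas as pd

df = pd.read_json("path.jsonl", lines=True)

# an (input hash, task hash) pair that occurs more than once
# is a hard duplicate of the same task
counts = df.groupby(["_input_hash", "_task_hash"]).size()
print(counts[counts > 1])
```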
Alternatively, you might enjoy my clumper util library. It's a lot slower than pandas, but it's typically more expressive for nested lists of dictionaries.
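For the same duplicate check, something along these lines should work (a sketch; the exact aggregation verbs are worth checking against the clumper docs):

```python
from clumper import Clumper

# group annotations by input hash, count them, and keep the hashes
# that occur more than once; collect() returns a plain list of dicts
(
    Clumper.read_jsonl("path.jsonl")
    .group_by("_input_hash")
    .agg(n=("_input_hash", "count"))
    .keep(lambda d: d["n"] > 1)
    .collect()
)
```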