We are reviewing a dataset loaded into a Prodigy database. The JSONL file used to load the dataset is correct, but we discovered that some tokens appear duplicated in the review interface. Everything looks correct in the dataset used to load the database, and everything is also correct when we export the reviewed dataset, despite the duplicated entities we saw (no duplication in the export).
Is this a Prodigy bug?
Please find in attachment an example of the case
investigate.jsonl (6.0 KB)
Hi @evince360,
The Prodigy `review` recipe expects each token `id` to be an integer (not a string, as is the case in investigate.jsonl), and token indexing should start at 0, not 1.
The input validation process is a bit different for the `review` recipe, and it doesn't correctly handle the unexpected indexing. We'll definitely review this validation process and make sure `review` handles user-provided tokenization the same way `relations` does.
In the meantime, you should be able to fix the indexing of tokens and spans with an external Python script that subtracts 1 from each token `id`, converts it to an integer, and does the same for the token offsets in the spans.
investigate2.jsonl (3.6 KB)
Hello.
Thank you for your quick reply. We followed your suggestion, but it seems things got worse.
Did we make another mistake?
Thank you for your support
Hmm, this is odd. That's how investigate2.jsonl renders for me in the `review` UI:
It looks correct to me? (In the `relations` UI it looks the same.)
Could you share your exact command and the content of your prodigy.json?
Hello again,
I will hide some information via xxx.
Prodigy is started in our project via a linux service:
[Unit]
Description="xxx"
After=multi-user.target
[Service]
Type=simple
User=xxx
Group=xxx
Environment="http_proxy=xxx"
Environment="https_proxy=xxx"
Environment="PRODIGY_CONFIG=/opt/xxx/xxx/prodigy/xxx/prodigy.json"
Environment="PRODIGY_PORT=8100"
ExecStart=/opt/xxx/xxx/prodigy-env/bin/python3.8 -m prodigy review review_db ai_annotations --view-id relations --label rindirectdirect,rnomme
[Install]
WantedBy=multi-user.target
And here is the content of the prodigy.json file:
{
"theme": "basic",
"custom_theme": {"cardMaxWidth":"95%", "smallText":16},
"buttons": ["accept", "reject", "ignore", "undo"],
"batch_size": 10,
"history_size": 10,
"port": 8100,
"host": "0.0.0.0",
"cors": true,
"db": "sqlite",
"db_settings": {
"sqlite": {
"name": "prodigy.db",
"path": "/opt/xxx/xxx/prodigy"
}
},
"validate": false,
"auto_exclude_current": true,
"instant_submit": false,
"feed_overlap": false,
"annotations_per_task": null,
"allow_work_stealing": true,
"total_examples_target": 0,
"ui_lang": "en",
"project_info": ["dataset", "session", "lang", "recipe_name", "view_id", "label"],
"show_stats": false,
"hide_meta": false,
"show_flag": false,
"instructions": false,
"swipe": false,
"swipe_gestures": { "left": "accept", "right": "reject" },
"split_sents_threshold": false,
"html_template": false,
"global_css": null,
"javascript": null,
"writing_dir": "ltr",
"show_whitespace": false,
"exclude_by": "task",
"relations_span_labels": ["EU", "NATL", "INTL", "NOMME", "JO/OTHERS"]
}
Hi @evince360,
There's nothing in the config that should affect the way the data is rendered. Could you also share the output of the `python -m prodigy stats` command? Thank you.
Finally, I need to ask: are you 100% sure that investigate2.jsonl results in an incorrect view? It really does render correctly for me, at least with the current Prodigy version.
Hello,
I created a new database (prodigy db-in), but with the --rehash switch, and launched the review project again via the Linux service. I made one more change before launching it again.
Previously it was this:
ExecStart=/opt/xxx/xxx/prodigy-env/bin/python3.8 -m prodigy review review_db ai_annotations --view-id relations --label rindirectdirect,rci-apres
Now (and this is what I sent to you):
ExecStart=/opt/xxx/xxx/prodigy-env/bin/python3.8 -m prodigy review review_db ai_annotations --view-id relations --label rindirectdirect,rnomme
I don't know what solved the issue, but there are no more duplications.
Thank you very much.
Hi @evince360,
I'm not 100% sure what happened. I can only tell you that changing the label set affects the task hash being computed, but that has nothing to do with the tokenization, which was the problem here.
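To illustrate the principle (a toy sketch only, NOT Prodigy's actual hashing code): a task hash is typically computed over the task content together with recipe settings such as the label set, so changing `--label` produces different hashes even for identical examples:

```python
import hashlib
import json

def toy_task_hash(example, labels):
    """Toy illustration: hash the task content together with the
    label set. A different label set changes the hash even when
    the example itself is unchanged."""
    payload = json.dumps(
        {"example": example, "labels": sorted(labels)}, sort_keys=True
    )
    return hashlib.sha1(payload.encode("utf8")).hexdigest()

example = {"text": "some text"}
h1 = toy_task_hash(example, ["rindirectdirect", "rnomme"])
h2 = toy_task_hash(example, ["rindirectdirect", "rci-apres"])
assert h1 != h2  # same example, different label set, different hash
```

That's why re-importing with --rehash after changing the label set can make previously "seen" tasks look new again, but it wouldn't change how the tokens render.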
In any case, the output you sent me (investigate2.jsonl) definitely rendered correctly for me, so perhaps you weren't looking at the dataset with the corrected token ids? I'm not really sure. I'm glad it renders fine for you now.
Let us know if you come across any issues going forward.