duplicated ner token in the review recipe

evince360 · October 8, 2024, 5:24am

We are reviewing a dataset loaded into a Prodigy database. The JSONL file used to load it is correct, but we discovered that some tokens are duplicated in the review interface. The exported reviewed dataset is also correct (no duplication), even though we saw duplicated entities in the interface.
Is this a Prodigy bug?
Please find attached an example of the case.
investigate.jsonl (6.0 KB)

magdaaniol · October 8, 2024, 10:52am

Hi @evince360,

Prodigy's review recipe expects each token id to be an integer (not a string, as is the case in investigate.jsonl), and token indexing should start at 0, not 1.
Input validation works a bit differently for the review recipe, and it doesn't correctly handle the unexpected indexing. We'll definitely review this validation process and make sure review handles user-provided tokenization the same way relations does.
In the meantime, you should be able to update the token indices of the tokens and spans with an external Python script that subtracts 1 from each id and converts it to an integer, and does the same for the token offsets in spans.
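
A minimal sketch of such a script is below. It assumes the tasks follow Prodigy's usual JSONL layout: `tokens` with an `id`, `spans` with `token_start` and `token_end`, and, for relations data, `relations` with `head`, `child`, `head_span` and `child_span`. The contents of investigate.jsonl are not shown in this thread, so treat those field names as assumptions and adjust them to your data. Character offsets (`start`, `end`) are left untouched.

```python
import json


def to_zero_based(value):
    """Turn a 1-based index (int or numeric string) into a 0-based int."""
    return int(value) - 1


def fix_span(span):
    """Return a copy of a span with token_start/token_end shifted down by 1."""
    fixed = dict(span)
    for key in ("token_start", "token_end"):
        if key in fixed:
            fixed[key] = to_zero_based(fixed[key])
    return fixed


def fix_task(task):
    """Return a copy of one task with 0-based integer token indices."""
    fixed = dict(task)
    if "tokens" in task:
        fixed["tokens"] = [
            {**tok, "id": to_zero_based(tok["id"])} for tok in task["tokens"]
        ]
    if "spans" in task:
        fixed["spans"] = [fix_span(span) for span in task["spans"]]
    if "relations" in task:
        # Assumed relations layout: head/child are token indices, and
        # head_span/child_span carry their own token_start/token_end.
        relations = []
        for rel in task["relations"]:
            rel = dict(rel)
            for key in ("head", "child"):
                if key in rel:
                    rel[key] = to_zero_based(rel[key])
            for key in ("head_span", "child_span"):
                if key in rel:
                    rel[key] = fix_span(rel[key])
            relations.append(rel)
        fixed["relations"] = relations
    return fixed


def fix_file(src_path, dst_path):
    """Rewrite a JSONL file line by line, skipping blank lines."""
    with open(src_path, encoding="utf8") as src, \
            open(dst_path, "w", encoding="utf8") as dst:
        for line in src:
            if line.strip():
                fixed = fix_task(json.loads(line))
                dst.write(json.dumps(fixed, ensure_ascii=False) + "\n")
```

Calling `fix_file("investigate.jsonl", "investigate_fixed.jsonl")` would write the corrected copy (the output file name is only an example). The shift is not idempotent, so run it once on the original file rather than on already corrected data.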

evince360 · October 9, 2024, 2:30pm

investigate2.jsonl (3.6 KB)
Hello.
Thank you for your quick reply. We followed your suggestion, but things seem to have gotten worse.
Did we make another mistake?
Thank you for your support.

magdaaniol · October 10, 2024, 2:56pm

Hmm, this is odd. This is how investigate2.jsonl renders for me in the review UI:

It looks correct to me? (in relations UI it looks the same)
Could you share your exact command and the content of prodigy.json?

evince360 · October 11, 2024, 8:22am

Hello again,

I will hide some information with xxx.

Prodigy is started in our project via a Linux service:

[Unit]
Description="xxx"
After=multi-user.target
[Service]
Type=simple
User=xxx
Group=xxx
Environment="http_proxy=xxx"
Environment="https_proxy=xxx"
Environment="PRODIGY_CONFIG=/opt/xxx/xxx/prodigy/xxx/prodigy.json"
Environment="PRODIGY_PORT=8100"
ExecStart=/opt/xxx/xxx/prodigy-env/bin/python3.8 -m prodigy review review_db ai_annotations --view-id relations --label rindirectdirect,rnomme
[Install]
WantedBy=multi-user.target

And here is the content of the prodigy.json file:

{
  "theme": "basic",
  "custom_theme": {"cardMaxWidth":"95%", "smallText":16},
  "buttons": ["accept", "reject", "ignore", "undo"],
  "batch_size": 10,
  "history_size": 10,
  "port": 8100,
  "host": "0.0.0.0",
  "cors": true,
  "db": "sqlite",
  "db_settings": {
    "sqlite": {
      "name": "prodigy.db",
      "path": "/opt/xxx/xxx/prodigy"
    }
  },
  "validate": false,
  "auto_exclude_current": true,
  "instant_submit": false,
  "feed_overlap": false,
  "annotations_per_task": null,
  "allow_work_stealing": true,
  "total_examples_target": 0,
  "ui_lang": "en",
  "project_info": ["dataset", "session", "lang", "recipe_name", "view_id", "label"],
  "show_stats": false,
  "hide_meta": false,
  "show_flag": false,
  "instructions": false,
  "swipe": false,
  "swipe_gestures": { "left": "accept", "right": "reject" },
  "split_sents_threshold": false,
  "html_template": false,
  "global_css": null,
  "javascript": null,
  "writing_dir": "ltr",
  "show_whitespace": false,
  "exclude_by": "task",
  "relations_span_labels": ["EU", "NATL", "INTL", "NOMME", "JO/OTHERS"]
}

magdaaniol · October 11, 2024, 8:49am

Hi @evince360 ,

There's nothing in the config that should affect how the data is rendered. Could you also share the output of the python -m prodigy stats command? Thank you.
Finally, I need to ask: are you 100% sure that investigate2.jsonl results in an incorrect view? It really does render correctly for me, at least with the current Prodigy version.

evince360 · October 11, 2024, 12:17pm

Hello,
I created a new database (prodigy db-in), this time with the --rehash switch, and launched the review project again via the Linux service. I made one more change before launching it again:

Previously it was this:
ExecStart=/opt/xxx/xxx/prodigy-env/bin/python3.8 -m prodigy review review_db ai_annotations --view-id relations --label rindirectdirect,rci-apres

Now (and this is what I sent you):
ExecStart=/opt/xxx/xxx/prodigy-env/bin/python3.8 -m prodigy review review_db ai_annotations --view-id relations --label rindirectdirect,rnomme

I don't know what solved the issue, but there are no more duplications.

Thank you very much.

magdaaniol · October 14, 2024, 1:11pm

Hi @evince360,

I'm not 100% sure what happened. I can only say that changing the label set would affect the computed task hash, but that has nothing to do with the tokenization, which was the problem here.

In any case, the output you sent me (investigate2.jsonl) definitely rendered correctly for me, so perhaps you weren't looking at the dataset with the corrected token ids? I'm not really sure. I'm glad it renders fine for you now.
Let us know if you come across any issues going forward.
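
If it helps to rule out a stale dataset next time, here is a rough sketch of a check you could run on an exported JSONL file (for example one written by prodigy db-out). It assumes each token's `id` should be an integer equal to its position in the `tokens` list, counting from 0. The function names are made up for this example and are not part of Prodigy.

```python
import json


def token_id_problems(task):
    """List human-readable problems with the token ids of one task."""
    problems = []
    for position, token in enumerate(task.get("tokens", [])):
        token_id = token.get("id")
        # bool is a subclass of int in Python, so exclude it explicitly.
        if isinstance(token_id, bool) or not isinstance(token_id, int):
            problems.append(
                f"token {position}: id {token_id!r} is not an integer"
            )
        elif token_id != position:
            problems.append(
                f"token {position}: id is {token_id}, expected {position}"
            )
    return problems


def check_jsonl(path):
    """Return (line_number, problem) pairs for every task in a JSONL file."""
    found = []
    with open(path, encoding="utf8") as handle:
        for line_number, line in enumerate(handle, start=1):
            if not line.strip():
                continue
            for problem in token_id_problems(json.loads(line)):
                found.append((line_number, problem))
    return found
```

An empty list from `check_jsonl` means every token id in the file is already a 0-based integer, so the dataset being reviewed is the corrected one.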
