Renaming labels in NER

Hi,

I annotated a large number of documents with very granular labels. Now I want to merge some of those granular labels into their "parent" label. Is there a way to edit label names in Prodigy? I thought of a workaround (described below), but wanted to know whether it can be done within Prodigy itself.

Method (Prodigy -> Python -> Prodigy): Export the dataset to a JSONL file, edit the "spans" in that file with Python, then import the edited JSONL file into Prodigy as a new dataset.

Any help or recommendation is welcomed! Thanks!

Hi! That sounds like a reasonable approach to me :slightly_smiling_face: Datasets in Prodigy are append-only by design, so you'd probably want to create a new dataset for your edited annotations. And if it turns out you want to go back to the previous state, you still have the old dataset available.

(If you're a jq wizard, you could probably write a command-line one-liner to do the data transformation and then pipe the result forward to a new dataset, all in one step. But I couldn't tell you how :sweat_smile:)


@ines there might be a bug in this process.

Here was my workflow:

  1. prodigy db-out dataset_granular_labels > granular_labels.jsonl
  2. In python, change "label" value to the less granular value for all annotations under "spans".
  3. In python, save it as less_granular.jsonl
    • Checked whether less_granular.jsonl contained the changed labels and it does.
  4. prodigy db-in dataset_lessgranular less_granular.jsonl
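
Steps 2 and 3 could look roughly like the sketch below. It is only an illustration: the label names in LABEL_MAP and the function names are made up for this example, and it deliberately touches only the top-level "spans" of each task, exactly as described in step 2.

```python
import json

# Hypothetical mapping from granular labels to their parent label.
LABEL_MAP = {"CITY": "LOCATION", "COUNTRY": "LOCATION"}


def remap_span(span, label_map):
    """Return a copy of a span with its label replaced if it is in the map."""
    if "label" in span:
        return {**span, "label": label_map.get(span["label"], span["label"])}
    return dict(span)


def merge_labels(task, label_map):
    """Return a copy of one task with its top-level span labels remapped."""
    updated = dict(task)
    if "spans" in task:
        updated["spans"] = [remap_span(span, label_map) for span in task["spans"]]
    return updated


def convert_file(src_path, dst_path, label_map):
    """Read a db-out JSONL file and write a remapped copy, one task per line."""
    with open(src_path, encoding="utf8") as src, \
            open(dst_path, "w", encoding="utf8") as dst:
        for line in src:
            if line.strip():  # skip blank lines
                task = merge_labels(json.loads(line), label_map)
                dst.write(json.dumps(task) + "\n")
```

Calling convert_file("granular_labels.jsonl", "less_granular.jsonl", LABEL_MAP) would then produce the file used in step 4.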

However, when I run prodigy review dataset_lessgranular_reviewed dataset_lessgranular --label labels_less_granular.txt, the less granular labels (the ones I changed them to) appear at the top in the label selection area, but the existing annotations (highlighted in yellow in the text) still show the old labels.

Any recommendation would be appreciated!

============================== ✨  Prodigy Stats ==============================

Version          1.9.9
Platform         Windows-10-10.0.18362-SP0
Python Version   3.6.2
Database Name    SQLite
Database Id      sqlite

Hmm, there's very little magic going on here and the review recipe should just show you whatever is in that dataset :thinking: When you look at what's in your dataset_lessgranular dataset (e.g. using db-out), which labels do you see there and how many examples does it contain? Maybe it somehow ended up with a copy of the previous unedited data?

  • less_granular.jsonl: has the edited (less granular) labels. This is the altered db-out via python script.
  • After importing less_granular.jsonl with db-in, I exported it again with db-out as less_granular_prodigy.jsonl. I see the less granular version of the labels here too (under the main "spans"). However, under "versions", I see other "spans" that contain the granular labels. To visualize for one document:
{"text": "...",
 ...
 "tokens": [...],
 "spans": [ LESS GRANULAR LABELS SEEN HERE ],
 "versions": [
        {"text": "...",
         ...
         "tokens": [...],
         "spans": [ GRANULAR LABELS SEEN HERE ],
         "versions": [
                {"text": "...",
                 ...
                 "tokens": [...],
                 "spans": [ GRANULAR LABELS SEEN HERE TOO ]
                }
         ]
        }
 ]
}
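
Given that structure, one thing to try is remapping the labels at every nesting level, not just in the main "spans". The sketch below is a guess at how that could be done: the label names and function names are hypothetical, and nothing in this thread confirms that the nested "versions" are what the review interface actually displays.

```python
import json

# Hypothetical mapping from granular labels to their parent label.
LABEL_MAP = {"CITY": "LOCATION", "COUNTRY": "LOCATION"}


def remap_labels_deep(task, label_map):
    """Return a copy of a task with span labels remapped at every level,
    including the tasks nested under "versions"."""
    updated = dict(task)
    if "spans" in task:
        updated["spans"] = [
            {**span, "label": label_map.get(span["label"], span["label"])}
            if "label" in span else dict(span)
            for span in task["spans"]
        ]
    if "versions" in task:
        updated["versions"] = [
            remap_labels_deep(version, label_map) for version in task["versions"]
        ]
    return updated


def collect_labels(task):
    """Collect every span label found in a task and its nested versions."""
    labels = {span["label"] for span in task.get("spans", []) if "label" in span}
    for version in task.get("versions", []):
        labels |= collect_labels(version)
    return labels


def convert_file(src_path, dst_path, label_map):
    """Rewrite a db-out JSONL file and return any old labels still present."""
    leftover = set()
    with open(src_path, encoding="utf8") as src, \
            open(dst_path, "w", encoding="utf8") as dst:
        for line in src:
            if line.strip():  # skip blank lines
                task = remap_labels_deep(json.loads(line), label_map)
                leftover |= collect_labels(task) & set(label_map)
                dst.write(json.dumps(task) + "\n")
    return leftover
```

If convert_file returns an empty set, none of the old labels remain under "spans" at any depth of the output file, which makes it easy to rule out leftover granular labels before importing with db-in again.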

Late to the party, but I've successfully changed labels just using sed.

sed 's,"label":"OLD_LABEL","label":"NEW_LABEL",g' old_task.jsonl > new_task.jsonl