"evolving" an annotation dataset by adding labels?

Just got Prodigy and am experimenting with creating NER annotations. At first I ran ner.manual with --label PER,LOC,ORG, but after completing that and training a simple model I decided I really wanted to have a label "EVT". I poked around to see if there's some way to take my original dataset and "evolve" it by adding "EVT", but I can't see a way. I tried things like rerunning ner.manual with all four labels on the original, but it says there are no tasks. I tried creating two datasets and running "db-merge" or "db-in" on them, but that didn't really seem to work (but then I'm still new to this).

I'm early enough into the process that I can just reannotate from the beginning, but I can foresee cases in the future where we might want to add annotations to an existing dataset, or even create a couple of datasets with a few related labels and then "merge" them together in various ways for specific application domains.

Guidance appreciated.

Hi Craig!

Here's an approach that might work. Let's say that I start with this dataset.

{"text": "hello my name is james"}
{"text": "hello my name is john"}
{"text": "hello my name is robert"}
{"text": "hello my name is michael"}
{"text": "hello my name is william"}
{"text": "hello my name is mary"}
{"text": "hello my name is david"}
{"text": "hello my name is richard"}
{"text": "hello my name is joseph"}

I'll annotate it via this recipe call.

python -m prodigy ner.manual issue-6870 blank:en examples.jsonl --label name

This allows me to highlight the name in each sentence and assign it the name label.

But now let's say that I annotate all of these examples and that I later decide to add another label. Then I could do this:

python -m prodigy ner.manual issue-6870 blank:en examples.jsonl --label name,greeting

But this will indeed tell you that there are no tasks available.

Prodigy internally uses a hashing mechanism to ensure that you don't annotate duplicate items. This is a great feature, but in this case you'd like to re-do the examples.
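
To make the hashing idea concrete, here is a minimal sketch in plain Python. It is not Prodigy's actual implementation: the `input_hash` and `filter_seen` helpers are hypothetical names, and hashing only the text with SHA-1 is an assumption made for illustration. It just shows why changing the labels does not bring the examples back, while an empty dataset does.

```python
import hashlib


def input_hash(task):
    """Hypothetical stand-in for a task hash: depends only on the raw text."""
    return hashlib.sha1(task["text"].encode("utf-8")).hexdigest()


def filter_seen(stream, seen_hashes):
    """Yield only the tasks whose hash is not already stored in the dataset."""
    for task in stream:
        if input_hash(task) not in seen_hashes:
            yield task


stream = [
    {"text": "hello my name is james"},
    {"text": "hello my name is john"},
]

# Pretend both examples were already annotated and saved to the dataset.
seen = {input_hash(task) for task in stream}

# Adding a label does not change the text, so the hashes are identical
# and nothing is left to annotate in the same dataset.
remaining_same_dataset = list(filter_seen(stream, seen))

# A brand-new dataset has no stored hashes, so every example is served again.
remaining_new_dataset = list(filter_seen(stream, set()))
```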

One method for this is to just restart. You'd refer to the annotations that already exist, but you'd create a new dataset.

python -m prodigy ner.manual issue-6870-new blank:en dataset:issue-6870 --label name,greeting 

This would work, because we're creating a new dataset called issue-6870-new which has zero annotations. Note that we're using the dataset:-prefix in dataset:issue-6870 to tell Prodigy that this is indeed a dataset that we're referring to, not a file on disk.

When you run this, the interface shows each example with its existing annotations already highlighted.

That means that you still have your old annotations around, but you can annotate the new label on top.
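
If you want to check afterwards that the new dataset really contains both labels, a small script along these lines could help. It is only a sketch: it assumes you exported the dataset to JSONL (for example with `python -m prodigy db-out issue-6870-new > annotations.jsonl`) and that each record has an "answer" field and a "spans" list whose entries carry a "label". The `count_labels` helper and the sample records are made up for illustration.

```python
import json
from collections import Counter


def count_labels(jsonl_lines):
    """Count span labels across accepted examples in exported JSONL lines."""
    counts = Counter()
    for line in jsonl_lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        task = json.loads(line)
        if task.get("answer") != "accept":
            continue  # ignore rejected or skipped examples
        for span in task.get("spans", []):
            counts[span["label"]] += 1
    return counts


# Made-up records in the assumed export format.
sample = [
    '{"text": "hello my name is james", "spans": ['
    '{"start": 0, "end": 5, "label": "greeting"}, '
    '{"start": 17, "end": 22, "label": "name"}], "answer": "accept"}',
    '{"text": "hello my name is john", "spans": ['
    '{"start": 17, "end": 21, "label": "name"}], "answer": "accept"}',
]

label_counts = count_labels(sample)
```

With a real export you would pass the open file instead of the sample list, e.g. `count_labels(open("annotations.jsonl", encoding="utf-8"))`.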

Let me know if this helps!

That helps! I had not seen that I could use dataset:X as the input to the labeling process. That will certainly come in handy as we need to add/change/delete labeling.

Thanks.
