"evolving" an annotation dataset by adding labels?

Craig · October 28, 2023, 10:02pm

Just got Prodigy and am experimenting with creating NER annotations. At first I ran ner.manual with --label PER,LOC,ORG, but after completing that and training a simple model I decided I really wanted an "EVT" label as well. I poked around to see if there's some way to take my original dataset and "evolve" it by adding "EVT", but I can't see one. I tried things like rerunning ner.manual with all four labels on the original, but it says there are no tasks. I tried creating two datasets and running "db-merge" or "db-in" on them, but that didn't really seem to work (but then I'm still new to this).

I'm early enough in the process that I can just reannotate from the beginning, but I can foresee cases in the future where we might want to add annotations to an existing dataset, or even create a couple of datasets with a few related labels and then "merge" them together in various ways for specific application domains.

Guidance appreciated.

koaning · October 30, 2023, 10:01am

Hi Craig!

Here's an approach that might work. Let's say that I start with this dataset.

{"text": "hello my name is james"}
{"text": "hello my name is john"}
{"text": "hello my name is robert"}
{"text": "hello my name is michael"}
{"text": "hello my name is william"}
{"text": "hello my name is mary"}
{"text": "hello my name is david"}
{"text": "hello my name is richard"}
{"text": "hello my name is joseph"}

I'll annotate it via this recipe call.

python -m prodigy ner.manual issue-6870 blank:en examples.jsonl --label name

This allows me to make annotations like this one.

But now let's say that I annotate all of these examples and later decide to add another label. Then I could try this:

python -m prodigy ner.manual issue-6870 blank:en examples.jsonl --label name,greeting

But this will indeed tell you that there are no tasks available.

Prodigy internally uses a hashing mechanism to ensure that you don't annotate duplicate items. This is a great feature, but in this case you'd like to re-do the examples.
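
As a rough illustration of the idea (a simplified sketch, not Prodigy's actual implementation; the `toy_task_hash` and `filter_seen` helpers are made up for this example): if the hash is derived from the example's content and not from the labels passed on the command line, the same text gets the same hash on the second run and is filtered out.

```python
import hashlib
import json


def toy_task_hash(example):
    """Hypothetical stand-in for a task hash: depends only on the text."""
    payload = json.dumps({"text": example["text"]}, sort_keys=True)
    return hashlib.sha1(payload.encode("utf-8")).hexdigest()


def filter_seen(stream, seen_hashes):
    """Yield only the examples whose hash is not already in the dataset."""
    for example in stream:
        if toy_task_hash(example) not in seen_hashes:
            yield example


examples = [
    {"text": "hello my name is james"},
    {"text": "hello my name is john"},
]

# First run: nothing has been annotated yet, so every example is queued.
seen = set()
first_run = list(filter_seen(examples, seen))
seen.update(toy_task_hash(eg) for eg in first_run)

# Second run with an extra label: the text is unchanged, so the hashes
# match and nothing is left to annotate.
second_run = list(filter_seen(examples, seen))
print(len(first_run), len(second_run))
```

Writing to a fresh dataset sidesteps this because the set of already-seen hashes starts out empty again.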

One method for this is to just restart. You'd refer to the annotations that already exist, but you'd create a new dataset.

python -m prodigy ner.manual issue-6870-new blank:en dataset:issue-6870 --label name,greeting

This would work, because we're creating a new dataset called issue-6870-new which has zero annotations. Note that we're using the dataset:-prefix in dataset:issue-6870 to tell Prodigy that this is indeed a dataset that we're referring to, not a file on disk.

When you run this, the interface shows each example with its existing annotations already highlighted.

That means that you still have your old annotations around, and you can annotate the new label on top.
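
If you want to double-check what ended up in the new dataset, one option is to export it with db-out and count the span labels. Here's a minimal sketch, assuming the export is JSONL where each record has an "answer" field and an optional "spans" list whose items carry a "label" key; the `count_labels` helper and the sample records are hypothetical.

```python
import json
from collections import Counter


def count_labels(lines):
    """Count span labels across accepted records in an iterable of JSONL lines."""
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        # Skip rejected or ignored examples.
        if record.get("answer") != "accept":
            continue
        for span in record.get("spans", []):
            counts[span["label"]] += 1
    return counts


# Made-up records in the assumed export format.
sample = [
    '{"text": "hello my name is james", "spans": [{"start": 0, "end": 5, "label": "greeting"}, {"start": 17, "end": 22, "label": "name"}], "answer": "accept"}',
    '{"text": "hello my name is john", "spans": [{"start": 17, "end": 21, "label": "name"}], "answer": "accept"}',
    '{"text": "hello my name is mary", "spans": [], "answer": "ignore"}',
]

label_counts = count_labels(sample)
print(dict(label_counts))
```

With a real export you'd pass an open file handle instead of the sample list, for example after running `python -m prodigy db-out issue-6870-new > issue-6870-new.jsonl`.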

Let me know if this helps!

Craig · October 30, 2023, 9:04pm

That helps! I had not seen that I could use dataset:X as the input to the labeling process. That will certainly come in handy as we need to add/change/delete labeling.

Thanks.
