Overwriting annotations

Hi,

To start off my annotations, I plan to use ner.manual and then move on to ner.teach. Is this a feasible approach?

Also, if I start annotating and then decide to add an extra label, restarting the procedure takes me back to the beginning of the dataset. Do I need to relabel the entities I previously annotated, and if I don’t re-annotate them, do they now become ‘negatives’? I would like to start from the beginning again to utilise my full training set, but I don’t want to lose the labels I previously made.

I have a similar question about using multiple annotators: if one annotator labels a word and another doesn’t, how does Prodigy handle this? And is it sensible to annotate the same piece of text multiple times?

Thanks.

Depending on your data, yes. The only thing that's important to keep in mind is that you'll need a lot of manually annotated examples to pre-train your model effectively. To speed up the process, you could also try using ner.teach with patterns that describe potential entity candidates (see our video tutorial for an example). This often works better than doing everything from scratch, and you'll be able to collect more initial examples faster. You can also always use the manual interface again later on, to correct very specific entities or to manually label examples that are difficult to express with patterns.
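To give a rough idea, here's a minimal sketch of what writing such a patterns file could look like, using srsly. The label, token patterns and file names are just invented placeholders, not part of your project:

import srsly

# Each line of the patterns file is one match pattern; the label and the
# token patterns below are placeholders for illustration.
patterns = [
    {"label": "COMPANY", "pattern": [{"lower": "acme"}, {"lower": "corp"}]},
    {"label": "COMPANY", "pattern": "Initech"},  # exact string match
]
srsly.write_jsonl("patterns.jsonl", patterns)

# The file can then be passed to the recipe, e.g.:
# prodigy ner.teach your_dataset en_core_web_sm your_data.jsonl --label COMPANY --patterns patterns.jsonl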

If you want to re-annotate data you've already worked on, the best solution is to export your dataset and load it back in as the data source. ner.manual respects pre-defined entities, so when you load up the app, you'll see what you've already annotated and can add new labels or edit the existing ones (if you've made a mistake etc.).

Prodigy's JSONL input and output formats are identical, so you can run db-out with the dataset name and use the exported JSONL file as the input source. For example:

prodigy db-out your_dataset > your_dataset.jsonl
prodigy ner.manual your_new_dataset en_core_web_sm your_dataset.jsonl --label SOME_LABEL
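
If you're curious what's in the export: each record is a regular Prodigy task with a "text" and a "spans" property holding character offsets and labels. A quick sketch for inspecting them in Python (the file name matches the command above, everything else is illustrative):

import srsly

# Print the previously annotated entities from the exported dataset.
for eg in srsly.read_jsonl("your_dataset.jsonl"):
    for span in eg.get("spans", []):
        print(eg["text"][span["start"]:span["end"]], span["label"])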

You do end up with two datasets this way – but this isn't necessarily bad, because it means you'll never overwrite any records and you can always go back to the previous dataset.

If you're working with multiple annotators, you probably want a separate dataset for each person. Every annotator can then work on the same text, and when they're done, you can export the JSONL data and reconcile it.

There's no easy answer for how you do this – reconciling annotations is tricky, and you'll have to decide how you want to handle mismatches and what constitutes an example you want to include in your training corpus. It will help a lot if you can break your annotation decisions down into binary decisions (for example, by asking the annotators about pattern matches instead of having them label everything by hand). Binary annotations are much easier to compare, and you could then select the labels based on an agreement threshold – for example, only include annotations that at least 80% of your annotators agree on.
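As a very rough illustration of that agreement-threshold idea for binary annotations, here's a sketch that merges the datasets exported by several annotators and keeps only the tasks that enough of them accepted. The file names and the 80% threshold are placeholders:

from collections import defaultdict
import srsly

# One exported JSONL file per annotator (placeholder names).
files = ["annotator_a.jsonl", "annotator_b.jsonl", "annotator_c.jsonl"]
threshold = 0.8  # keep tasks at least 80% of annotators accepted

votes = defaultdict(list)  # task hash -> list of answers
tasks = {}                 # task hash -> one copy of the task

for path in files:
    for eg in srsly.read_jsonl(path):
        key = eg.get("_task_hash", eg["text"])  # fall back to the text if no hash
        votes[key].append(eg.get("answer"))
        tasks[key] = eg

merged = [tasks[key] for key, answers in votes.items()
          if answers.count("accept") / len(answers) >= threshold]

srsly.write_jsonl("merged_dataset.jsonl", merged)
# ...and import it with: prodigy db-in merged_dataset merged_dataset.jsonl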

Btw, speaking of multiple annotators: We're currently working on an extension for Prodigy that lets you scale up your annotation projects, manage multiple annotators, reconcile their annotations and build larger datasets. We're hoping to accept the first beta testers this summer :blush:


Thank you for the detailed response, Ines – it was really helpful. I think multi-annotator support is something that will almost always be required for our projects, so I look forward to hearing how the testing goes this summer.