overwriting annotations

Depending on your data, yes. The only thing to keep in mind is that you'll need a lot of manually annotated examples to pre-train your model effectively. To speed up the process, you could also try using ner.teach with patterns that describe potential entity candidates (see our video tutorial for an example). This often works better than doing everything from scratch, and you'll be able to collect more initial examples faster. You can always use the manual interface again later on to correct very specific entities, or to manually label examples that are difficult to express with patterns.
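
For example, if you were working on a (made-up) DRUG label, a patterns file is just JSONL with one match pattern per line, using either token patterns or exact strings:

{"label": "DRUG", "pattern": [{"lower": "aspirin"}]}
{"label": "DRUG", "pattern": "ibuprofen"}

You can then pass that file to ner.teach via --patterns (the dataset and file names here are just placeholders):

prodigy ner.teach drug_ner en_core_web_sm your_data.jsonl --label DRUG --patterns drug_patterns.jsonl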

If you want to re-annotate data you've already worked on, the best solution is to export your dataset and load it back in as the data source. ner.manual respects pre-defined entities, so when you load up the app, you'll see what you've already annotated and can add new labels or edit the existing ones (if you've made a mistake, etc.).

Prodigy's JSONL input and output formats are identical, so you can run db-out with the dataset name and use the exported JSONL file as the input source. For example:

prodigy db-out your_dataset > your_dataset.jsonl
prodigy ner.manual your_new_dataset en_core_web_sm your_dataset.jsonl --label SOME_LABEL

You do end up with two datasets this way – but this isn't necessarily bad, because it means you'll never overwrite any records and you can always go back to the previous dataset.

If you're working with multiple annotators, you probably want a separate dataset for each person. Every annotator can then work on the same text, and when they're done, you can export the JSONL data and reconcile it.
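
For example, assuming hypothetical per-annotator datasets called ner_alex, ner_sam and ner_kim, you'd export each one separately:

prodigy db-out ner_alex > annotator_alex.jsonl
prodigy db-out ner_sam > annotator_sam.jsonl
prodigy db-out ner_kim > annotator_kim.jsonl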

There's no easy answer for how to do this – reconciling annotations is tricky, and you'll have to decide how you want to handle mismatches and what constitutes an example you want to include in your training corpus. It will help a lot if you can break your annotation decisions down into binary decisions (for example, by asking the annotators about pattern matches instead of having them label everything by hand). Binary annotations are much easier to compare, and you can then select the labels based on an agreement threshold – for example, only include annotations that at least 80% of your annotators agree on.
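
Just to illustrate, here's a minimal sketch of what that threshold step could look like in Python. It's not a built-in Prodigy feature, just one possible approach – it assumes the three hypothetical exports from above, that the annotators saw identical texts, and that you only want spans at least 80% of them accepted:

import json
from collections import Counter

ANNOTATOR_FILES = ["annotator_alex.jsonl", "annotator_sam.jsonl", "annotator_kim.jsonl"]
THRESHOLD = 0.8  # fraction of annotators that need to agree on a span

def load_examples(path):
    # db-out exports one JSON object per line
    with open(path, encoding="utf8") as f:
        return [json.loads(line) for line in f if line.strip()]

span_counts = Counter()  # (text, start, end, label) -> number of annotators
texts = set()
for path in ANNOTATOR_FILES:
    for eg in load_examples(path):
        if eg.get("answer") == "reject":
            continue  # skip examples the annotator rejected
        texts.add(eg["text"])
        for span in eg.get("spans", []):
            span_counts[(eg["text"], span["start"], span["end"], span["label"])] += 1

# Keep only spans that meet the agreement threshold and write a merged dataset
min_votes = THRESHOLD * len(ANNOTATOR_FILES)
with open("reconciled.jsonl", "w", encoding="utf8") as out:
    for text in texts:
        spans = [
            {"start": start, "end": end, "label": label}
            for (t, start, end, label), count in span_counts.items()
            if t == text and count >= min_votes
        ]
        out.write(json.dumps({"text": text, "spans": spans}) + "\n")

The reconciled.jsonl file can then be loaded back in with db-in, or used as the input source for another annotation pass to resolve the remaining disagreements by hand.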

Btw, speaking of multiple annotators: We're currently working on an extension for Prodigy that lets you scale up your annotation projects, manage multiple annotators, reconcile their annotations and build larger datasets. We're hoping to accept the first beta testers this summer :blush:
