Change some annotations in an existing dataset


I already have a dataset with NER annotations, which I imported using the db-in command.
Now I want to change some of these annotations. I do this by exporting the original dataset with the db-out command, filtering for the examples I want to reannotate and feeding them into the ner.manual recipe.
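For context, my filtering step looks roughly like this – `needs_reannotation` and the ORG label are just placeholders for whatever criterion actually identifies the examples to revise:

```python
import json

def needs_reannotation(eg):
    # Placeholder criterion: revise examples that contain an ORG span
    return any(span.get("label") == "ORG" for span in eg.get("spans", []))

def split_examples(lines):
    # Split JSONL lines (as exported by db-out) into (keep, reannotate)
    keep, redo = [], []
    for line in lines:
        eg = json.loads(line)
        (redo if needs_reannotation(eg) else keep).append(eg)
    return keep, redo
```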
The resulting examples have the correct (revised) spans but the same _task_hash as the original examples. I fixed this by including the following in the ner.manual recipe:

    from prodigy import set_hashes

    def before_db(answers):
        # Strip the single-character tokens added for character highlighting
        if highlight_chars:
            answers = remove_tokens(answers)
        # Recompute the hashes so revised examples get a new _task_hash
        return [set_hashes(eg, overwrite=True) for eg in answers]

But what is the best way to combine these new annotations with the existing dataset? If I use the review recipe, it shows all of the examples, but I'd like to review only those whose annotations I changed. Also, I don't really understand the review interface: what does accept/reject mean in this context, and how do I create a "regular" dataset from the review dataset?
And db-merge would only concatenate the two datasets, correct?

So is this really the best workflow for this problem or is there something more straightforward? And what is the best way to do this merge? Do I need to customize the review recipe?


Hi! The idea of the review interface and workflow is basically to let you walk through all annotations again and create a "master annotation session" where you have the final say on every example (while being able to view all other answers – e.g. the different versions created in the manual interfaces, or the accept/reject decisions made by different annotators). The data you create will have the same format as a regular Prodigy task, so you can export and train from it. It will also preserve the versions, so you can always reconstruct what led to the decision.

The accept/reject decision works just like it does in the regular annotation interfaces – for binary decisions, it's the answer; for manual decisions, you can use "reject" to mark examples that are otherwise wrong (tokenization, broken markup etc.).

One thing that maybe makes your workflow a bit special is that you already know what you want to re-annotate (which isn't always the case). This also means that you do need one extra process to filter out the examples you want to use, so Prodigy knows what to queue up.

A straightforward workflow for this could be:

  • Run your script and create two segments of the data: one to reannotate, one to keep.
  • Create a new dataset and import the annotations you want to keep.
  • Start up the server with the recipe you want to use (e.g. review if you want to review multiple annotations on the same data, or a recipe like ner.manual if you just want to stream in the data again). Save the annotations you collect to the previously created dataset.