Sub-label the existing labels

Hi, I used ner.manual to annotate a dataset from scratch with multiple labels. My question: is it possible to add new sub-labels to each label that's already annotated, using the existing annotated dataset? For example, I have Label1, Label2 and Label3, and I want to add sub-labels for each of them.
For Label1, that would be Label1_1, Label1_2 and Label1_3, and likewise for the other existing labels.

Thank you.

Hi! By sub-label, do you mean hierarchical categories? For example, if you have the label LOCATION, you'd annotate whether the entity is LOCATION_CITY or LOCATION_COUNTRY, etc.? If so, one workflow could be to stream in your examples again with one entity at a time, and add multiple-choice options for the sub-labels. Then, all the annotator has to focus on is a single mention and a subset of sub-labels, so it should be really quick to annotate (and easy to evaluate, in case there are conflicts and disagreements).

To implement this, you could use a custom interface with two blocks: ner (to render the entity) and choice (for the options). The stream could look something like this:

options = [{"id": "LOCATION_CITY", "text": "LOCATION > CITY"}]  # etc.

def get_stream(stream):
    for eg in stream:
        for span in eg.get("spans", []):  # one task per annotated span
            yield {"text": eg["text"], "spans": [span], "options": options}
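
If you want each task to show only the sub-labels of the span's parent label, the options could be looked up per label instead of being one global list. Here's a minimal sketch of that idea. The SUB_LABELS mapping, its label names and the make_options helper are hypothetical placeholders, and the code works on plain dicts, so it runs without Prodigy:

```python
# Hypothetical mapping from parent label to its sub-labels (placeholders).
SUB_LABELS = {
    "LOCATION": ["CITY", "COUNTRY"],
    "PERSON": ["ACTOR", "POLITICIAN"],
}


def make_options(label):
    """Build the choice options for one parent label."""
    return [
        {"id": f"{label}_{sub}", "text": f"{label} > {sub}"}
        for sub in SUB_LABELS.get(label, [])
    ]


def get_stream(stream):
    """Yield one task per annotated span, with only that label's sub-labels."""
    for eg in stream:
        for span in eg.get("spans", []):
            options = make_options(span["label"])
            if not options:  # no sub-labels defined for this label: skip it
                continue
            # Copy the span so the original example is not shared or modified.
            yield {"text": eg["text"], "spans": [dict(span)], "options": options}


example = {
    "text": "Berlin is in Germany",
    "spans": [
        {"start": 0, "end": 6, "label": "LOCATION"},
        {"start": 13, "end": 20, "label": "LOCATION"},
    ],
}
tasks = list(get_stream([example]))
for task in tasks:
    print(task["spans"][0], [opt["id"] for opt in task["options"]])
```

Each of the two LOCATION spans becomes its own task, and both only offer the LOCATION sub-labels.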

And then your blocks could look like this:

blocks = [
    {"view_id": "ner"},  # renders the single highlighted entity
    {"view_id": "choice", "text": None},  # prevent text from being shown in both UIs
]

Thank you for the reply. Is it possible to train on the new dataset and improve the model after this step? And how do I merge the datasets for each label into one?

Prodigy should be able to do this automatically when you train or run data-to-spacy, since all your examples have the same text, but different spans. When the data is merged, all annotations on the same text are merged into a single example.
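
To make the idea concrete, here's a rough illustration of what "merging annotations on the same text" means. This is only a sketch on plain dicts: the merge_by_text function is a hypothetical name, and it is not how Prodigy implements the merge internally.

```python
def merge_by_text(examples):
    """Combine accepted examples that share the same text into one example."""
    merged = {}
    for eg in examples:
        if eg.get("answer", "accept") != "accept":
            continue  # only accepted annotations contribute spans
        entry = merged.setdefault(eg["text"], {"text": eg["text"], "spans": []})
        seen = {(s["start"], s["end"], s["label"]) for s in entry["spans"]}
        for span in eg.get("spans", []):
            key = (span["start"], span["end"], span["label"])
            if key not in seen:  # avoid duplicating identical spans
                entry["spans"].append(dict(span))
                seen.add(key)
    for entry in merged.values():
        entry["spans"].sort(key=lambda s: s["start"])
    return list(merged.values())


text = "Berlin is in Germany"
annotations = [
    {"text": text, "spans": [{"start": 13, "end": 20, "label": "LOCATION_COUNTRY"}], "answer": "accept"},
    {"text": text, "spans": [{"start": 0, "end": 6, "label": "LOCATION_CITY"}], "answer": "accept"},
]
merged = merge_by_text(annotations)
print(merged)
```

The two single-span examples end up as one example with both spans, sorted by start offset.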

Just make sure you use a new dataset for the sub-labels so there's no conflict (like, a span annotated with both LOCATION and CITY). Each token can only be part of one span.
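
One way to get there is to convert the answered choice tasks into regular NER examples before storing them in the new dataset. Here's a minimal sketch, assuming the tasks have the shape produced by the stream above, with the selected option IDs under "accept" and the decision under "answer". The to_sub_labeled function is a hypothetical helper, not part of Prodigy:

```python
def to_sub_labeled(annotated):
    """Turn accepted choice tasks into NER examples that carry the sub-label."""
    for eg in annotated:
        selected = eg.get("accept", [])
        if eg.get("answer") != "accept" or len(selected) != 1:
            continue  # skip rejected/ignored tasks and ambiguous selections
        span = dict(eg["spans"][0])  # copy so the stored task stays unchanged
        span["label"] = selected[0]  # replace LOCATION with e.g. LOCATION_CITY
        yield {"text": eg["text"], "spans": [span], "answer": "accept"}


text = "Berlin is in Germany"
annotated = [
    {
        "text": text,
        "spans": [{"start": 0, "end": 6, "label": "LOCATION"}],
        "accept": ["LOCATION_CITY"],
        "answer": "accept",
    },
    {
        "text": text,
        "spans": [{"start": 13, "end": 20, "label": "LOCATION"}],
        "accept": [],
        "answer": "ignore",
    },
]
converted = list(to_sub_labeled(annotated))
print(converted)
```

Only the accepted task with exactly one selected option is kept, so every span in the new dataset has a single, unambiguous sub-label.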

In this case, you probably want to retrain from scratch. Otherwise, you're trying to teach your model a completely new definition of what it previously predicted, which likely won't be very effective.