Combine NER and doc classification in annotation process

Hello,

I am going to annotate approximately 10 000 documents, where I should select NER and labels of the document by several options. I understand, that I can split this task into two for NER and txt classification, but in this case, I need to read the document twice (it take some time to understand the content of this document ). So, the main question is it possible to annotate the document for NER and text classification at once (as heartext does it see example bellow ) ?

Best regards,
Dmytro

Hi! That's definitely possible using a combined interface with a ner_manual and a choice block. See here for an example: https://prodi.gy/docs/custom-interfaces#blocks This one actually shows exactly what you're looking for (with an added free-form input in this case).

If your goal is to just label all individual examples as they come in, doing NER and text classification annotations jointly definitely make sense. If you want to use a more semi-automated approach and are looking to pre-select the best examples to annotate, it would probably make sense to separate the two tasks because the optimal examples you may want to select for text classification vs. NER are likely different. So you may want to train both components on different subsets of the data selected based on what works best for you model.

Hello @ines,

For a given use case, I would really need to do both ner.correct and textcat.manual but I'm not able to make it work. Given complexity of document, I would rather prefer that experts only have 1 iteration across documents.

prodigy ner.correctwithtextclassificationmanual news_headlines_demo_um en_core_web_lg ./input_prodigy/demo_um/news_headlines.jsonl --label PERSON,ORG,PRODUCT,LOCATION --labeltextcat OK,NOT_OK -F ner_correctwithtextclassificationmanual.py

By the way, my text classifcation task is mutually exclusive. I don't where I can define it in my command line and in my custom script. Can you give some some guidance there?

It start as expected:

But struggle to display the task in the browser:

I have carefully read documentation about custom recipes and 'blocks' but I'm not able to make it work. I have tried multiple locations where I put label and labeltextcat but nothing works as I would like
I need 2 group of labels.

  • label: for all the entities for my NER task (1st block)
  • labeltextcat: categories of my textcat task (2nd block)

But I struggle where/how to use them in my custom function. Here you can find my python code regarding the custom recipe: do you have an idea what I need to change in my code?


import prodigy
from prodigy.components.loaders import JSONL
from prodigy.components.preprocess import add_tokens
from prodigy.util import split_string, set_hashes
from prodigy.util import get_labels
import spacy
import copy
from typing import List, Optional


def make_tasks(nlp, stream, labels):
    """Add a 'spans' key to each example, with predicted entities."""
    # Process the stream using spaCy's nlp.pipe, which yields doc objects.
    # If as_tuples=True is set, you can pass in (text, context) tuples.
    texts = ((eg["text"], eg) for eg in stream)
    for doc, eg in nlp.pipe(texts, as_tuples=True):
        task = copy.deepcopy(eg)
        spans = []
        for ent in doc.ents:
            # Continue if predicted entity is not selected in labels
            if labels and ent.label_ not in labels:
                continue
            # Create span dict for the predicted entitiy
            spans.append(
                {
                    "token_start": ent.start,
                    "token_end": ent.end - 1,
                    "start": ent.start_char,
                    "end": ent.end_char,
                    "text": ent.text,
                    "label": ent.label_,
                }
            )
        task["spans"] = spans
        # Rehash the newly created task so that hashes reflect added data
        task = set_hashes(task)
        yield task


# Recipe decorator with argument annotations: (description, argument type,
# shortcut, type / converter function called on value before it's passed to
# the function). Descriptions are also shown when typing --help.
@prodigy.recipe(
    "ner.correctwithtextclassificationmanual",
    dataset=("The dataset to use", "positional", None, str),
    spacy_model=("The base model", "positional", None, str),
    source=("The source data as a JSONL file", "positional", None, str),
    label=("One or more comma-separated labels", "option", "l", get_labels),
    labeltextcat=("One or more comma-separated labels for text classficiation", "option", "ltextcat", get_labels),
    exclude=("Names of datasets to exclude", "option", "e", split_string),
)
def ner_correctwithtextclassificationmanual(
    dataset: str,
    spacy_model: str,
    source: str,
    label: Optional[List[str]] = None,
    labeltextcat: Optional[List[str]] = None,
    exclude: Optional[List[str]] = None,
):
    """
    Create gold-standard data by correcting a model's predictions manually.
    """
    # Load the spaCy model
    nlp = spacy.load(spacy_model)

    # Load the stream from a JSONL file and return a generator that yields a
    # dictionary for each example in the data.
    stream = JSONL(source)

    # Tokenize the incoming examples and add a "tokens" property to each
    # example. Also handles pre-defined selected spans. Tokenization allows
    # faster highlighting, because the selection can "snap" to token boundaries.
    stream = add_tokens(nlp, stream)

    # Add the entities predicted by the model to the tasks in the stream
    stream = make_tasks(nlp, stream, label)

    return {
        "view_id": "blocks",  # Annotation interface to use
        "dataset": dataset,  # Name of dataset to save annotations
        "stream": stream,  # Incoming stream of examples
        "exclude": exclude,  # List of dataset names to exclude
        "config": {  # Additional config settings, mostly for app UI
            "lang": nlp.lang,
            "labels": label,  
            "blocks": [
                {"view_id": "ner_manual"},
                {"view_id": "choice"}
            ]
        }
    }

Thanks a lot already for your time and patience :slight_smile:
Regards,
Cédric

It looks like you're not adding any "options" to use for the choice interface? So in addition to adding your label labels, I think you also want to add your labeltextcat labels to each example in the stream as "options". Here's an example:

You can also find more details on what's expected for the choice UI here: https://prodi.gy/docs/api-interfaces#choice

1 Like

Thanks a lot! It works fine.

Given we can have quite some long texts, having 2 blocks (1 ner_manual+ 1 choice) implies that my "text" tag is displayed twice on the page (see below).

Is there a way to display the content of "text" only 1 for ner_manual
Now I have this in my return of function but I don't see where/how i can specify a different stream (1 with 'text' tag, the other one without 'text' tag):

    return {
        "view_id": "blocks",  # Annotation interface to use
        "dataset": dataset,  # Name of dataset to save annotations
        "stream": stream,  # Incoming stream of examples
        "exclude": exclude,  # List of dataset names to exclude
        "config": {  # Additional config settings, mostly for app UI
            "lang": nlp.lang,
            "labels": label,  
            "blocks": [
                {"view_id": "ner_manual"},
                {"view_id": "choice"}
            ]
        }
    }

Thank you.
Cédric

Yes, you can overwrite certain task properties in the blocks, e.g. set "text": None in the choice block to not make it display the text. See here for an example: https://prodi.gy/docs/custom-interfaces#blocks