Annotation task format for the ner_manual interface

Hi! I would like to extract entities from a text and change the entity boundaries manually when visualizing them in Prodigy. Since I don’t need a model for the moment, I am using the mark recipe, but in order to change the boundaries manually, I would need to use the ner_manual interface. As far as I understand, my entities would need to have the following format:

{
  "text": "Hello Apple",
  "tokens": [
    {"text": "Hello", "start": 0, "end": 5, "id": 0},
    {"text": "Apple", "start": 6, "end": 11, "id": 1}
  ],
  "spans": [
    {"start": 6, "end": 11, "label": "ORG", "token_start": 1, "token_end": 1}
  ]
}

But I only have the spans and not the tokens, so I can only use the ner interface, which doesn’t allow me to change the entity boundaries manually. This is the format that I have:

{
  "text": "Apple updates its analytics service with new metrics",
  "spans": [
    {"start": 0, "end": 5, "label": "ORG"}
  ]
}

I would like to know how I can get the tokens for the entities and then be able to use the ner_manual interface. Or maybe there is another option for changing the entity boundaries manually that I could consider.

Thanks a lot!

Maria

Hi! What happens if you just load in your data the way it is? If no "tokens" are present, Prodigy will use the model’s tokenizer to generate them automatically for you :slightly_smiling_face:

The only thing that’s important is that your entity spans align with the token boundaries – for instance, “Apple” is fine, because that’ll be one token. But the span “Apple up” wouldn’t be, because “up” isn’t a standalone token. If the pre-annotated spans don’t align, Prodigy will raise an error and show you the mismatch, so you can fix it. It usually shouldn’t happen, though – unless your data is very noisy.
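To make that a bit more concrete, here’s a quick way to inspect how spaCy tokenizes your text and where the token boundaries are (just a sketch – I’m using a blank English pipeline here, but any model’s tokenizer works the same way):

import spacy

nlp = spacy.blank("en")  # only the tokenizer matters for this check
doc = nlp("Apple updates its analytics service with new metrics")

# print each token with its character offsets
print([(token.text, token.idx, token.idx + len(token.text)) for token in doc])
# [('Apple', 0, 5), ('updates', 6, 13), ('its', 14, 17), ...]

# A span with start=0, end=5 ("Apple") ends exactly on a token boundary.
# A span with start=0, end=8 ("Apple up") ends inside the token "updates",
# so it can't be mapped to whole tokens.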

Hi! In fact, that is the problem: my entity spans don’t align with the token boundaries (sorry I didn’t mention it before), and this is the error that I get if I use ner_manual:

ERROR: Invalid task format for view ID 'ner_manual'
'tokens' is a required property
'token_start' is a required property [spans -> 0]
'token_end' is a required property [spans -> 0]
'token_start' is a required property [spans -> 1]
'token_end' is a required property [spans -> 1]

Thanks!


Ah, okay, that makes sense then. Are you just using mark? I think I misread your initial question and thought you were using the built-in ner.manual recipe, which does take care of the tokenization automatically.

If you need your own custom tokens that align with your entity spans, then you also need to provide them. It might be worth writing a little script to check how many of the spans do not align – maybe it’s just one or two that you can easily correct manually (or exclude from your data).

An easy way to do this is to use spaCy’s Doc.char_span method, which creates a token span from character offsets. If the character offsets don’t align to the tokens, it returns None. So you can do something like this:

import spacy

nlp = spacy.load("en_core_web_sm")  # or other model

for example in examples:  # your existing examples
    doc = nlp(example["text"])
    for span in example["spans"]:
        char_span = doc.char_span(span["start"], span["end"])
        if char_span is None:  # start and end don't map to token boundaries
            print("Misaligned tokens", example["text"], span)

Thanks so much for your quick reply! I will try it out.


It works! I have excluded the spans that don’t align and provided the tokens for the ones that do, and I don’t get the error anymore. Thanks!

There is only one thing that I don’t understand. If I add a LABEL to the mark recipe like this:

prodigy mark name_dataset json_file --view-id ner_manual --label LABEL

The label that Prodigy shows me at the top says “NO_LABEL”, even though I specified it with --label LABEL.

If I use instead the ner.manual recipe like this:

prodigy ner.manual name_dataset name_model json_file --label LABEL

then the correct label appears.

Am I missing anything regarding the LABEL for the mark recipe?

Thanks again!

Yay, glad it worked! :tada:

Sorry that this is a little confusing. The --label setting in the mark recipe really only adds a top-level "label" key to the tasks. For manual annotation, you want to provide a "label set" for the entire session. This is usually done by adding a list of "labels" to the "config" returned by the recipe.

The mark recipe is pretty agnostic to what you're annotating. It just streams whatever comes in and renders it. So it currently doesn't have any special case rules that add the labels to the config if it's a manual recipe – that's really what the more task-specific recipes like ner.manual do.
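If you want to keep a mark-style workflow but with a proper label set, you could also wrap it in a tiny custom recipe that sets "labels" in the config. Roughly something like this – the recipe name mark-ner-manual and the label list are just placeholders, and the exact loader and decorator details may vary slightly depending on your Prodigy version:

import prodigy
from prodigy.components.loaders import JSONL

@prodigy.recipe(
    "mark-ner-manual",
    dataset=("Dataset to save annotations to", "positional", None, str),
    source=("Path to the JSONL input file", "positional", None, str),
)
def mark_ner_manual(dataset, source):
    stream = JSONL(source)  # load your pre-tokenized examples
    return {
        "dataset": dataset,              # dataset the annotations are saved to
        "stream": stream,                # incoming examples
        "view_id": "ner_manual",         # manual NER interface
        "config": {"labels": ["ORG"]},   # label set shown at the top of the UI
    }

You’d then run it with the -F flag pointing at the file that contains the recipe, e.g. prodigy mark-ner-manual name_dataset json_file -F recipe.py.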

(You can also take a look at the source of the recipe files included in Prodigy – or check out the slightly simplified versions with comments in our prodigy-recipes repo. There you can see what the different recipes do under the hood, and how they use the arguments you pass in on the command line.)

I will have a look at it. I am just getting started with Prodigy and am still exploring the different possibilities the tool can offer.

One thing is for sure: you are helping me a lot! It is much appreciated!
