✨ Demo: fully manual NER annotation interface

I hope it works🤞 Sorry if this is still a little hard at the moment – as I said, that part is experimental. We're currently working on adding the required functions for this to the Prodigy core library. There'll be a helper that reconstructs the span-token indices, and the built-in NER recipes will also make sure to add them to the spans by default.

This means we could also add a --manual or --edit flag to the active-learning-powered recipes like ner.teach. So you could just do ner.teach dataset en_core_web_sm my_data.jsonl --manual and it'd show you the recognised spans, but make the task editable :tada:

Yes, this makes sense! I haven't tested it, but it sounds reasonable. Btw, here's a simple split_tokens function you can use to add the "tokens" key yourself:

def split_tokens(nlp, stream):
    for eg in stream:
        doc = nlp(eg['text'])
        eg['tokens'] = [{'text': token.text, 'start': token.idx,
                         'end': token.idx + len(token.text), 'id': i}
                        for i, token in enumerate(doc)]
        yield eg

Now your pre-annotated spans only need a startIdx and endIdx (the naming is bad, this will be changed in the next update!). So a span describing token 0 to token 4 will need "startIdx": 0, "endIdx": 5.
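
For example, here's a minimal sketch of what that could look like, assuming en_core_web_sm and the split_tokens helper above (the text and label are made up):

import spacy

nlp = spacy.load('en_core_web_sm')
stream = split_tokens(nlp, [{'text': 'I like Apple products'}])
eg = next(stream)
# "Apple" is token 2, so the span gets "startIdx": 2 and "endIdx": 3
# (exclusive end), plus the usual character offsets
eg['spans'] = [{'start': 7, 'end': 12, 'startIdx': 2, 'endIdx': 3,
                'label': 'ORG'}]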

@ines awesome thanks. Can’t wait to try it out

Good news btw – just got it working locally :tada: :tada: So we’ll definitely ship this with the next release. Including the --manual flag on ner.teach!

Edit: Or, probably a better solution to not interfere with the active learning component and prefer_uncertain on ner.teach: This workflow could replace the current ner.make-gold. So instead of seeing the most uncertain prediction, you see the best parse and can correct it. (Ha, you’re really getting live insights into the Prodigy development process here, haha.)

Yes! Enabling a zero-click interaction context (all keyboard) would save my liiiiiife :smiley:

Also character-level selection, though that can certainly be a mode.

Do you plan to enable relation annotation at any point?

FWIW, the documentation doesn't seem to indicate that this feature isn't supported yet. I tried -v ner_manual with the mark recipe, and it looked like it was working wonderfully until I hit the dreaded “no tasks” screen. :scream:

Ah yeah, it's kinda lumped in with the ner interface at the moment – I was thinking about writing a simple demo script that mimics the highlighting behaviour, though, to showcase it better (similar to the demo posted in the first post of this thread).

The problem with using mark and the ner_manual interface is that Prodigy needs a model or at least a tokenizer to split the text into tokens. So you'll either need to feed in tasks that already have a "tokens" property set (see the example data I posted above), or just use ner.manual instead.
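
For instance, you could pre-tokenize your data once and feed the new file to mark. A rough sketch, assuming the split_tokens helper from earlier in the thread (the file names are made up):

import json
import spacy

nlp = spacy.load('en_core_web_sm')

with open('my_data.jsonl') as f, open('my_data_tokenized.jsonl', 'w') as out:
    stream = (json.loads(line) for line in f)
    for eg in split_tokens(nlp, stream):  # adds the "tokens" key
        out.write(json.dumps(eg) + '\n')

You should then be able to point the mark recipe at my_data_tokenized.jsonl with -v ner_manual.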

Yesss, we'd love to make this happen – but it's pretty difficult to get right. And if we do it, we want it to be actually good and useful. The "boundaries" interface sort of went in that direction, but it came with all kinds of other problems. But we'll keep experimenting.

Either that, or you could add your own tokenization rules, for example, if you need to handle certain characters or punctuation differently. It might take you 20 minutes to write a few regular expressions and add them to spaCy's tokenizer – but that's still a lot more efficient overall than adding 5 more seconds to each individual annotation decision.
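
For example, something like this (only a sketch – the slash rule below is made up, so adapt it to whatever characters or punctuation you need to handle):

import spacy
from spacy.symbols import ORTH
from spacy.util import compile_infix_regex

nlp = spacy.load('en_core_web_sm')

# made-up rule: also split tokens on forward slashes, so "skin/face"
# becomes three tokens instead of one
infixes = tuple(nlp.Defaults.infixes) + (r'/',)
nlp.tokenizer.infix_finditer = compile_infix_regex(infixes).finditer

# special-case rule for a string the default rules don't split the way you want
nlp.tokenizer.add_special_case('w/o', [{ORTH: 'w/'}, {ORTH: 'o'}])

print([t.text for t in nlp('burn of skin/face w/o complications')])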

Yes, that's definitely on the roadmap. Our current idea is to use a simplified, displaCy-style interface and a workflow similar to NER annotation. Edit: Forgot to add – in the meantime, here are some ideas and strategies for how to make dependency / relation annotation work with the current interfaces.

Is there planned support for newlines in this interface? I’d expect them to just work, since white-space is set to pre-wrap, but something about the spans being inline-block seems to be preventing it.

I’ve copied your demo here https://codepen.io/erikwiffin/pen/EoLogq but I added newlines to the raw text. As you can see, they aren’t being rendered in the interface.

@erikwiffin Ah, my demo is a bit rudimentary and doesn’t necessarily reflect the actual rendering of Prodigy. However, I just tested it in the interface and you’re right – because the newline is enclosed in an inline-block element, it doesn’t cause the surrounding inline-block elements to reflow and instead, just stretches the token container (which makes sense).

So thanks for bringing that up! I just tested it briefly and I think it should be fine to keep the tokens as regular inline elements.

Because ner.manual pre-tokenizes the text using spaCy, newlines will only ever occur if there are multiple of them (which spaCy will preserve as individual tokens, to always keep a reference to the original text). But in cases like this, they’re especially relevant. They also sometimes throw off the model’s predictions, which causes \n tokens to be labelled as entities. So we definitely don’t want to swallow them. (In the worst case scenario, annotators might even accidentally select newline tokens as part of entities and fuck up the model in very confusing ways.)

One solution could be to port over the whitespace indicators I implemented as an experimental feature for the latest version – see this thread for details. You can already test this by setting "show_whitespace": true in your prodigy.json, and running the regular ner interface.

Edit: I ran a few tests and it turns out that inline-block tokens are necessary in order to allow cross-browser compatible selection by double-clicking a token. Block-scoping the token constrains the highlighting to the element boundaries. As a solution for now, I’m simply replacing \n and \t with visible newline and tab indicator symbols (during rendering only).

Styling those elements is a little tricky, because it makes the highlighting logic more difficult. So the visual output is currently ambiguous if the input text itself contains those indicator characters (which should still be a lot less common than \n or \t, so it might be a decent compromise for now).

Wow this is amazing and really needed! Thank you!

I've built spans just like you proposed for my tasks, using spaCy's PatternMatcher, which returns multiple matches if available, but the ner_manual view seems a little off – by a token, maybe. Also, the first task goes unannotated; I have to ignore it to get a new task that is annotated.

Also, while building the spans, I had to normalize overlapping spans and drop smaller spans in favor of longer ones, so the manual view gets data that's as cleaned up as possible.
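
Here's roughly what that normalization step looks like (just a sketch with made-up names, working on the character offsets):

def filter_overlapping_spans(spans):
    # sort longest-first, so longer spans claim their character range first
    spans = sorted(spans, key=lambda s: s['end'] - s['start'], reverse=True)
    kept, taken = [], set()
    for span in spans:
        positions = set(range(span['start'], span['end']))
        if not positions & taken:
            kept.append(span)
            taken.update(positions)
    return sorted(kept, key=lambda s: s['start'])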

I couldn't use the ner.teach recipe because Prodigy's PatternMatcher returns only one match, as far as I understand. I'd like to use the knowledge from both the patterns and the model's predictions in the manual view.

Here's what my recipe looks like. I'd really appreciate some quick help.

import spacy
import prodigy
from prodigy.components.loaders import get_stream
from prodigy.util import log
# split_tokens (defined earlier in this thread), get_labels and
# MyPatternMatcher are my own helpers, defined elsewhere

@prodigy.recipe('ner.semi-manual',
        dataset=prodigy.recipe_args['dataset'],
        spacy_model=prodigy.recipe_args['spacy_model'],
        source=prodigy.recipe_args['source'],
        api=prodigy.recipe_args['api'],
        loader=prodigy.recipe_args['loader'],
        label=prodigy.recipe_args['label'],
        patterns=prodigy.recipe_args['patterns'],
        exclude=prodigy.recipe_args['exclude'])
def manual(dataset, spacy_model, source=None, api=None, loader=None,
           label=None, patterns=None, exclude=None):
    """
    Mark spans by token. Requires only a tokenizer and no entity recognizer,
    and doesn't do any active learning.
    """
    log("RECIPE: Starting recipe ner.manual", locals())
    nlp = spacy.load(spacy_model)
    log("RECIPE: Loaded model {}".format(spacy_model))
    labels = get_labels(label, nlp)
    log("RECIPE: Annotating with {} labels".format(len(labels)), labels)

    my_matcher = MyPatternMatcher(nlp).from_disk(patterns)

    stream = get_stream(source, api=api, loader=loader, rehash=True,
                        dedup=True, input_key='text')

    stream = split_tokens(nlp, stream)
    stream = my_matcher(stream) # adds spans to task based on patterns matched

    return {
        'view_id': 'ner_manual',
        'dataset': dataset,
        'stream': stream,
        'exclude': exclude,
        'config': {'labels': labels}
    }

Thanks

I think the off-by-one error might occur because the "token spans" in the interface don't match the token indices of the pre-annotated spans. For example, the first token would be "start": 0, "end": 0, not "start": 0, "end": 1. Not sure if this is a bug or inconsistency in the version of Prodigy you're using, or somewhere in your logic that adds the token positions to your matched spans.
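
If it helps with debugging, here's a hypothetical sanity check (not a built-in) that verifies each pre-annotated span's character offsets and token indices line up with the task's "tokens" list, using the format from the examples in this thread:

def check_spans(eg):
    tokens = eg['tokens']
    for span in eg.get('spans', []):
        covered = tokens[span['startIdx']:span['endIdx']]
        assert covered, 'span covers no tokens'
        assert covered[0]['start'] == span['start'], 'start offset mismatch'
        assert covered[-1]['end'] == span['end'], 'end offset mismatch'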

Yes – that's a bug and the fix will be included in the next release. Since the current state of ner.manual doesn't yet support pre-annotated spans, the interface only rendered the spans when the user updated the annotations – but not on mount.

I've been following your example below

{
    "text": "Hello Apple",
    "tokens": [
        {"text": "Hello", "start": 0, "end": 5, "id": 0},
        {"text": "Apple", "start": 6, "end": 11, "id": 1}
    ],
    "spans": [
        {"start": 6, "end": 11, "startIdx": 1, "endIdx": 2, "label": "ORG"}
    ]
}

This is what my output for a task looks like:

{
    "text": "Unspecified malignant neoplasm of skin of unspecified part of face",
    "_input_hash": 1363891672,
    "_task_hash": 35627368,
    "tokens": [
        {
            "text": "Unspecified",
            "start": 0,
            "end": 11,
            "id": 0
        },
        {
            "text": "malignant",
            "start": 12,
            "end": 21,
            "id": 1
        },
        {
            "text": "neoplasm",
            "start": 22,
            "end": 30,
            "id": 2
        },
        {
            "text": "of",
            "start": 31,
            "end": 33,
            "id": 3
        },
        {
            "text": "skin",
            "start": 34,
            "end": 38,
            "id": 4
        },
        {
            "text": "of",
            "start": 39,
            "end": 41,
            "id": 5
        },
        {
            "text": "unspecified",
            "start": 42,
            "end": 53,
            "id": 6
        },
        {
            "text": "part",
            "start": 54,
            "end": 58,
            "id": 7
        },
        {
            "text": "of",
            "start": 59,
            "end": 61,
            "id": 8
        },
        {
            "text": "face",
            "start": 62,
            "end": 66,
            "id": 9
        }
    ],
    "spans": [
        {
            "text": "skin",
            "start": 34,
            "end": 38,
            "startIdx": 4,
            "endIdx": 5,
            "label": "BODY_PART"
        },
        {
            "text": "face",
            "start": 62,
            "end": 66,
            "startIdx": 9,
            "endIdx": 10,
            "label": "BODY_PART"
        }
    ]
}

Not sure what I'm doing wrong. I'm using the latest versions, spacy==2.0.5 and prodigy==1.2.0.

Thanks

Thanks for sharing the example – I'll try it out and have a look.

As I mentioned above, annotating pre-defined spans in the manual interface is not "officially" supported in Prodigy v1.2.0 – so everything you're doing here is pretty experimental and may not work perfectly. The updated interface and new ner.make-gold workflow coming in v1.3.0 will implement all of the required changes in the core library, so you'll be able to use this workflow out-of-the-box.

Any estimate of how long we may have to wait?

@imranarshad There’ll hopefully be another update this week. We do want to get a few other changes and fixes in as well.

Awesome @ines

Thanks

Two questions on workflow and training for the manual interface:

  1. should the manual annotations be exhaustive for the text, like a “GoldParse”, or can they be incomplete like ner.teach?
  2. how does --exclude work for the manual interface? Will it just hash the text since there’s no proposed label? It would be nice if it knew which tags were available in the manual interface and incorporated that, so another session with different active tags could see the sentence again.

With no model to update, you can use your own targeting strategy. I think doing one entity at a time, exhaustively, will be a good approach.

It would currently exclude by input hash. I think you'd be best off writing your own filter to give you fine-grained control.
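
For example, something along these lines (only a sketch – filter_seen and the label check are made up, not built-ins):

from prodigy.components.db import connect

def filter_seen(stream, dataset, labels):
    # skip examples whose text was already annotated with one of the given
    # labels in an existing dataset; assumes the incoming examples already
    # have an "_input_hash", e.g. via get_stream(..., rehash=True)
    db = connect()
    seen = set()
    for eg in db.get_dataset(dataset) or []:
        if any(span.get('label') in labels for span in eg.get('spans', [])):
            seen.add(eg['_input_hash'])
    for eg in stream:
        if eg.get('_input_hash') not in seen:
            yield eg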

Hey @ines, how is progress on the newer version?

@imranarshad Sorry for the delay – we had to push an update to spaCy first, and then ended up implementing some more features.

Just released Prodigy v1.3.0! :tada: See here for the new ner.make-gold workflow. There’s now also a pos.make-gold recipe for annotating part-of-speech tags in a similar way. See the new changelog for an overview of all updates in the new version.
