✨ Demo: fully manual NER annotation interface

Thanks! :tada:

This is a good idea – at least, it could be an option that users could toggle. For long texts with many labels, a choice-like list could easily get a little messy. But if you only have a few labels, this is definitely nicer. Even better if we also add the keyboard shortcuts. Will definitely try it out and keep you updated!

Edit: The only issue here is that in order to keep the highlighting flow smooth, the labels should be selected before the span. (Changing a span retroactively is difficult, because it means there needs to be a way to select an already added span, a different highlighting style for selected entities etc.) But a lot of this comes down to visual presentation – so maybe we should just display the labels as buttons within the annotation card heading... (Sorry, mostly thinking aloud here :wink: )

Quick preview of the alternative label style – still need to add keyboard shortcuts. It currently checks whether a "ner_manual_label_style" option is set (either "dropdown" or "list") and sets the label style based on that. If not, a list is used for label sets of 8 or fewer, and a dropdown for larger sets.
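
For reference, that would presumably look like this in your prodigy.json (assuming the option ends up being read from the global config like the other settings – it's still just a preview):

{
    "ner_manual_label_style": "list"
}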

Congratulations, @ines! Prodigy is exactly what my training workflow has been missing. Now I’m able to better leverage the expertise of non-technical stakeholders.

Please allow me to echo @imranarshad in requesting further investigation of a right-click menu following text selection. It feels more natural for users, and Palantir's context menu (available via their Blueprint framework) is an excellent example of this done well.

I’m happy to contribute in any way, as I have considerable front-end experience with React. And again, thank you for such a wonderful annotation tool!

@lukateake Thanks a lot!

The main problem I currently see with a context menu is that it hijacks the browser’s native behaviour (something I think should ideally be avoided if possible) with very little benefit for the actual user experience. After all, the main purpose of the interface is to be as fast and intuitive to navigate as possible.

I also think there’s a big advantage in making all available user actions visible at first glance and not hiding them behind other interactions. This is a pretty consistent UX pattern across all of Prodigy’s interfaces, and it also ensures both click-based and keyboard-enhanced workflows, depending on the user’s preference. For example, in my preview above, an annotation sequence could look like: 1 → highlight → 3 → highlight → A (accept) (I haven’t added the key indicators yet, but it’d be similar to the choice interface.)

Adding another click-based interaction to the labelled entity spans would also introduce several other problems: we’d have to break with the simple workflow of immediately locking in the spans. Instead, the UI would have to “wait” for the user to set the label. Deleting spans would also be more difficult. Having a significant action on both the left and right click is pretty problematic – especially if it’s deleting (!) the span vs. setting a label.

I’m trying to make it work with my own recipe (no luck so far), but is there any way to use manual annotation with spans already suggested via patterns? It would save a lot of time.

Have you had a look at the prodigy.ner.manual recipe yet? The only important thing here is to use the split_tokens(nlp, stream) function on your stream (available via prodigy.components.preprocess.split_tokens). This will add a "tokens" key to your tasks that includes the individual tokens and enables the token-based boundary selection.

You can also check out the "Annotation task formats" section in your PRODIGY_README.html for an example of how the list of tokens should look.
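
For example, in a custom recipe, the relevant part could look something like this – just a minimal sketch, with the model and file names as placeholders:

import spacy
from prodigy.components.loaders import get_stream
from prodigy.components.preprocess import split_tokens

nlp = spacy.load('en_core_web_sm')                    # any model with a tokenizer
stream = get_stream('my_data.jsonl', loader='jsonl')  # placeholder source
stream = split_tokens(nlp, stream)                    # adds "tokens" to each task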

Not yet, but it's definitely something we want to add! At the moment, already present spans on the stream are reset when running ner.manual (i.e. within the split_tokens preprocessor). This is because the manual mode requires two additional attributes on the span: the start token index of the span and the end token index of the span. Otherwise, Prodigy can't resolve the span back to the original token positions.

But if you want to play with this, you can write your own function to add "tokens" to your annotation tasks, and add a startIdx and endIdx to your already existing spans. This is still experimental, though.

{
    "text": "Hello Apple",
    "tokens": [
        {"text": "Hello", "start": 0, "end": 5, "id": 0},
        {"text": "Apple", "start": 6, "end": 11, "id": 1}
    ],
    "spans": [
        {"start": 6, "end": 11, "startIdx": 1, "endIdx": 2, "label": "ORG"}
    ]
}

I was thinking of using the merge_span function on the db-out output for already trained data, and then feeding it to manual NER, where the span indices would have multiple sets of labels – so the UI can show those labels as already annotated. Do I make any sense here?

I’m already playing with your code to customize it for my needs – wish me luck. Thanks for the great tool and quick response.

Good luck, @imranarshad! I’m looking to do the same thing as well.

@lukateake I will post the recipe if I get lucky. Thanks

I hope it works 🤞 Sorry if this is still a little hard at the moment – as I said, that part is experimental. We're currently working on adding the required functions for this to the Prodigy core library. There'll be a helper that reconstructs the span-token indices, and the built-in NER recipes will also make sure to add them to the spans by default.

This means we could also add a --manual or --edit flag to the active-learning-powered recipes like ner.teach. So you could just do ner.teach dataset en_core_web_sm my_data.jsonl --manual and it'd show you the recognised spans, but make the task editable :tada:

Yes, this makes sense! I haven't tested it, but it sounds reasonable. Btw, here's a simple split_tokens function you can use to add the "tokens" key yourself:

def split_tokens(nlp, stream):
    # tokenize each incoming task and add a "tokens" key with the text,
    # character offsets and ID of every token
    for eg in stream:
        doc = nlp(eg['text'])
        eg['tokens'] = [{'text': token.text, 'start': token.idx,
                         'end': token.idx + len(token.text), 'id': i}
                        for i, token in enumerate(doc)]
        yield eg

Now your pre-annotated spans only need a startIdx and endIdx (the naming is bad, this will be changed in the next update!). So a span describing token 0 to token 4 will need "startIdx": 0, "endIdx": 5.
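
If your pre-annotated spans only have character offsets so far, a function along these lines could reconstruct the token indices – just a sketch (add_token_indices is a made-up name), and it assumes every span aligns exactly with token boundaries:

def add_token_indices(eg):
    # map the spans' character offsets onto the token IDs produced by
    # split_tokens – endIdx is exclusive, matching the convention above
    starts = {token['start']: token['id'] for token in eg['tokens']}
    ends = {token['end']: token['id'] for token in eg['tokens']}
    for span in eg.get('spans', []):
        span['startIdx'] = starts[span['start']]
        span['endIdx'] = ends[span['end']] + 1
    return eg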

@ines awesome thanks. Can’t wait to try it out

Good news btw – just got it working locally :tada: :tada: So we’ll definitely ship this with the next release. Including the --manual flag on ner.teach!

Edit: Or, probably a better solution to not interfere with the active learning component and prefer_uncertain on ner.teach: This workflow could replace the current ner.make-gold. So instead of seeing the most uncertain prediction, you see the best parse and can correct it. (Ha, you’re really getting live insights into the Prodigy development process here, haha.)

Yes! Enabling a zero-click interaction context (all keyboard) would save my liiiiiife :smiley:

Also character-level selection, though that can certainly be a mode.

Do you plan to enable relation annotation at any point?

FWIW, the documentation doesn’t seem to mention that this feature isn’t supported yet. I tried -v ner_manual with the mark recipe, and it looked like it was working wonderfully until I hit the dreaded “no tasks” screen. :scream:

Ah yeah, it's kinda lumped in with the ner interface at the moment – I was thinking about writing a simple demo script that mimics the highlighting behaviour, though, to showcase it better (similar to the demo posted in the first post of this thread).

The problem with using mark and the ner_manual interface is that Prodigy needs a model or at least a tokenizer to split the text into tokens. So you'll either need to feed in tasks that already have a "tokens" property set (see the example data I posted above), or just use ner.manual instead.
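
If you do want to keep using mark, you could pre-tokenize your data in a separate step – a sketch, with the file names as placeholders:

import json
import spacy

nlp = spacy.load('en_core_web_sm')
with open('raw.jsonl') as f_in, open('pretokenized.jsonl', 'w') as f_out:
    for line in f_in:
        eg = json.loads(line)
        doc = nlp(eg['text'])
        eg['tokens'] = [{'text': token.text, 'start': token.idx,
                         'end': token.idx + len(token.text), 'id': i}
                        for i, token in enumerate(doc)]
        f_out.write(json.dumps(eg) + '\n')

Running prodigy mark your_dataset pretokenized.jsonl -v ner_manual on the result should then give you the token-based selection.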

Yesss, we'd love to make this happen – but it's pretty difficult to get right. And if we do it, we want it to be actually good and useful. The "boundaries" interface sort of went in that direction, but it came with all kinds of other problems. But we'll keep experimenting.

Either that, or you could add your own tokenization rules, for example, if you need to handle certain characters or punctuation differently. It might take you 20 minutes to write a few regular expressions and add them to spaCy's tokenizer – but that's still a lot more efficient overall than adding 5 more seconds to each individual annotation decision.
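
For example, here's roughly what adding an extra infix rule to spaCy v2's tokenizer looks like – a sketch that splits on forward slashes, so adjust the pattern to whatever your data needs:

import spacy
from spacy.util import compile_infix_regex

nlp = spacy.load('en_core_web_sm')
# add '/' to the default infix patterns, so "A/B" becomes three tokens
infixes = nlp.Defaults.infixes + (r'/',)
nlp.tokenizer.infix_finditer = compile_infix_regex(infixes).finditer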

Yes, that's definitely on the roadmap. Our current idea is to use a simplified, displaCy-style interface and a workflow similar to NER annotation. Edit: Forgot to add – in the meantime, here are some ideas and strategies for how to make dependency / relation annotation work with the current interfaces.

Is there planned support for newlines in this interface? I’d expect them to just work, since white-space is set to pre-wrap, but something about the spans being inline-block seems to be preventing it.

I’ve copied your demo here https://codepen.io/erikwiffin/pen/EoLogq but I added newlines to the raw text. As you can see, they aren’t being rendered in the interface.

@erikwiffin Ah, my demo is a bit rudimentary and doesn’t necessarily reflect the actual rendering in Prodigy. However, I just tested it in the interface and you’re right – because the newline is enclosed in an inline-block element, it doesn’t cause the surrounding inline-block elements to reflow and instead just stretches the token container (which makes sense).

So thanks for bringing that up! I just tested it briefly and I think it should be fine to keep the tokens as regular inline elements.

Because ner.manual pre-tokenizes the text using spaCy, newlines will only ever occur if there are multiple of them (which spaCy will preserve as individual tokens, to always keep a reference to the original text). But in cases like this, they’re especially relevant. They also sometimes throw off the model’s predictions, which causes \n tokens to be labelled as entities. So we definitely don’t want to swallow them. (In the worst case scenario, annotators might even accidentally select newline tokens as part of entities and fuck up the model in very confusing ways.)

One solution could be to port over the whitespace indicators I implemented as an experimental feature for the latest version – see this thread for details. You can already test this by setting "show_whitespace": true in your prodigy.json, and running the regular ner interface.
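
i.e. in your prodigy.json:

{
    "show_whitespace": true
}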

Edit: I ran a few tests and it turns out that inline-block tokens are necessary in order to allow cross-browser compatible selection by double-clicking a token. Block-scoping the token constrains the highlighting to the element boundaries. As a solution for now, I’m simply replacing \n and \t with ↵ and ⇥ (during rendering only).

Styling those elements is a little tricky, because it makes the highlighting logic more difficult. So the visual output is currently ambiguous if the input text contains the unicode characters ↵ or ⇥ (which should still be a lot less common than \n or \t, so it might be a decent compromise for now).

Wow this is amazing and really needed! Thank you!

I've built spans just like you proposed for my tasks, using spaCy's PatternMatcher, which returns multiple matches if available. But the ner_manual view seems a little off – by a token, maybe. Also, the first task comes up unannotated; I have to ignore it to get a new task that is annotated.

Also, while building the spans, I had to normalize overlapping spans and drop the smaller ones in favor of the longer ones, so the manual view gets data that's as cleaned up as possible.
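
Here's roughly the normalization I'm doing, in case it helps – a quick sketch (the function name is just mine):

def filter_overlapping(spans):
    # prefer longer spans: sort by length (longest first) and keep a span
    # only if it doesn't overlap any span we've already kept
    spans = sorted(spans, key=lambda s: s['end'] - s['start'], reverse=True)
    kept = []
    for span in spans:
        if all(span['end'] <= other['start'] or span['start'] >= other['end']
               for other in kept):
            kept.append(span)
    return sorted(kept, key=lambda s: s['start'])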

I could not use the ner.teach recipe, because Prodigy's PatternMatcher returns only one match, as far as I understand. I'd like to use the knowledge from both the patterns and the model's predictions in the manual view.

Here is what my recipe looks like. Would really appreciate some quick help.

import spacy
import prodigy
from prodigy.components.preprocess import split_tokens
from prodigy.components.loaders import get_stream
from prodigy.util import log

# get_labels and MyPatternMatcher are my own helpers defined elsewhere:
# MyPatternMatcher wraps the matcher loaded from the patterns file and
# adds "spans" to each task for whatever it matches.

@prodigy.recipe('ner.semi-manual',
                dataset=prodigy.recipe_args['dataset'],
                spacy_model=prodigy.recipe_args['spacy_model'],
                source=prodigy.recipe_args['source'],
                api=prodigy.recipe_args['api'],
                loader=prodigy.recipe_args['loader'],
                label=prodigy.recipe_args['label'],
                patterns=prodigy.recipe_args['patterns'],
                exclude=prodigy.recipe_args['exclude'])
def manual(dataset, spacy_model, source=None, api=None, loader=None,
           label=None, patterns=None, exclude=None):
    """
    Mark spans by token. Requires only a tokenizer and no entity recognizer,
    and doesn't do any active learning.
    """
    log("RECIPE: Starting recipe ner.semi-manual", locals())
    nlp = spacy.load(spacy_model)
    log("RECIPE: Loaded model {}".format(spacy_model))
    labels = get_labels(label, nlp)
    log("RECIPE: Annotating with {} labels".format(len(labels)), labels)

    my_matcher = MyPatternMatcher(nlp).from_disk(patterns)

    stream = get_stream(source, api=api, loader=loader, rehash=True,
                        dedup=True, input_key='text')

    stream = split_tokens(nlp, stream)  # add "tokens" to each task
    stream = my_matcher(stream)         # add "spans" based on matched patterns

    return {
        'view_id': 'ner_manual',
        'dataset': dataset,
        'stream': stream,
        'exclude': exclude,
        'config': {'labels': labels}
    }

Thanks

I think the off-by-one error might occur because the "token spans" in the interface don't match the token indices of the pre-annotated spans. For example, the first token would be "start": 0, "end": 0, not "start": 0, "end": 1. Not sure if this is a bug or inconsistency in the version of Prodigy you're using, or somewhere in your logic that adds the token positions to your matched spans.

Yes – that's a bug and the fix will be included in the next release. Since the current state of ner.manual doesn't yet support pre-annotated spans, the interface only rendered the spans when the user updated the annotations – but not on mount.