✨ Demo: fully manual NER annotation interface

Thanks! :tada:

This is a good idea – at least, it could be an option that users could toggle. For long texts with many labels, a choice-like list could easily get a little messy. But if you only have a few labels, this is definitely nicer. Even better if we also add the keyboard shortcuts. Will definitely try it out and keep you updated!

Edit: The only issue here is that in order to keep the highlighting flow smooth, the labels should be selected before the span. (Changing a span retroactively is difficult, because it means there needs to be a way to select an already added span, a different highlighting style for selected entities etc.) But a lot of this comes down to visual presentation – so maybe we should just display the labels as buttons within the annotation card heading... (Sorry, mostly thinking aloud here :wink: )

Quick preview of the alternative label style – still need to add keyboard shortcuts. It currently checks whether a "ner_manual_label_style" option is set (either "dropdown" or "list") and sets the label style based on that. If not, a list is used for label sets of 8 or fewer, and a dropdown for larger sets.
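
For reference, that would presumably look like this in your prodigy.json (assuming the option ends up being read from the global config like the other settings – it's still just a preview):

{
    "ner_manual_label_style": "list"
}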

Congratulations, @ines! Prodigy is exactly what my training workflow has been missing. Now I’m able to better leverage the expertise of non-technical stakeholders.

Please allow me to echo @imranarshad in requesting further investigation of a right-click menu following text selection. It feels more natural for users, and Palantir's context menu (available via their Blueprint framework) is an excellent example of this done well.

I’m happy to contribute in any way, as I have considerable front-end experience with React. And again, thank you for such a wonderful annotation tool!

@lukateake Thanks a lot!

The main problem I currently see with a context menu is that it hijacks the browser’s native behaviour (something I think should ideally be avoided if possible) with very little benefit for the actual user experience. After all, the main purpose of the interface is to be as fast and intuitive to navigate as possible.

I also think there’s a big advantage in making all available user actions visible at first glance and not hiding them behind other interactions. This is a pretty consistent UX pattern across all of Prodigy’s interfaces, and it also ensures both click-based and keyboard-enhanced workflows, depending on the user’s preference. For example, in my preview above, an annotation sequence could look like: 1 → highlight → 3 → highlight → A (accept) (I haven’t added the key indicators yet, but it’d be similar to the choice interface.)

Adding another click-based interaction to the labelled entity spans would also introduce several other problems: we’d have to break with the simple workflow of immediately locking in the spans. Instead, the UI would have to “wait” for the user to set the label. Deleting spans would also be more difficult. Having a significant action on both the left and right click is pretty problematic – especially if it’s deleting (!) the span vs. setting a label.

I’m trying to make it work with my own recipe (no luck so far), but is there any way to use manual annotation with spans already suggested via patterns? It would save a lot of time.

Have you had a look at the prodigy.ner.manual recipe yet? The only important thing here is to use the split_tokens(nlp, stream) function on your stream (available via prodigy.components.preprocess.split_tokens). This will add a "tokens" key to your tasks that includes the individual tokens and enables the token-based boundary selection.

You can also check out the "Annotation task formats" section in your PRODIGY_README.html for an example of how the list of tokens should look.
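
For example, in a custom recipe, the relevant part could look something like this – just a minimal sketch, with the model and file names as placeholders:

import spacy
from prodigy.components.loaders import get_stream
from prodigy.components.preprocess import split_tokens

nlp = spacy.load('en_core_web_sm')                    # any model with a tokenizer
stream = get_stream('my_data.jsonl', loader='jsonl')  # placeholder source
stream = split_tokens(nlp, stream)                    # adds "tokens" to each task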

Not yet, but it's definitely something we want to add! At the moment, already present spans on the stream are reset when running ner.manual (i.e. within the split_tokens preprocessor). This is because the manual mode requires two additional attributes on the span: the start token index of the span and the end token index of the span. Otherwise, Prodigy can't resolve the span back to the original token positions.

But if you want to play with this, you can write your own function to add "tokens" to your annotation tasks, and add a startIdx and endIdx to your already existing spans. This is still experimental, though.

{
    "text": "Hello Apple",
    "tokens": [
        {"text": "Hello", "start": 0, "end": 5, "id": 0},
        {"text": "Apple", "start": 6, "end": 11, "id": 1}
    ],
    "spans": [
        {"start": 6, "end": 11, "startIdx": 1, "endIdx": 2, "label": "ORG"}
    ]
}

I was thinking of using the merge_span function on the db-out output for already trained data, and then feeding it to manual NER, where the span indices would have multiple sets of labels – so the UI can show those labels as already annotated. Do I make any sense here?

I’m already playing with your code to customize it for my needs – wish me luck. Thanks for the great tool and quick response.

Good luck, @imranarshad! I’m looking to do the same thing as well.

@lukateake I will post the recipe if I get lucky. Thanks

I hope it works 🤞 Sorry if this is still a little hard at the moment – as I said, that part is experimental. We're currently working on adding the required functions for this to the Prodigy core library. There'll be a helper that reconstructs the span-token indices, and the built-in NER recipes will also make sure to add them to the spans by default.

This means we could also add a --manual or --edit flag to the active-learning-powered recipes like ner.teach. So you could just do ner.teach dataset en_core_web_sm my_data.jsonl --manual and it'd show you the recognised spans, but make the task editable :tada:

Yes, this makes sense! I haven't tested it, but it sounds reasonable. Btw, here's a simple split_tokens function you can use to add the "tokens" key yourself:

def split_tokens(nlp, stream):
    # tokenize each incoming task and add a "tokens" key with the text,
    # character offsets and ID of every token
    for eg in stream:
        doc = nlp(eg['text'])
        eg['tokens'] = [{'text': token.text, 'start': token.idx,
                         'end': token.idx + len(token.text), 'id': i}
                        for i, token in enumerate(doc)]
        yield eg

Now your pre-annotated spans only need a startIdx and endIdx (the naming is bad, this will be changed in the next update!). So a span describing token 0 to token 4 will need "startIdx": 0, "endIdx": 5.
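
If your pre-annotated spans only have character offsets so far, a function along these lines could reconstruct the token indices – just a sketch (add_token_indices is a made-up name), and it assumes every span aligns exactly with token boundaries:

def add_token_indices(eg):
    # map the spans' character offsets onto the token IDs produced by
    # split_tokens – endIdx is exclusive, matching the convention above
    starts = {token['start']: token['id'] for token in eg['tokens']}
    ends = {token['end']: token['id'] for token in eg['tokens']}
    for span in eg.get('spans', []):
        span['startIdx'] = starts[span['start']]
        span['endIdx'] = ends[span['end']] + 1
    return eg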

@ines awesome thanks. Can’t wait to try it out

Good news btw – just got it working locally :tada: :tada: So we’ll definitely ship this with the next release. Including the --manual flag on ner.teach!

Edit: Or, probably a better solution to not interfere with the active learning component and prefer_uncertain on ner.teach: This workflow could replace the current ner.make-gold. So instead of seeing the most uncertain prediction, you see the best parse and can correct it. (Ha, you’re really getting live insights into the Prodigy development process here, haha.)

Yes! Enabling a zero-click interaction context (all keyboard) would save my liiiiiife :smiley:

Also character-level selection, though that can certainly be a mode.

Do you plan to enable relation annotation at any point?

FWIW, the documentation doesn’t seem to mention that this feature isn’t supported yet. I tried -v ner_manual with the mark recipe, and it looked like it was working wonderfully until I hit the dreaded “no tasks” screen. :scream:

Ah yeah, it's kinda lumped in with the ner interface at the moment – I was thinking about writing a simple demo script that mimics the highlighting behaviour, though, to showcase it better (similar to the demo posted in the first post of this thread).

The problem with using mark and the ner_manual interface is that Prodigy needs a model or at least a tokenizer to split the text into tokens. So you'll either need to feed in tasks that already have a "tokens" property set (see the example data I posted above), or just use ner.manual instead.
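
If you do want to keep using mark, you could pre-tokenize your data in a separate step – a sketch, with the file names as placeholders:

import json
import spacy

nlp = spacy.load('en_core_web_sm')
with open('raw.jsonl') as f_in, open('pretokenized.jsonl', 'w') as f_out:
    for line in f_in:
        eg = json.loads(line)
        doc = nlp(eg['text'])
        eg['tokens'] = [{'text': token.text, 'start': token.idx,
                         'end': token.idx + len(token.text), 'id': i}
                        for i, token in enumerate(doc)]
        f_out.write(json.dumps(eg) + '\n')

Running prodigy mark your_dataset pretokenized.jsonl -v ner_manual on the result should then give you the token-based selection.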

Yesss, we'd love to make this happen – but it's pretty difficult to get right. And if we do it, we want it to be actually good and useful. The "boundaries" interface sort of went in that direction, but it came with all kinds of other problems. But we'll keep experimenting.

Either that, or you could add your own tokenization rules, for example, if you need to handle certain characters or punctuation differently. It might take you 20 minutes to write a few regular expressions and add them to spaCy's tokenizer – but that's still a lot more efficient overall than adding 5 more seconds to each individual annotation decision.
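
For example, here's roughly what adding an extra infix rule to spaCy v2's tokenizer looks like – a sketch that splits on forward slashes, so adjust the pattern to whatever your data needs:

import spacy
from spacy.util import compile_infix_regex

nlp = spacy.load('en_core_web_sm')
# add '/' to the default infix patterns, so "A/B" becomes three tokens
infixes = nlp.Defaults.infixes + (r'/',)
nlp.tokenizer.infix_finditer = compile_infix_regex(infixes).finditer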

Yes, that's definitely on the roadmap. Our current idea is to use a simplified, displaCy-style interface and a workflow similar to NER annotation. Edit: Forgot to add – in the meantime, here are some ideas and strategies for how to make dependency / relation annotation work with the current interfaces.

Is there planned support for newlines in this interface? I’d expect them to just work, since white-space is set to pre-wrap, but something about the spans being inline-block seems to be preventing it.

I’ve copied your demo here https://codepen.io/erikwiffin/pen/EoLogq but I added newlines to the raw text. As you can see, they aren’t being rendered in the interface.

@erikwiffin Ah, my demo is a bit rudimentary and doesn’t necessarily reflect the actual rendering in Prodigy. However, I just tested it in the interface and you’re right – because the newline is enclosed in an inline-block element, it doesn’t cause the surrounding inline-block elements to reflow and instead just stretches the token container (which makes sense).

So thanks for bringing that up! I just tested it briefly and I think it should be fine to keep the tokens as regular inline elements.

Because ner.manual pre-tokenizes the text using spaCy, newlines will only ever occur if there are multiple of them (which spaCy will preserve as individual tokens, to always keep a reference to the original text). But in cases like this, they’re especially relevant. They also sometimes throw off the model’s predictions, which causes \n tokens to be labelled as entities. So we definitely don’t want to swallow them. (In the worst case scenario, annotators might even accidentally select newline tokens as part of entities and fuck up the model in very confusing ways.)

One solution could be to port over the whitespace indicators I implemented as an experimental feature for the latest version – see this thread for details. You can already test this by setting "show_whitespace": true in your prodigy.json, and running the regular ner interface.
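
i.e. in your prodigy.json:

{
    "show_whitespace": true
}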

Edit: I ran a few tests and it turns out that inline-block tokens are necessary in order to allow cross-browser compatible selection by double-clicking a token. Block-scoping the token constrains the highlighting to the element boundaries. As a solution for now, I’m simply replacing \n and \t with ↵ and ⇥ (during rendering only).

Styling those elements is a little tricky, because it makes the highlighting logic more difficult. So the visual output is currently ambiguous if the input text contains the unicode characters ↵ or ⇥ (which should still be a lot less common than \n or \t, so it might be a decent compromise for now).

Wow this is amazing and really needed! Thank you!

I've built spans just like you proposed for my tasks, using spaCy's PatternMatcher, which returns multiple matches if available. But the ner_manual view seems a little off – by a token, maybe. Also, the first task comes up unannotated; I have to ignore it to get a new task that is annotated.

Also, while building the spans, I had to normalize overlapping spans and drop the smaller ones in favor of the longer ones, so the manual view gets data that's as cleaned up as possible.
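
Here's roughly the normalization I'm doing, in case it helps – a quick sketch (the function name is just mine):

def filter_overlapping(spans):
    # prefer longer spans: sort by length (longest first) and keep a span
    # only if it doesn't overlap any span we've already kept
    spans = sorted(spans, key=lambda s: s['end'] - s['start'], reverse=True)
    kept = []
    for span in spans:
        if all(span['end'] <= other['start'] or span['start'] >= other['end']
               for other in kept):
            kept.append(span)
    return sorted(kept, key=lambda s: s['start'])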

I could not use the ner.teach recipe, because Prodigy's PatternMatcher returns only one match, as far as I understand. I'd like to use the knowledge from both the patterns and the model's predictions in the manual view.

Here is what my recipe looks like. Would really appreciate some quick help.

import spacy
import prodigy
from prodigy.components.preprocess import split_tokens
from prodigy.components.loaders import get_stream
from prodigy.util import log

# get_labels and MyPatternMatcher are my own helpers defined elsewhere:
# MyPatternMatcher wraps the matcher loaded from the patterns file and
# adds "spans" to each task for whatever it matches.

@prodigy.recipe('ner.semi-manual',
                dataset=prodigy.recipe_args['dataset'],
                spacy_model=prodigy.recipe_args['spacy_model'],
                source=prodigy.recipe_args['source'],
                api=prodigy.recipe_args['api'],
                loader=prodigy.recipe_args['loader'],
                label=prodigy.recipe_args['label'],
                patterns=prodigy.recipe_args['patterns'],
                exclude=prodigy.recipe_args['exclude'])
def manual(dataset, spacy_model, source=None, api=None, loader=None,
           label=None, patterns=None, exclude=None):
    """
    Mark spans by token. Requires only a tokenizer and no entity recognizer,
    and doesn't do any active learning.
    """
    log("RECIPE: Starting recipe ner.semi-manual", locals())
    nlp = spacy.load(spacy_model)
    log("RECIPE: Loaded model {}".format(spacy_model))
    labels = get_labels(label, nlp)
    log("RECIPE: Annotating with {} labels".format(len(labels)), labels)

    my_matcher = MyPatternMatcher(nlp).from_disk(patterns)

    stream = get_stream(source, api=api, loader=loader, rehash=True,
                        dedup=True, input_key='text')

    stream = split_tokens(nlp, stream)  # add "tokens" to each task
    stream = my_matcher(stream)         # add "spans" based on matched patterns

    return {
        'view_id': 'ner_manual',
        'dataset': dataset,
        'stream': stream,
        'exclude': exclude,
        'config': {'labels': labels}
    }

Thanks

I think the off-by-one error might occur because the "token spans" in the interface don't match the token indices of the pre-annotated spans. For example, the first token would be "start": 0, "end": 0, not "start": 0, "end": 1. Not sure if this is a bug or inconsistency in the version of Prodigy you're using, or somewhere in your logic that adds the token positions to your matched spans.

Yes – that's a bug and the fix will be included in the next release. Since the current state of ner.manual doesn't yet support pre-annotated spans, the interface only rendered the spans when the user updated the annotations – but not on mount.