Hi everyone, I thought I'd share some of our current work in progress! The first one is a UI demo of a new interface for fully manual NER annotation – i.e. highlighting a span of text and assigning a label. Once it's implemented, this interface might replace the current "boundaries" interface.
(I've only tested the demo in recent Chrome and Firefox so far, so it might not be 100% cross-browser compatible yet. This will be no problem though once it's ported over to the React app.)
The interface allows adding multiple entities per annotation task.
Selection is handled via the browser's native behaviour – this means you can also double-click on single tokens to highlight them.
The selection is based on already existing token boundaries. This makes the click-and-drag interaction easier, because you just need to hit tokens, not exact characters. (For example, if you highlight only parts of two tokens, the full token span will be "locked in".)
Highlighted spans can be deleted by clicking on them.
To allow the highlighted span to be "locked in" immediately after highlighting, the label needs to be set before selecting the span. After adding an entity, the labels dropdown is focused again to help select the next label, if necessary. The previous decision will be remembered, which makes it faster to add multiple entities of the same type.
Highlighting nested spans is automatically prevented.
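If you're curious how the token snapping above could work under the hood, here's a rough, hypothetical sketch (plain Python, not the actual implementation – the real logic lives in the front-end): given character offsets for each token, a partial selection is expanded outwards to full token boundaries.

```python
# Hypothetical sketch – NOT the actual Prodigy implementation.
# Each token is given as a (start_char, end_char) offset pair.
def snap_to_tokens(sel_start, sel_end, token_offsets):
    """Expand a character-level selection so it covers whole tokens."""
    starts = [s for s, e in token_offsets if e > sel_start]
    ends = [e for s, e in token_offsets if s < sel_end]
    if not starts or not ends:
        return sel_start, sel_end  # selection didn't hit any tokens
    return min(starts), max(ends)

# "The quick fox" – tokens at (0, 3), (4, 9), (10, 13)
offsets = [(0, 3), (4, 9), (10, 13)]
print(snap_to_tokens(6, 11, offsets))  # parts of "quick" and "fox" → (4, 13)
```

So hitting only parts of "quick" and "fox" still locks in the full "quick fox" span – that's the "you just need to hit tokens, not exact characters" idea.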
Looking forward to your feedback, thoughts and ideas! I'm also currently working on a similar interface for image annotation, e.g. highlighting rectangular and polygon shapes. We're also still thinking about how to name those interfaces. Internally, we've been calling them "unguided", but that's a little abstract. So maybe "manual" would be a better idea?
Thanks! I’m still working on implementing it, so there’s nothing to test yet – but I’ll try my best to get it finished for the upcoming release. (Not sure if we’re ready for some sort of prodigy-nightly beta tester program just yet – but we might consider it once we have a larger user base!)
In the meantime, you can already achieve something similar using the boundaries interface and ner.mark – see here for details. It currently only works for one entity and one span per task, though.
Thanks for the quick reply and the info. I’m currently using ner.mark but I’m running into some issues where the sentences are being split in the middle of entities.
e.g. Sentence: The sentence contains the ENTIRE ENTITY with some filler at the end.
is split into:
The sentence contains the ENTIRE
ENTITY with some filler at the end.
Do you have any tips for how I can tweak the model to give me the entire sentence instead of splitting over punctuation and other triggers? This data is from the web so it is a bit messy but I can skip the bad cases.
Also, I’m assuming I will create a new problem where I potentially have two entities within the same sentence. Can I still label multiple entities in the same annotation task in the boundaries interface?
I’m really excited about the new interface that you’re working on. It will make this process so much simpler.
Yes, what you describe is one of the main problems with the boundaries interface at the moment. We’ve been going back and forth on this, and it’s been difficult to find the right balance of trade-offs in terms of efficiency, user experience, annotation speed and so on.
If you look at the source of the mark function in prodigy/recipes/ner.py, you can adjust the token slice here by using a different length or smaller spans to create overlaps between them:
for i in range(0, len(doc), 9):  # document slice
    span = doc[i:i+9]  # focused, annotatable tokens within the slice
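To make the overlap idea concrete, here's a standalone sketch (plain Python lists instead of a spaCy Doc, and the window/stride values are just examples): using a stride smaller than the window size makes consecutive slices share tokens, so an entity cut off at one slice edge shows up whole in the next one.

```python
# Hypothetical sketch: overlapping windows over a token sequence.
# A stride smaller than the window means consecutive slices share tokens,
# so entities near a slice boundary appear whole in at least one task.
def overlapping_slices(tokens, window=9, stride=6):
    return [tokens[i:i + window] for i in range(0, len(tokens), stride)]

tokens = list(range(12))
print(overlapping_slices(tokens))
# → [[0, 1, 2, 3, 4, 5, 6, 7, 8], [6, 7, 8, 9, 10, 11]]
# tokens 6-8 appear in both slices
```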
In theory, the interface can support any number of tokens – and up to 30 if you want to use keyboard shortcuts (shift+num for tens and shift+alt+num for twenties – e.g. shift+5 for 15).
You can also remove the split_sentences(nlp, stream) pre-processor to disable splitting incoming texts into sentences. This means that the texts will be shown as they come in and you might need to do some pre-processing yourself to make them easier to work with or annotate. But it also gives you more control over how this is done.
Good news – I successfully integrated the new interface into the web app last night, and it’s working pretty well so far. Still needs testing and adjustments, but it looks like we’re definitely on track for shipping it with the next release (possibly as an experimental new feature, but it still means that you’ll get to try it out).
We’ll start working on getting everything ready next week. @honnibal is still travelling, and it’s been important to us not to push any rushed updates, especially not over the holidays. But I’m definitely looking forward to getting the new features out to the community so people can start testing and using them – this is always one of my favourite parts of software development.
Amazing highlighting interface! The NER active learning has been a little wobbly for us when training from scratch. This may get us started on the right track.
This may mess up the indices, but it would be great if you could highlight between tokens (like detecting missing words). Though I’m thinking highlighting the two tokens surrounding the missing word may suffice.
Actually, I think your idea of highlighting the two tokens is probably better – even if the interface did support highlighting between words. “Highlight the two tokens around the missing word” – that’s a great, straightforward annotation prompt and it’s probably quite fast, because it requires less clicking precision. The user just needs to hit somewhere within the two surrounding tokens.
This is always something I’ve found frustrating about click-and-drag interfaces – the user needs to click very precisely, and that wastes a lot of energy and attention. So I really like the token boundaries solution we’ve come up with here, and it also fits well with the Prodigy philosophy – i.e. let the machine do as much as possible. We could still offer a character-based mode that users can toggle – but I think in most cases, it’s probably more efficient to just add one or two custom tokenization rules if you need different boundaries (instead of spending ten seconds more on every annotation decision).
Yeah, the double-clicking is actually the browser’s native behaviour – I hadn’t really thought about this before I started developing the interface. I also never realised that different browsers handle this differently, so it needed a few small hacks to (hopefully) make it work consistently.
If the tokenizer splits off the ., it will be rendered as a single token and will only be highlighted if you select it. (This is the only noticeable visual difference here – punctuation, contractions etc. are separated by whitespace. But it also makes it more obvious that they are tokens in their own right.)
So if you’re annotating a lot of punctuation, this might still be a little fiddly… In this case, you might also want to add a few more tokenization rules to force stricter splitting and ensure you don’t end up with punctuation attached to a token. But as I said, the idea here is that writing one or two regular expressions will still be more efficient than pixel-perfect selection.
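To give a rough idea of the kind of regular expression I mean, here's a hypothetical, framework-agnostic sketch (in a real pipeline you'd add rules to the tokenizer itself rather than pre-split the text like this): it splits punctuation off words so a "." or "," never stays glued to the preceding token.

```python
import re

# Hypothetical sketch: split punctuation off words so entity spans
# never end up with a stray "." or "," attached to a token.
def split_punct(text):
    # word characters (with an optional apostrophe part, e.g. "don't")
    # OR a single non-word, non-space character
    return re.findall(r"\w+(?:'\w+)?|[^\w\s]", text)

print(split_punct("Hello, world."))  # ['Hello', ',', 'world', '.']
```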
Thanks! At the moment, the interface assumes that you select the label first and then highlight the span. This has several advantages for the UI:
The entity span can be “locked in” immediately after highlighting the text and without requiring any additional user action. So if you’re annotating several entities of the same label in a row, you’ll only have to select the label once.
The UI can use the browser’s native behaviour and functionality for highlighting, label selection etc. This reduces complexity and makes it easier to ensure cross-compatibility. For example, we won’t have to re-engineer how the browser handles selecting text – this is already built-in, and the native Selection API does the rest.
The labels dropdown has a tabindex, so you can tab back into it after adding an entity. Selecting an entity still requires clicking, but you’ll be able to do everything else using your keyboard as well, if you prefer. So a workflow could look like this: TAB → P (selects “PERSON”) → highlight entity → highlight another entity → TAB → O (selects “ORG”) → highlight entity → etc.