We have started using ner.manual for annotation. However, we would like to switch to the custom HTML view in order to render the tokens in a special way that makes them easier to annotate.
How easy would it be to incorporate the logic for annotating labels (along with the label header on top) in a custom html view and how would we go about doing it?
The upcoming version will allow custom global CSS and custom JavaScript across all interfaces, which should make it easy to customise the visual presentation and how the task is updated with the annotations.
One note about the manual interface and how the tokens are presented (also in case others come across this thread later): If you're planning on training a model on the labelled data later on, especially NLP models on the raw text, you usually want to make sure that you're always able to resolve the tokens back to their offsets in the original text, and that there's no mismatch between what the annotator is seeing vs. what the true underlying text is. That's also part of the reason Prodigy's manual interface uses raw text only, with an option of visualising whitespace characters. I've explained this in some more detail on this thread:
The data we are working on is highly structured, so to aid annotation it makes sense to slightly change only the display of the pre-computed tokens (which already map to specific offsets in the raw text). For instance, we would like to render a line break for every newline inside a whitespace token. Additionally, even if we keep the current UI for the tokens, since we are dealing with long texts we would like to make the label bar follow the scroll of the page, or show the metadata elsewhere (probably at the top).
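For anyone who wants to double-check that pre-computed tokens really do map back to the raw text, here is a minimal sketch. It assumes each token is a dict with "text", "start" and "end" keys (character offsets into "text"); the helper name find_offset_mismatches is made up for illustration and is not part of Prodigy.

```python
def find_offset_mismatches(text, tokens):
    """Return every token whose "text" differs from text[start:end]."""
    mismatches = []
    for token in tokens:
        if text[token["start"]:token["end"]] != token["text"]:
            mismatches.append(token)
    return mismatches


# Hypothetical example data, not taken from the thread
example_text = "Hello\nworld"
example_tokens = [
    {"text": "Hello", "start": 0, "end": 5, "id": 0},
    {"text": "\n", "start": 5, "end": 6, "id": 1},
    {"text": "world", "start": 6, "end": 11, "id": 2},
]
example_mismatches = find_offset_mismatches(example_text, example_tokens)
```

Running a check like this over the whole dataset before annotation starts should surface any token whose displayed text no longer matches the underlying text.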
I am aware of your philosophy of annotating smaller pieces of text and I hope we get there eventually, but for now do you have any suggestions? Or will these be possible only in the upcoming version?
Are you using the latest version of Prodigy? Because some of the things you describe should already be happening by default. For longer texts, the header containing the labels should be sticky and stay at the top. And if your tokenized data contains newlines as single tokens (and you didn’t manually set "hide_true_newline_tokens" to true), newlines within the tokens should be rendered as newlines.
We are on version 1.6.1 and the label header is not sticky, i.e. you have to manually scroll to the top to see it.
Also, for newlines we are seeing things like in the attached image, where the newline tokens are rendered, but there is no actual line break.
That's strange – I definitely wanna look into that then. Which browser are you using?
About the newlines: The intended result is definitely like newline #1 and newline #3 – a visual indicator that can be highlighted, plus an actual newline. (You can disable the highlighting for the tokens by setting "disabled": true on the token in "tokens" btw).
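As a rough sketch of the "disabled" setting mentioned above: the snippet below marks every whitespace-only token as disabled before the task is sent out. The token dict layout ("text", "start", "end", "id") is an assumption, and the helper name disable_whitespace_tokens is made up for illustration.

```python
def disable_whitespace_tokens(tokens):
    """Return copies of the tokens, with whitespace-only tokens marked as disabled."""
    result = []
    for token in tokens:
        new_token = dict(token)  # copy, so the input list is not modified
        if new_token["text"].strip() == "":
            new_token["disabled"] = True
        result.append(new_token)
    return result


# Hypothetical example data
sample_tokens = [
    {"text": "Name", "start": 0, "end": 4, "id": 0},
    {"text": "\n", "start": 4, "end": 5, "id": 1},
    {"text": "Smith", "start": 5, "end": 10, "id": 2},
]
sample_result = disable_whitespace_tokens(sample_tokens)
```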
If you look at the underlying JSON task and the "tokens" property, is there anything that stands out about newline #2?
We are mostly working in Chrome, but I am seeing the same behavior in Firefox as well.
Newline #2 contains additional whitespace as well, so it's not a single new-line character and I think that's the reason it does not break to a new line. And this is one example where we would like to make a change in the UI and force a line break (whilst keeping the same tokens).
If nothing goes wrong, we should have v1.7 up today so you can try again with the new version and see if it resolved the sticky header issue. You'll also be able to use custom JavaScript and CSS overrides straight from your recipes and config, which should make it easier to customise the presentation, or dynamically manipulate what the annotator is seeing.
Okay yes, that makes sense! For this particular case, I'm not 100% sure how you would best solve this through UI customisations (without reinventing the whole rendering logic). Each highlightable unit is a token and if a token only contains a newline, it becomes a "special" newline token – which is already kind of an exception to the otherwise very tight coupling of incoming data and rendering. This is by design, because every small modification the app makes to the incoming data when rendering it can potentially have an impact on consistency and quality of the annotations (especially if they'll be used for machine learning).
So for this case, it might be easiest to just provide additional tokenization that expresses how you want to display the tokens? For example, you could have "tokens", which is your modified custom tokenization that's displayed (with added newline tokens) and "orig_tokens", which is the original tokenization aligned with the text.
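Here is a minimal sketch of what that could look like, assuming tokens are dicts with "text", "start", "end" and "id". Whitespace tokens that contain a newline plus other whitespace are split, so that each newline becomes its own single-character token. The helper names and the choice to split on the newline character are assumptions for illustration, not something Prodigy does for you.

```python
import re


def split_newline_tokens(tokens):
    """Build display tokens in which every newline in a whitespace token stands alone."""
    display = []
    for token in tokens:
        text = token["text"]
        if text.strip() == "" and "\n" in text and text != "\n":
            # Each "\n" becomes its own piece; other whitespace runs stay together
            pieces = re.findall(r"\n|[^\n]+", text)
        else:
            pieces = [text]
        offset = token["start"]
        for piece in pieces:
            display.append({
                "text": piece,
                "start": offset,
                "end": offset + len(piece),
                "id": len(display),
            })
            offset += len(piece)
    return display


def add_display_tokens(task):
    """Keep the original tokens as "orig_tokens" and display the split ones."""
    new_task = dict(task)
    new_task["orig_tokens"] = task["tokens"]
    new_task["tokens"] = split_newline_tokens(task["tokens"])
    return new_task


# Hypothetical example: the middle token is a newline followed by two spaces
demo_task = {
    "text": "a\n  b",
    "tokens": [
        {"text": "a", "start": 0, "end": 1, "id": 0},
        {"text": "\n  ", "start": 1, "end": 4, "id": 1},
        {"text": "b", "start": 4, "end": 5, "id": 2},
    ],
}
demo_result = add_display_tokens(demo_task)
```

Because the split pieces keep character offsets into the same raw text, the display tokens still resolve back to the original text.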
But I'll think about this some more, maybe I'll come up with a better idea!
Sounds great, looking forward to playing with these additions!
Yeah this is a good idea. The only drawback is that it would change the indices of the annotated spans (labels) and it would need some extra post-processing to re-align them to the "orig_tokens". But it should be doable! Thanks for the responses!
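For reference, a rough sketch of that post-processing step. It assumes spans are dicts with character offsets "start" and "end" plus "token_start" and "token_end" (inclusive token ids), and that "orig_tokens" are sorted and carry "start", "end" and "id". The helper name realign_spans is hypothetical. Spans are matched through their character offsets and then snapped outwards to the boundaries of the original tokens they overlap.

```python
def realign_spans(spans, orig_tokens):
    """Re-express spans in terms of the original tokenization via character offsets."""
    realigned = []
    for span in spans:
        # Original tokens that overlap the span's character range
        covered = [
            t for t in orig_tokens
            if t["start"] < span["end"] and t["end"] > span["start"]
        ]
        if not covered:
            raise ValueError(f"Span {span} does not overlap any original token")
        new_span = dict(span)
        new_span["token_start"] = covered[0]["id"]
        new_span["token_end"] = covered[-1]["id"]
        # Snap the character offsets to the original token boundaries
        new_span["start"] = covered[0]["start"]
        new_span["end"] = covered[-1]["end"]
        realigned.append(new_span)
    return realigned


# Hypothetical example: the span was drawn over display tokens 2-3 ("  " and "b")
orig_tokens_demo = [
    {"text": "a", "start": 0, "end": 1, "id": 0},
    {"text": "\n  ", "start": 1, "end": 4, "id": 1},
    {"text": "b", "start": 4, "end": 5, "id": 2},
]
display_spans = [
    {"start": 2, "end": 5, "token_start": 2, "token_end": 3, "label": "FIELD"},
]
realigned_demo = realign_spans(display_spans, orig_tokens_demo)
```

Note that snapping can widen a span if the annotator highlighted only part of what was a single original token, so it may be worth logging those cases for review.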