Best way to format text for annotation?

ines · May 7, 2018, 10:41am

The "text" property is always expected to be the raw text, and it is also what’s used for training the model. You can use HTML with the "html" interface, but doing this for manual NER annotation is quite difficult. Ultimately, this comes down to a similar reason as the one described in this thread: if you’re annotating text by hand, you’re creating character offsets, so every character and every token needs to be visible and selectable. You’d also need a way of distinguishing between intentional formatting and bad markup in your raw data, which can be very difficult.
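To make the character-offset point concrete, here is a minimal sketch in plain Python. The example sentence is made up, and the task dictionary only assumes the usual "text" plus "spans" layout with "start", "end" and "label" keys. It shows that offsets recorded against the raw text stop lining up as soon as markup is stripped for display.

```python
import re

# Hypothetical raw text with leftover markup (made up for illustration).
raw = "Contact <b>Ines Montani</b> in Berlin."

# Offsets are computed against the raw text, markup included.
entity = "Ines Montani"
start = raw.index(entity)
end = start + len(entity)
task = {"text": raw, "spans": [{"start": start, "end": end, "label": "PERSON"}]}

# Naively strip the tags, as you might do to show "clean" text to annotators.
stripped = re.sub(r"<[^>]+>", "", raw)

# The same offsets now point at a different slice of the string.
in_raw = raw[start:end]
in_stripped = stripped[start:end]
print(repr(in_raw), repr(in_stripped))
```

So if the annotator sees one string and the model is trained on another, every span would have to be re-aligned, which is exactly the kind of bookkeeping that goes wrong silently.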

For example, if you have a lot of newlines in your text, those will impact the model’s predictions, so the training data should reflect that. The same goes for leftover HTML markup: random HTML tags are super common in data that wasn’t perfectly cleaned, so during annotation, you definitely want to see things like <span></span> (which would otherwise be invisible). Finally, the manual interface pre-tokenizes the text to make selection more efficient. So if the text contains formatting, you’d have to ensure that the tokenizer always leaves it intact. Alternatively, we’d have to let you disable the token-based selection, which in turn would make annotation more difficult again for your annotators. Those are all reasons why formatting text in manual annotation mode can be problematic.
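Here is a rough sketch of what token-based selection means in practice. It is not the actual implementation: the whitespace regex is only a stand-in for a real tokenizer, and `tokenize` and `snap_to_tokens` are hypothetical helper names. A sloppy character selection gets expanded to whole tokens, and the leftover markup shows up as a token of its own.

```python
import re

def tokenize(text):
    # Stand-in tokenizer: one token per run of non-whitespace characters.
    return [
        {"text": m.group(), "start": m.start(), "end": m.end(), "id": i}
        for i, m in enumerate(re.finditer(r"\S+", text))
    ]

def snap_to_tokens(tokens, start, end):
    # Expand a character selection to cover every token it overlaps.
    hit = [t for t in tokens if t["start"] < end and t["end"] > start]
    if not hit:
        return None
    return {
        "start": hit[0]["start"],
        "end": hit[-1]["end"],
        "token_start": hit[0]["id"],
        "token_end": hit[-1]["id"],
    }

# Made-up text with an empty tag pair and a newline left in the data.
text = "Visit <span></span> New York\ntoday"
tokens = tokenize(text)

# The annotator drags across "w Yo"; the selection snaps to "New York".
span = snap_to_tokens(tokens, 22, 26)
selected = text[span["start"]:span["end"]]
print(selected, [t["text"] for t in tokens])
```

If the tags were rendered as formatting instead, the second token would disappear from view but still be sitting in the text the model is trained on.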

It’s actually a tricky problem, and I don’t think I have a perfect solution yet. I definitely see the logic behind formatting text in manual annotation mode, especially for emphasis or to structure it in a way that makes it easier to work with (something similar also came up here, where a user wanted to display text as a list). We could probably allow token-based styling, but this might be annoying to generate, especially programmatically… I’ll think about this, and I’m also very open to suggestions and examples of potential use cases!
