Best way to format text for annotation?

The "text" property is always expected to be the raw text, and it’s also what’s used for training the model. You can use HTML with the "html" interface – but doing this for manual NER annotation is quite difficult. Ultimately, this comes down to a reason similar to the one described in this thread: if you’re annotating text by hand, you’re creating character offsets into the raw text, so you need to see every character and every token. You also need a way of distinguishing between intentional formatting and bad markup in your raw data, which can be very difficult.
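
To make the offset problem concrete, here’s a minimal sketch in plain Python. It assumes an annotation is stored as start/end character offsets into the raw "text"; the example sentence and the regex used to strip tags are made up for illustration. If the annotator selects "Tim Cook" in a rendered (formatted) view, the resulting offsets no longer point at the same characters in the raw text:

```python
import re

# Raw text as it exists in the data, including leftover markup
raw = "Apple hired <b>Tim Cook</b> in 1998."

# What an annotator would see if the markup were rendered as formatting
# (naive tag stripping, for illustration only)
rendered = re.sub(r"<[^>]+>", "", raw)

# Offsets created by selecting "Tim Cook" in the rendered view
start = rendered.index("Tim Cook")
end = start + len("Tim Cook")

print(rendered[start:end])  # Tim Cook
print(raw[start:end])       # <b>Tim C
```

The same offsets select the right characters in the rendered view and the wrong ones in the raw text, which is what the model would actually be trained on.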

For example, if you have a lot of newlines in your text, those will impact the model’s predictions, so the training data should reflect that. The same goes for leftover HTML markup: random HTML tags are super common in data that wasn’t perfectly cleaned, so during annotation, you definitely want to see things like <span></span> (which would otherwise be invisible). Finally, the manual interface pre-tokenizes the text to make selection more efficient. So if the text contains formatting, you’d have to ensure that the tokenizer always leaves it intact, or alternatively, we’d have to let you disable the token-based selection (which, in turn, would make annotation more difficult again for your annotators). Those are all reasons why formatting text in manual annotation mode can be problematic.
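
As a rough illustration of the tokenization point, here’s a sketch using a simple regex tokenizer. This is an assumption for demonstration purposes, not the tokenizer the manual interface actually uses. It shows how a newline and a stray <span></span> end up as their own tokens, and how each token keeps offsets into the raw text:

```python
import re

# Hypothetical stand-in tokenizer: newlines, word characters, or single symbols
TOKEN_RE = re.compile(r"\n|\w+|[^\w\s]")

def tokenize(text):
    """Return tokens with character offsets into the unmodified text."""
    return [
        {"text": m.group(), "start": m.start(), "end": m.end()}
        for m in TOKEN_RE.finditer(text)
    ]

raw = "Total:\n<span></span>42 USD"
tokens = tokenize(raw)

# Show the newline explicitly so it's visible when printed
visible = [t["text"].replace("\n", "\\n") for t in tokens]
print(visible)
```

With this tokenizer, the empty tag pair alone becomes seven tokens, so any formatting-aware display would need to know which of those are markup to render and which are noise the annotator should see.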

It’s actually a tricky problem, and I don’t think I have a perfect solution yet. I definitely see the logic behind formatting text in manual annotation mode, especially for emphasis or to structure it in a way that makes it easier to work with (something similar also came up here, where a user wanted to display text as a list). We could probably allow token-based styling, but this might be annoying to generate, especially programmatically… I’ll think about this, and I’m also very open to suggestions and examples of potential use cases!
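
Just to sketch what generating token-based styling programmatically might involve: the example below attaches a "style" key to each token that falls inside a character range you want emphasized. To be clear, the "style" key, the `tokens_with_style` helper, and the regex tokenizer are all hypothetical; nothing here is an existing feature or format.

```python
import re

def tokens_with_style(text, emphasized):
    """Tokenize text and mark tokens inside the given (start, end) ranges.

    The "style" key is purely hypothetical and only illustrates the idea.
    """
    tokens = []
    for i, m in enumerate(re.finditer(r"\w+|[^\w\s]", text)):
        inside = any(s <= m.start() and m.end() <= e for s, e in emphasized)
        tokens.append({
            "id": i,
            "text": m.group(),
            "start": m.start(),
            "end": m.end(),
            "style": "bold" if inside else None,
        })
    return tokens

text = "Warning: do not restart the server."
start = text.index("do not")
styled = tokens_with_style(text, [(start, start + len("do not"))])

bold = [t["text"] for t in styled if t["style"] == "bold"]
print(bold)  # ['do', 'not']
```

The raw text stays untouched and the styling lives only on the tokens, but you’d still have to compute the emphasized ranges yourself and keep them aligned with whatever tokenizer is used, which is the annoying part.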