Best way to format text for annotation?


I’m trying to collect data for a task where the formatting of the text is incredibly useful for annotators. I’ve been using a custom recipe to generate the annotations (based on ner.manual), with the text being the value of an ‘html’ key in the stream, but I haven’t been able to get tags such as "<b>text</b>" and similar markup to render properly. Instead of appearing as bold, the tags show up literally ("<b>text</b>") on the card. I’ve seen other posts on this forum where the tags have been used, so I’m a bit confused why this isn’t working and would like a sanity check. Thanks!

The "text" property is always expected to be the raw text and is also what’s used for training the model. You can use HTML with the "html" interface – but doing this for manual NER annotation is quite difficult. Ultimately, this comes down to a reason similar to the one described in this thread: if you’re annotating text by hand, you’re creating character offsets, and you need every character and every token. You also need a way of distinguishing between intentional formatting and bad markup in your raw data, which can be very difficult.
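To make the offset problem concrete, here is a minimal sketch in plain Python. It is not Prodigy code: the example sentence and the naive regex tag stripping are assumptions for illustration only. It shows that offsets selected on the rendered (tag-free) view point at the wrong characters in the raw string.

```python
import re

# Hypothetical raw text that still contains markup.
raw = "Call <b>Apple</b> support today."

# Naive tag stripping, for illustration only (not a real HTML parser).
rendered = re.sub(r"<[^>]+>", "", raw)

# Offsets of "Apple" as an annotator would select it in the rendered view.
start = rendered.index("Apple")
end = start + len("Apple")

span_in_rendered = rendered[start:end]  # "Apple"
span_in_raw = raw[start:end]            # "<b>Ap", the offsets no longer line up

print(span_in_rendered, "|", span_in_raw)
```

So if the model is trained on the raw string, the spans collected on a formatted view would need to be remapped, and that mapping has to account for every character of markup.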

For example, if you have a lot of newlines in your text, those will impact the model’s predictions, so the training data should reflect that. The same goes for leftover HTML markup – random HTML tags are super common in data that wasn’t perfectly cleaned, so during annotation, you definitely want to see things like <span></span> (which would otherwise be invisible). Finally, the manual interface pre-tokenizes the text to make selection more efficient – so if the text contains formatting, you’d have to ensure that the tokenizer always leaves it intact. Alternatively, we’d have to let you disable the token-based selection, which in turn would make annotation more difficult again for your annotators. Those are all reasons why formatting text in manual annotation mode can be problematic.
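As a rough illustration of the tokenization point, here is a toy example in plain Python. The regex tokenizer is an assumption and deliberately simplistic; it is not the tokenizer the manual interface uses. It only shows how markup ends up as extra tokens unless the tokenizer is told to treat it specially.

```python
import re

def toy_tokenize(text):
    """Split into runs of word characters and single punctuation characters."""
    return re.findall(r"\w+|[^\w\s]", text)

plain_tokens = toy_tokenize("See Acme now")
markup_tokens = toy_tokenize("See <b>Acme</b> now")

print(plain_tokens)   # ['See', 'Acme', 'now']
print(markup_tokens)  # ['See', '<', 'b', '>', 'Acme', '<', '/', 'b', '>', 'now']
```

With token-based selection, each of those angle brackets and tag names would become a selectable unit, and it would also be part of what the model sees at training time.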

It’s actually a tricky problem, and I don’t think I have a perfect solution yet. I definitely see the logic behind formatting text in manual annotation mode, especially for emphasis or to structure it in a way that makes it easier to work with (something similar also came up here, where a user wants to display text as a list). We could probably allow token-based styling, but this might be annoying to generate, especially programmatically… I’ll think about this, and I’m also very open to suggestions and examples of potential use cases!
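For the sake of discussion, here is a sketch of what generating token-based styling programmatically might look like. Everything here is hypothetical: the "bold" key on each token is not an existing Prodigy feature, and the function and class names are made up. It uses Python's standard html.parser to separate the tag-free text from the formatting and records character offsets into that tag-free text.

```python
import re
from html.parser import HTMLParser

class _StyleCollector(HTMLParser):
    """Collects tag-free text plus whitespace tokens carrying a hypothetical 'bold' flag."""

    def __init__(self):
        super().__init__()
        self.text = ""
        self.tokens = []
        self.bold_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("b", "strong"):
            self.bold_depth += 1

    def handle_endtag(self, tag):
        if tag in ("b", "strong") and self.bold_depth > 0:
            self.bold_depth -= 1

    def handle_data(self, data):
        base = len(self.text)  # offset of this chunk in the tag-free text
        for match in re.finditer(r"\S+", data):
            self.tokens.append({
                "text": match.group(),
                "start": base + match.start(),
                "end": base + match.end(),
                "bold": self.bold_depth > 0,  # hypothetical styling key
            })
        self.text += data

def styled_tokens(html):
    """Return (tag-free text, tokens with offsets and a hypothetical 'bold' flag)."""
    collector = _StyleCollector()
    collector.feed(html)
    collector.close()
    return collector.text, collector.tokens

text, tokens = styled_tokens("Call <b>Apple</b> support")
for token in tokens:
    print(token)
```

Note the limits of this sketch: it splits on whitespace only, a word that is only partly bold would be split into two tokens, and it assumes all markup in the input is intentional formatting, which is exactly the assumption that is hard to guarantee for real data.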

Hi, is there an alternative to the solution suggested above for the formatted text problem?

@Sarah Hi! There's not really an easy solution, because it's more of a conceptual problem and there's no easy answer for how to treat HTML markup. I've also posted about this in more detail on this thread, and why it's difficult to render HTML if you're highlighting manually and planning on training a statistical model using the resulting data: