I tried looking through past support tickets but I can't seem to find a solution for this. I'm trying to render samples of email bodies in the text field of the manual annotation tool. Problem is, there is heavy HTML tagging, and I need a way to be able to display the text as an actual email, as the users are unable to parse HTML to find the content they need.
Hi! So what's the result you're looking for in terms of markup, and how would you solve the conceptual problem of mapping the rendered HTML markup back to plain text?
I think the simplest solution would be to add a preprocessing step that strips out the HTML and/or replaces it in a way that's easy to read for humans, while still retaining a reference to the formatting (if that's important to you). If you need to map the annotations you collect back to the original HTML markup, you'd have to keep a reference to the character offsets into the text, so you can relate your annotations back. It's more involved, but it's a common way to deal with HTML in NLP.
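To make the offset idea a bit more concrete, here's a minimal sketch of stripping tags while remembering where each character came from. The helper names `strip_tags_with_offsets` and `span_to_html` are made up for illustration, and the regex is a deliberately crude assumption: it treats anything between `<` and `>` as a tag and doesn't decode entities like `&amp;`, so real emails would likely need a proper HTML parser.

```python
import re

# Crude tag matcher (assumption): breaks if ">" appears inside an attribute value.
TAG_RE = re.compile(r"<[^>]*>")


def strip_tags_with_offsets(html):
    """Return (plain_text, offsets), where offsets[i] is the index in
    `html` of the character plain_text[i]."""
    chars = []
    offsets = []
    pos = 0
    for match in TAG_RE.finditer(html):
        # Keep everything between the previous tag and this one.
        for i in range(pos, match.start()):
            chars.append(html[i])
            offsets.append(i)
        pos = match.end()
    # Keep whatever follows the last tag.
    for i in range(pos, len(html)):
        chars.append(html[i])
        offsets.append(i)
    return "".join(chars), offsets


def span_to_html(offsets, start, end):
    """Map a [start, end) span in the plain text to a [start, end) span
    in the original HTML string."""
    return offsets[start], offsets[end - 1] + 1


html = "<p>Hello <b>Alice</b>, see you soon.</p>"
text, offsets = strip_tags_with_offsets(html)

# Pretend an annotator highlighted "Alice" in the stripped text.
start = text.index("Alice")
end = start + len("Alice")
html_start, html_end = span_to_html(offsets, start, end)
print(text)
print(html[html_start:html_end])
```

If a span covers text on both sides of a tag, the mapped HTML span will simply include that tag, which is usually what you want when relating annotations back to the markup.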
Thanks! Idea would be to map annotations back to the original text based on an ID in the text. I tried preprocessing the data but unfortunately there are all types of emails we're trying to get our users to annotate - emails that contain tables that format vertically when stripped down to just text, stacked images, chats, etc. - and this makes the text hard to read for the annotator.
If rendering HTML isn't possible, is it possible to highlight a list of words/phrases in parts of the text that is being annotated?
Yeah, the underlying problem here is just that there's not really a logical way to annotate rendered HTML content like images or charts as if it were plain text, and then train a model on the resulting plain text. The manual spans interface is intended for annotating text, most commonly for named entity recognition, but also other sequence prediction tasks.
I think the easiest solution would be to use patterns – see the documentation on match patterns for details. You can either use token-based patterns or exact string matches – for instance, the following pattern would pre-highlight all occurrences of "Hello" as GREETING:
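As a rough sketch, and assuming the patterns file is JSONL with one entry per line carrying a `label` and a `pattern` key, the two variants could be generated like this. The first entry is an exact string match. The second is a token-based pattern that matches case-insensitively on the lowercase form. The file name in the comment is just a placeholder.

```python
import json

patterns = [
    # Exact string match: only the literal string "Hello".
    {"label": "GREETING", "pattern": "Hello"},
    # Token-based match: any single token whose lowercase form is "hello".
    {"label": "GREETING", "pattern": [{"lower": "hello"}]},
]

# One JSON object per line (JSONL).
jsonl = "\n".join(json.dumps(p) for p in patterns)
print(jsonl)

# To save it for use with a recipe, write it out, e.g.:
# with open("greeting_patterns.jsonl", "w", encoding="utf8") as f:
#     f.write(jsonl + "\n")
```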
What exactly are you trying to achieve with the pattern in textcat.manual? Pre-select examples to annotate? The recipe is mostly intended for going through all examples and selecting one or more categories that apply to the text – so there's not really a logical place for the patterns to fit in.
If you just want to pre-select examples using pattern matches, you could just use the patterns in spaCy directly, save out all texts with at least one match (or any other logic you need) and then use that as the input for annotation.
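Something along these lines might work as a starting point. It's a sketch that assumes spaCy v3 is installed (the `Matcher.add` call uses the v3 signature) and uses a blank English pipeline, so no trained model is needed. The example texts and the GREETING pattern are placeholders for your own data.

```python
import spacy
from spacy.matcher import Matcher

# Blank pipeline: tokenizer only, which is enough for token patterns on LOWER.
nlp = spacy.blank("en")
matcher = Matcher(nlp.vocab)
# spaCy v3 signature: add(key, list_of_patterns)
matcher.add("GREETING", [[{"LOWER": "hello"}]])

# Placeholder texts standing in for the preprocessed email bodies.
texts = [
    "Hello team, the report is attached.",
    "Please find the invoice below.",
    "hello again, quick follow-up.",
]


def has_match(text):
    """True if at least one pattern matches anywhere in the text."""
    return len(matcher(nlp(text))) > 0


# Keep only the texts with at least one match, as {"text": ...} records.
selected = [{"text": text} for text in texts if has_match(text)]
for example in selected:
    print(example["text"])
```

You could then write `selected` out as JSONL, one record per line, and load that file as the input for annotation. Any other selection logic (a minimum number of matches, several labels, and so on) would go into `has_match`.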