Email rendering in Text field - HTML

Hi -

I tried looking through past support tickets but I can't seem to find a solution for this. I'm trying to render samples of email bodies in the text field of the manual annotation tool. Problem is, there is heavy HTML tagging, and I need a way to be able to display the text as an actual email, as the users are unable to parse HTML to find the content they need.

I see the "Why can't I add HTML formatting to the manual UI" section in https://prodi.gy/docs/api-interfaces#text-render-formatted. Is it impossible to build or activate an HTML parser in the annotation tool?

Thanks.

Hi! So what's the result you're looking for in terms of markup, and how would you solve the conceptual problem of mapping the rendered HTML markup back to plain text?

I think the simplest solution would be to add a preprocessing step that strips out the HTML and/or replaces it in a way that's easy to read for humans, while still retaining a reference to the formatting (if that's important to you). If you need to map the annotations you collect back to the original HTML markup, you'd have to keep a reference to the character offsets into the text, so you can relate your annotations back. It's more involved, but it's a common way to deal with HTML in NLP.
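To make the offset idea a bit more concrete, here is a minimal sketch using only the standard library. It assumes a simple regex is good enough to find tags (real emails with comments, scripts or entities like `&amp;` would need a proper HTML parser), and the helper names `strip_html_with_offsets` and `span_to_html` are made up for illustration, not part of Prodigy or spaCy.

```python
import re

# Assumption: a tag is anything between "<" and ">". This is a simplification.
TAG_RE = re.compile(r"<[^>]+>")


def strip_html_with_offsets(html):
    """Return (plain_text, offsets), where offsets[i] is the index in
    `html` of the character plain_text[i]."""
    chars = []
    offsets = []
    pos = 0
    for match in TAG_RE.finditer(html):
        # Keep everything between the previous tag and this one.
        for i in range(pos, match.start()):
            chars.append(html[i])
            offsets.append(i)
        pos = match.end()
    # Keep whatever follows the last tag.
    for i in range(pos, len(html)):
        chars.append(html[i])
        offsets.append(i)
    return "".join(chars), offsets


def span_to_html(offsets, start, end):
    """Map a [start, end) span in the plain text to [start, end) in the HTML."""
    return offsets[start], offsets[end - 1] + 1


html = "<p>Hello <b>Ada</b>, see you soon.</p>"
text, offsets = strip_html_with_offsets(html)

# Pretend an annotator highlighted "Ada" in the plain text.
start = text.index("Ada")
html_start, html_end = span_to_html(offsets, start, start + len("Ada"))
print(text)
print(html[html_start:html_end])
```

The plain text is what you'd send to the annotation tool, and the `offsets` list is what you'd store alongside each example so the collected spans can be related back to the original markup later.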

Thanks! Idea would be to map annotations back to the original text based on an ID in the text. I tried preprocessing the data but unfortunately there are all types of emails we're trying to get our users to annotate - emails that contain tables that format vertically when stripped down to just text, stacked images, chats, etc. - and this makes the text hard to read for the annotator.

If rendering HTML isn't possible, is it possible to highlight a list of words/phrases in parts of the text that is being annotated?

Yeah, the underlying problem here is just that there's not really a logical way to annotate or treat rendered HTML like images or charts as plain text and then train a model on the resulting plain text. The manual spans interface is intended for annotating text, most commonly for named entity recognition, but also other sequence prediction tasks.

I think the easiest solution would be to use patterns – see here for details. You can either use token-based patterns or exact string matches – for instance, the following pattern would pre-highlight all occurrences of "Hello" as GREETING:

{"pattern": "Hello", "label": "GREETING"}
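If you have a longer list of words and phrases, it may be easier to generate the patterns file with a small script. This is only a sketch: the file name `greeting_patterns.jsonl` and the helper `write_patterns` are placeholders, and the second entry assumes the usual token-based form where each token is a dict of attributes (here a case-insensitive match on "hello").

```python
import json
from pathlib import Path

# One exact string match and one token-based pattern, both labelled GREETING.
patterns = [
    {"label": "GREETING", "pattern": "Hello"},
    {"label": "GREETING", "pattern": [{"lower": "hello"}]},
]


def write_patterns(path, patterns):
    """Write one JSON object per line (JSONL)."""
    with Path(path).open("w", encoding="utf8") as f:
        for entry in patterns:
            f.write(json.dumps(entry) + "\n")


# Placeholder file name, use whatever fits your project.
write_patterns("greeting_patterns.jsonl", patterns)
```

The resulting file is what you'd pass in via `--patterns` for a recipe that supports it, such as `ner.manual`.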

Seems like adding a patterns json file is the way to go. Thanks!

@ines Is it possible to run a patterns file for textcat.manual?

I have a JSONL file following the format of patterns as shown in Prodigy Manual Patterns. I've replaced the names of my files with <> :

>> prodigy textcat.manual <dataset> <data file> --label <Label> --exclusive --patterns <patterns file>
Using 1 label(s): <Label>
usage: prodigy textcat.manual [-h] [-a None] [-lo None] [-l None] [-E]
                              [-e None]
                              dataset [source] [_]
prodigy textcat.manual: error: unrecognized arguments: --patterns <patterns file>

It seems that the example only shows use for ner.manual. Is there any way to get the patterns file fed into textcat.manual? Thanks!

What exactly are you trying to achieve with the pattern in textcat.manual? Pre-select examples to annotate? The recipe is mostly intended for going through all examples and selecting one or more categories that apply to the text – so there's not really a logical place for the patterns to fit in.

If you just want to pre-select examples using pattern matches, you could use the patterns in spaCy directly, save out all texts with at least one match (or any other logic you need), and then use that file as the input for annotation.
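As a rough illustration of that pre-selection step, here is a dependency-free stand-in that only handles exact string patterns, using a case-insensitive regex instead of spaCy's matcher. Token-based patterns are skipped here and would need spaCy. The function names and the example texts are invented for the sketch.

```python
import json
import re


def load_string_patterns(lines):
    """Collect exact-string patterns from JSONL lines.
    Token-based (list) patterns are skipped in this sketch."""
    phrases = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        entry = json.loads(line)
        if isinstance(entry["pattern"], str):
            phrases.append(entry["pattern"])
    return phrases


def select_matching(examples, phrases):
    """Keep only the examples whose text contains at least one phrase."""
    if not phrases:
        return []
    alternatives = "|".join(re.escape(p) for p in phrases)
    # Lookarounds so "Hello" does not match inside a longer word.
    regex = re.compile(r"(?<!\w)(?:" + alternatives + r")(?!\w)", re.IGNORECASE)
    return [eg for eg in examples if regex.search(eg["text"])]


pattern_lines = ['{"pattern": "Hello", "label": "GREETING"}']
examples = [
    {"text": "Hello team, the invoice is attached."},
    {"text": "Invoice attached."},
    {"text": "Oh, hello again!"},
]

selected = select_matching(examples, load_string_patterns(pattern_lines))
for eg in selected:
    # Each line is one JSONL record that could go into the input file.
    print(json.dumps(eg))
```

The printed lines can be saved to a `.jsonl` file and used as the source for `textcat.manual`, so annotators only see texts with at least one match.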