Is there any way to annotate text with HTML tags in it?

Hello,

We have texts that contain many different HTML tags. We want to annotate them in such a way that the HTML tags are not shown in the UI, but their characters still count toward the token and span offsets. The text should also be displayed well formatted, following the HTML tags.

In the attached image, the text on the left side is the original text with HTML tags. We want something like the right side of the image, where the text is shown with the proper structure based on the HTML tags and the user can annotate it, just like "Multiplex Assays".

Note: The texts can also be very large.

Hi @dhavalv83,
Annotating spans in HTML is not trivial because it is unclear how the raw HTML should be split into tokens, especially in the case of complex and nested markup. Additionally, not seeing parts of the raw text while annotating also tends to lead to problems during modeling. Please see these posts from Ines where she provides more details on the topic as well as possible approaches:

You can, of course, build a custom tokenizer that separates HTML tags from the raw text (similar to this one maybe) and gives you a meaningful tokenization of the HTML. You can even hide the HTML tokens in the Prodigy UI by adding "style": {"display": "none"} to each of their token dictionaries, but you'd still have to annotate on the raw text, i.e., using the ner_manual or spans_manual UI. Marking spans in the html UI is not enabled for the reasons discussed in the cited posts. With this setup, your span and token offsets map back to the original HTML and you won't see the markup, but you also won't get the HTML layout in the UI. Note that the style key lets you specify any CSS style, so you do have a way to define how tokens should be displayed.

Some other approaches to consider:

  • You might show both the rendered HTML and the raw text version of it using blocks. The HTML would be there to make reading easier, and the raw text is where the actual annotation would take place. With long texts, this might be counterproductive, though.

  • You might also consider a preprocessing routine that strips the HTML altogether and pre-formats the text using whitespace. You would work with this preprocessed text both for annotation and training. This, of course, depends on how complex the layouts are.

  • Finally, if the HTML and layouts are complex and varied and are important for modeling, you might consider converting the data to PDF and annotating using Prodigy pdf.spans.manual instead.
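
For the preprocessing approach (stripping the HTML and pre-formatting with whitespace), a minimal sketch using only Python's standard library could look like this. The set of tags treated as line breaks, and the names BLOCK_TAGS, TextExtractor and strip_html, are assumptions made for this example; you would adjust them to your own layouts. Keep in mind that offsets in the stripped text no longer map back to the original HTML.

```python
from html.parser import HTMLParser

# Assumption: these tags start a new line; everything else is inline.
BLOCK_TAGS = {"p", "div", "br", "li", "ul", "ol", "tr",
              "h1", "h2", "h3", "h4", "h5", "h6"}


class TextExtractor(HTMLParser):
    """Collect text content and insert newlines around block-level tags."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in BLOCK_TAGS:
            self.parts.append("\n")

    def handle_endtag(self, tag):
        if tag in BLOCK_TAGS:
            self.parts.append("\n")

    def handle_data(self, data):
        self.parts.append(data)


def strip_html(html):
    """Return plain text with one line per block element."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    text = "".join(parser.parts)
    # Collapse runs of whitespace inside each line and drop empty lines
    lines = [" ".join(line.split()) for line in text.splitlines()]
    return "\n".join(line for line in lines if line)


sample = ("<h1>Results</h1><p>We ran <b>Multiplex Assays</b> today.</p>"
          "<ul><li>One</li><li>Two</li></ul>")
plain = strip_html(sample)
```

The resulting text can then be used as the "text" of each task, for both annotation and training.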