Is there any way to annotate text with HTML tags in it?

dhavalv83 · February 21, 2025, 11:24am

Hello,

We have texts that contain many different HTML tags. We want to annotate them in such a way that the HTML tags are not shown in the UI, but their characters still count toward the token and span offsets. The text should also be displayed well formatted, according to the HTML tags.

In the attached image, the text on the left is the original text with HTML tags. We want something like the right part of the image, where the text is shown with the proper structure based on the HTML tags and the user can annotate it, just like "Multiplex Assays".

Note: The texts can be very large, too.

magdaaniol · February 25, 2025, 8:53am

Hi @dhavalv83,
Annotating spans in HTML is not trivial because it is unclear how the raw HTML should be split into tokens, especially in the case of complex and nested markup. Additionally, not seeing parts of the raw text while annotating also tends to lead to problems during modeling. Please see these posts from Ines where she provides more details on the topic as well as possible approaches:

You can, of course, build a custom tokenizer to help you separate HTML tags from the raw text (similar to this one maybe) and have a meaningful tokenization of the HTML. You can even hide the HTML tokens (by adding "style": {"display": "none"} to each tag token's dictionary) so that they are not visualized in the Prodigy UI, but you'd still have to annotate on the raw text, i.e., using the ner_manual or spans_manual UI. Marking spans in the html UI is not enabled for the reasons discussed in the cited posts. This means that your span and token offsets will map back to the original HTML and you won't be seeing the markup, but you also won't get the HTML layout in the UI. Note that the style key lets you specify any CSS style, so you do have a way to define how tokens should be displayed.
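
To make the tokenizer idea a bit more concrete, here is a minimal sketch of how such a pre-tokenization step could look. It is only an illustration under a few assumptions: the function name html_to_task is made up, the regex treats anything between "<" and ">" as a tag (so it will not cope with comments, scripts or stray angle brackets in the text), and the token dictionaries use the usual "text", "start", "end", "id" and "ws" keys plus the "style" key mentioned above.

```python
import re

# One HTML tag, or one run of characters without whitespace or "<",
# or a stray "<" that does not open a tag (kept so no character is lost).
TOKEN_RE = re.compile(r"<[^>]+>|[^\s<]+|<")
TAG_RE = re.compile(r"<[^>]+>")


def html_to_task(raw):
    """Build a task dict whose token offsets point into the raw HTML.

    Hypothetical helper, not part of Prodigy. Tag tokens get a
    "style" entry so they can be hidden in the UI.
    """
    tokens = []
    for i, match in enumerate(TOKEN_RE.finditer(raw)):
        end = match.end()
        token = {
            "text": match.group(),
            "start": match.start(),
            "end": end,
            "id": i,
            # True if the token is followed by whitespace in the raw text
            "ws": end < len(raw) and raw[end].isspace(),
        }
        if TAG_RE.fullmatch(match.group()):
            # Assumption taken from the post above: hide markup tokens via CSS
            token["style"] = {"display": "none"}
        tokens.append(token)
    return {"text": raw, "tokens": tokens}


task = html_to_task("<p>Multiplex <b>Assays</b></p>")
visible = [t["text"] for t in task["tokens"] if "style" not in t]
```

Since "start" and "end" are character offsets into the untouched HTML string, any span annotated on the visible tokens can be mapped straight back to the original document.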

Some other approaches to consider:

You might show both the HTML and raw text version of it using blocks. The HTML would be there to facilitate the reading, and the raw text is where the actual annotation would take place. With long texts, it might be counterproductive, though.
You might also consider a preprocessing routine that strips the HTML altogether and pre-formats the text using whitespace. You would then work with the preprocessed text for both annotation and training. This, of course, depends on how complex the layouts are.
Finally, if the HTML and layouts are complex and varied and are important for modeling, you might consider converting the data to PDF and annotating using Prodigy pdf.spans.manual instead.
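
For the preprocessing option, a rough sketch using only Python's standard library html.parser could look like the following. The strip_html function and the BLOCK_TAGS set are illustrative choices, not anything Prodigy provides: block-level tags are turned into line breaks, inline tags are dropped, and runs of whitespace are collapsed. Which tags count as "block-level" is an assumption you would adapt to your own layouts.

```python
from html.parser import HTMLParser

# Assumption: tags that should produce a line break in the plain text
BLOCK_TAGS = {"p", "div", "br", "li", "tr", "h1", "h2", "h3", "h4", "h5", "h6"}


class _TextExtractor(HTMLParser):
    """Collects text content and inserts newlines around block-level tags."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in BLOCK_TAGS:
            self.parts.append("\n")

    def handle_endtag(self, tag):
        if tag in BLOCK_TAGS:
            self.parts.append("\n")

    def handle_data(self, data):
        self.parts.append(data)


def strip_html(raw):
    """Return whitespace-formatted plain text for a raw HTML string."""
    parser = _TextExtractor()
    parser.feed(raw)
    parser.close()
    text = "".join(parser.parts)
    # Collapse whitespace inside each line and drop empty lines
    lines = [" ".join(line.split()) for line in text.split("\n")]
    return "\n".join(line for line in lines if line)


plain = strip_html(
    "<h1>Title</h1><p>Multiplex <b>Assays</b> &amp; more</p>"
    "<ul><li>one</li><li>two</li></ul>"
)
```

Keep in mind that offsets annotated on the stripped text refer to that stripped text, not to the original HTML, which is why the same preprocessing has to be applied at training and prediction time.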

