Annotate Raw HTML

bmd123 · January 17, 2020, 11:34am

Hi Prodigy.

I am quite new to the platform, so sorry if this is an obvious question, but I checked other posts and there does not seem to be a similar one. I would like to annotate raw HTML from a document. At the moment, when I load the HTML into Prodigy, it is rendered as plain text. For example, if I have an example wrapped in HTML tags, I would like it rendered with its formatting applied rather than shown as the raw markup.

Thanks

ines · January 17, 2020, 12:07pm

Hi! The "text" value of an annotation task will always be plain text – if you want to render HTML, you need to use the "html" key and html interface explicitly. See here for more details:

The text property of an annotation task will always be rendered as plain text. To add markup like line breaks or simple formatting, use the html key instead. Prodigy wants you to explicitly choose to use “HTML mode” to avoid stray HTML tags (which can influence the model’s predictions) from being rendered or hidden – for example, if you’re working with raw, uncleaned data. If you’re using one of Prodigy’s default recipes with a model in the loop, keep in mind that the text of an annotation task is used to update the model.
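As a minimal sketch of what that could look like, assuming your input is just a list of raw HTML strings: the helper names `make_html_tasks` and `to_jsonl` below are made up for illustration and are not part of Prodigy. Each task simply carries the markup under the "html" key instead of "text".

```python
import json


def make_html_tasks(html_snippets):
    """Wrap each raw HTML string in a task dict that uses the "html" key."""
    return [{"html": snippet} for snippet in html_snippets]


def to_jsonl(tasks):
    """Serialize tasks as newline-delimited JSON, one task per line."""
    return "\n".join(json.dumps(task) for task in tasks)


# Hypothetical example snippets
snippets = ["<strong>example</strong>", "first line<br>second line"]
tasks = make_html_tasks(snippets)
jsonl = to_jsonl(tasks)
print(jsonl)
```

If you save that output to a file, you should be able to load it with a recipe that lets you choose the interface, for example something like `prodigy mark your_dataset tasks.jsonl --view-id html`, where the dataset and file names are placeholders.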

If you want to do manual annotation and highlight spans of text, or if you're creating training data for a model, there's another reason why markup is typically preserved as plain text. See here for details:

When you’re highlighting spans in the manual interface, you’re still annotating raw text and are creating spans that map to character offsets within that raw text. If you pass in "<strong>hello</strong>", there’s no clear solution for how this should be handled. How should it be tokenized, and what are you really labelling here? The underlying markup or just the text, and what should the character offsets point to? And how should other markup be handled, e.g. images or complex, nested tags?
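To make the offset problem concrete, here is a small sketch in plain Python (no Prodigy calls). The sample string, the label name and the naive regex-based tag stripping are all assumptions for illustration. It shows that the same word sits at different character offsets depending on whether you count against the raw markup or against the text with tags removed.

```python
import re

# Hypothetical raw input that still contains markup
raw_text = "<strong>hello</strong> world"
target = "hello"

# Offsets into the raw text, which is what a span would point to
start = raw_text.find(target)
end = start + len(target)
span = {"start": start, "end": end, "label": "GREETING"}  # label is made up

# Offsets into the text with tags naively stripped (illustration only)
stripped_text = re.sub(r"<[^>]+>", "", raw_text)
stripped_start = stripped_text.find(target)

print(span)            # start 8, end 13 in the raw text
print(stripped_start)  # 0 in the stripped text
```

A span created against one version of the text will not line up with the other, which is why the annotations need to refer to exactly the text the model sees.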

Similarly, if you’re planning on training a model later on, that model will also get to see the raw text, including the markup – so if you are working with raw HTML (like web dumps), you usually always want to see the original raw text that the model will be learning from. Otherwise, the model might be seeing data or markup that you didn’t see during annotation, which is problematic.

bmd123 · January 23, 2020, 10:53am

Thanks for that.
