Annotate Raw HTML

Hi Prodigy.

I am quite new to the platform, so sorry if this is an obvious question, but I checked other postings and there does not seem to be a similar question. I would like to annotate raw HTML from a document. At the moment, when I load the HTML into Prodigy, it is rendered as plain text. For example: if I have the code <b> example </b>, I would like it rendered as the word "example" in bold rather than as the literal string <b> example </b>.

Thanks

Hi! The "text" value of an annotation task will always be plain text – if you want to render HTML, you need to use the "html" key and html interface explicitly. See here for more details:

The text property of an annotation task will always be rendered as plain text. To add markup like line breaks or simple formatting, use the html key instead. Prodigy wants you to explicitly choose to use “HTML mode” to avoid stray HTML tags (which can influence the model’s predictions) from being rendered or hidden – for example, if you’re working with raw, uncleaned data. If you’re using one of Prodigy’s default recipes with a model in the loop, keep in mind that the text of an annotation task is used to update the model.
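As a rough illustration of the difference, here is a minimal sketch of how the same markup could be written out as tasks under either key. The helper names (make_task, to_jsonl) are made up for this example and are not part of Prodigy; the only thing taken from the explanation above is that "text" is shown as plain text while "html" is rendered by the html interface.

```python
import json


def make_task(markup: str, render: bool) -> dict:
    """Build one annotation task (hypothetical helper, not a Prodigy API).

    render=True  -> store the markup under "html" so the html interface renders it.
    render=False -> store it under "text" so the tags are shown literally.
    """
    key = "html" if render else "text"
    return {key: markup}


def to_jsonl(tasks: list) -> str:
    """Serialize tasks as newline-delimited JSON, one task per line."""
    return "\n".join(json.dumps(task) for task in tasks)


snippets = ["<b> example </b>", "line one<br>line two"]

# Tasks intended for the html interface: the markup is rendered.
html_tasks = [make_task(s, render=True) for s in snippets]
jsonl = to_jsonl(html_tasks)
print(jsonl)
```

The resulting lines could be saved to a .jsonl file and loaded as the input source. Which recipe and interface settings to use depends on your setup, so treat this only as a sketch of the data format.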

If you want to do manual annotation and highlight spans of text, or if you're creating training data for a model, there's another reason why markup is typically preserved as plain text. See here for details:

When you’re highlighting spans in the manual interface, you’re still annotating raw text and are creating spans that map to character offsets within that raw text. If you pass in "<strong>hello</strong>", there’s no clear solution for how this should be handled. How should it be tokenized, and what are you really labelling here? The underlying markup or just the text, and what should the character offsets point to? And how should other markup be handled, e.g. images or complex, nested tags?
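To make the offset problem concrete, here is a small sketch in plain Python (no Prodigy calls). The span_for helper and the GREETING label are invented for the example; the "start" / "end" / "label" keys are assumed to follow the usual character-offset span format.

```python
def span_for(text: str, phrase: str, label: str) -> dict:
    """Return a character-offset span for the first occurrence of phrase.

    Hypothetical helper for illustration; raises ValueError if phrase is absent.
    """
    start = text.index(phrase)
    return {"start": start, "end": start + len(phrase), "label": label}


raw_text = "<strong>hello</strong> world"  # what the model would see
rendered_text = "hello world"              # what a browser would display

raw_span = span_for(raw_text, "hello", "GREETING")
rendered_span = span_for(rendered_text, "hello", "GREETING")

# The same word sits at different offsets depending on which string you index.
print(raw_span)
print(rendered_span)

# Offsets computed on the rendered text point at markup in the raw text.
print(raw_text[rendered_span["start"]:rendered_span["end"]])
```

Because the two strings disagree about where "hello" starts, a span created on the rendered version cannot be applied to the raw text without extra alignment logic, which is why the manual interface sticks to the raw text.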

Similarly, if you’re planning on training a model later on, that model will also get to see the raw text, including the markup. So if you are working with raw HTML (like web dumps), you usually want to see the original raw text that the model will be learning from. Otherwise, the model might be seeing data or markup that you didn’t see during annotation, which is problematic.

Thanks for that.
