Does Prodigy support HTML annotation for NER?

I have quite a lot of HTML pages. My goal is to train a spaCy model that can recognize entities such as product names, addresses and monetary amounts.

My question is: does Prodigy support annotation on HTML pages? Not the raw HTML code per se, but annotation on the rendered HTML with a custom list of entities.

Thank you!

Jay

Annotating rendered HTML might sound appealing at first, but there's no easy answer for how the annotations should be resolved back to the underlying raw text, or for how to ensure that the annotations are consistent. After all, what your model will get to see is the raw text.

I discuss some of these considerations in more detail on this thread:

One common solution is to write a function that takes raw HTML, strips out the markup, tokenizes the text and stores each token's character offset into the original raw HTML. This way, you can work with raw text without markup, while still being able to resolve the character offsets of your annotations back to the original input.
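To make that a bit more concrete, here is a minimal sketch of the idea using only Python's standard library. The names `OffsetTokenizer`, `tokenize_html`, `build_plain_text` and `to_raw_span` are made up for illustration and are not part of Prodigy or spaCy. It assumes whitespace tokenization, skips `<script>` and `<style>` content, and ignores character references such as `&amp;`, so treat it as a starting point rather than a complete solution.

```python
import re
from html.parser import HTMLParser


class OffsetTokenizer(HTMLParser):
    """Collect whitespace-delimited tokens with offsets into the raw HTML."""

    SKIP_TAGS = {"script", "style"}

    def __init__(self, html):
        # convert_charrefs=False keeps text chunks identical to the raw input,
        # so positions inside a chunk map directly onto raw character offsets.
        # Side effect (limitation of this sketch): entities like &amp; are dropped.
        super().__init__(convert_charrefs=False)
        # getpos() reports (line, column), so remember where each line starts.
        self.line_starts = [0] + [m.end() for m in re.finditer("\n", html)]
        self.skip_depth = 0
        self.tokens = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP_TAGS:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP_TAGS and self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        if self.skip_depth:
            return
        line, col = self.getpos()  # position of the start of this text chunk
        chunk_start = self.line_starts[line - 1] + col
        for m in re.finditer(r"\S+", data):
            start = chunk_start + m.start()
            self.tokens.append(
                {"text": m.group(), "start": start, "end": start + len(m.group())}
            )


def tokenize_html(html):
    """Return tokens whose start/end are character offsets into `html`."""
    parser = OffsetTokenizer(html)
    parser.feed(html)
    parser.close()
    return parser.tokens


def build_plain_text(tokens):
    """Markup-free text to annotate: tokens joined by single spaces."""
    return " ".join(tok["text"] for tok in tokens)


def to_raw_span(tokens, first, last):
    """Map an annotated token range (inclusive indices) back to raw HTML offsets."""
    return tokens[first]["start"], tokens[last]["end"]


html = "<p>Buy the <b>Acme\nWidget</b> for $5.</p>"
tokens = tokenize_html(html)
plain_text = build_plain_text(tokens)
raw_start, raw_end = to_raw_span(tokens, 2, 3)  # the tokens "Acme" and "Widget"
```

You would annotate `plain_text` (or the token list), and because every token keeps its `start` and `end` in the original HTML, a span over tokens can always be resolved back with `to_raw_span`. Slicing the raw HTML with those offsets gives you the original markup-level span, including anything that sat between the tokens.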

Hi @ines

I just stumbled upon this and I'm trying to wrap my head around it, but I just can't understand it :slight_smile: Would you mind adding a few more words, or a few lines of code for illustration?

My use case is span classification, specifically subhead classification. I know I have to preprocess the HTML into text at some point, but I'd like to use the HTML in the tokenization process (for improved sentence boundaries, for example). So the input is really HTML, and I want to mark where the subheads are, simply as character offsets.

Alternatively, I could just preprocess the HTML to text and start highlighting subheads from there, but they are a lot harder to find that way. The classification would also depend on using that exact HTML-to-text method.

Hi @nix411!

I can't speak for Ines, but have you seen @pmbaumgartner's spacy-html-tokenizer?

It does the first few steps, but not the character offsets. Perhaps you could look at Peter's code and modify some steps.

I suspect (with my limited knowledge) that there's no general solution that's easy, which is reflected in Ines' point.