Does Prodigy support HTML annotation for NER?

(Jay) #1

I have quite a lot of HTML pages. My goal is to train a model in spaCy that can recognize entities such as product names, addresses and monetary amounts.

My question is: does Prodigy support annotation on HTML pages? Not raw HTML code per se, but annotation on rendered HTML with a custom list of entities.

Thank you!



(Ines Montani) #2

Annotating rendered HTML might sound appealing at first, but there’s no easy answer to how the annotations should be resolved back to the underlying raw text, or how to ensure that the annotations are consistent. After all, what your model will get to see is the raw text.

One common solution is to write a function that takes raw HTML, strips out the markup, tokenizes the text and stores each token’s character offset into the original raw HTML. This way, you can work with raw text without markup, while still being able to resolve the character offsets of your annotations back to the original input.
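
Here is a minimal sketch of that idea in plain Python, just to illustrate the offset bookkeeping. It is not a Prodigy or spaCy API: the function name `html_to_tokens` and the `html_start` / `html_end` keys are made up for this example, the tag regex is deliberately naive (it ignores comments, `<script>` / `<style>` content and `>` inside attribute values), and the whitespace tokenizer is a stand-in for a real one. Entities such as `&amp;` are left undecoded so the offsets stay valid.

```python
import re

TAG_RE = re.compile(r"<[^>]*>")  # naive assumption: a tag is anything between < and >
TOKEN_RE = re.compile(r"\S+")    # naive whitespace tokenizer, stand-in for a real one


def html_to_tokens(html):
    """Strip markup and return (text, tokens).

    Each token records its offsets in the stripped text ("start", "end")
    and in the original raw HTML ("html_start", "html_end").
    """
    tokens = []
    pieces = []
    text_len = 0
    pos = 0
    # Tag boundaries, plus a sentinel so the text after the last tag is processed too.
    boundaries = [(m.start(), m.end()) for m in TAG_RE.finditer(html)]
    boundaries.append((len(html), len(html)))
    for tag_start, tag_end in boundaries:
        segment = html[pos:tag_start]  # raw text between the previous tag and this one
        for m in TOKEN_RE.finditer(segment):
            if pieces:
                text_len += 1  # tokens are joined by a single space in the stripped text
            tokens.append({
                "text": m.group(),
                "start": text_len,
                "end": text_len + len(m.group()),
                "html_start": pos + m.start(),
                "html_end": pos + m.end(),
            })
            pieces.append(m.group())
            text_len += len(m.group())
        pos = tag_end
    return " ".join(pieces), tokens


html = "<p>Price: <b>$20</b> today</p>"
text, tokens = html_to_tokens(html)
for token in tokens:
    # Both sets of offsets point at the same token text.
    assert text[token["start"]:token["end"]] == token["text"]
    assert html[token["html_start"]:token["html_end"]] == token["text"]
```

You would annotate `text`, and for any annotated span look up the tokens it covers to get back to the position in the raw HTML. Note that this sketch treats every tag as a token boundary, so a word split by inline markup (like `foo<b>bar</b>`) comes out as two tokens.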