About HTML NER extraction

Hi everyone,
I saw the earlier thread about NER on HTML, "Using ner.manual on HTML Input", and would like to know whether there is any update on how to do NER on raw HTML.

If not, would it be an option to replace the < and > characters with some token like @@ and annotate the result as raw (and ugly) text?


I guess Ines' original questions from that thread are still valid:

If you pass in "<strong>hello</strong>", there’s no clear solution for how this should be handled. How should it be tokenized, and what are you really labelling here? The underlying markup or just the text, and what should the character offsets point to? And how should other markup be handled, e.g. images or complex, nested tags?

With all the tags and weird characters in between, it seems like any ML model will have a hard time recognizing anything?

Is it not an option to preprocess your files to clean text first?

The annotation should include the tags themselves, e.g. "<strong>text</strong>".

The reason not to strip the HTML tags is that the tag matters:
an h4 should give a different result than an h1.

It is OK to strip the < and > characters.

Is the requirement clearer now?

You can try and include that, or strip out all HTML except for some tags you care about, so your headlines end up looking like h4 Some subheadline if you want to reflect the headline weight in your input data – however, I'm really not sure how useful this will be, especially not for NER. The model has a pretty narrow context window on either side, so it mostly won't be able to take the signal from those tags into account when making its predictions.
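To make the "strip everything except the tags you care about" idea concrete, here is a rough sketch using only Python's standard-library `html.parser`. The `KEEP_TAGS` set and the function name `html_to_marked_text` are made up for illustration and are not part of Prodigy or spaCy. It also joins text chunks with spaces, which is a simplification (inline markup inside a word would split that word).

```python
from html.parser import HTMLParser

# Assumption: only headline tags carry signal we want to keep in the text.
KEEP_TAGS = {"h1", "h2", "h3", "h4", "h5", "h6"}


class HeadlineMarker(HTMLParser):
    """Drops all markup, but emits the tag name in front of kept elements."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_starttag(self, tag, attrs):
        # html.parser already lowercases tag names.
        if tag in KEEP_TAGS:
            self.parts.append(tag)

    def handle_data(self, data):
        # Normalize whitespace and skip chunks that are whitespace only.
        text = " ".join(data.split())
        if text:
            self.parts.append(text)


def html_to_marked_text(html):
    """Hypothetical helper: HTML in, plain text with headline markers out."""
    parser = HeadlineMarker()
    parser.feed(html)
    parser.close()
    return " ".join(parser.parts)


print(html_to_marked_text("<h4>Some subheadline</h4><p>Body <strong>text</strong> here.</p>"))
```

The resulting string can be annotated like any other plain text, with the caveat that the character offsets then refer to this cleaned text and not to the original HTML.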

It sounds like what you're really looking for is a way to include information about the formatting as features in your model. But this needs some experimentation and a custom implementation – and you probably want to strip this information out of the text and attach it to the tokens, instead of including the markup directly.
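As a starting point for that kind of experiment, the sketch below separates the markup from the text and records, for each whitespace-separated token, the tags it was nested in. It uses only the standard-library `html.parser`. The class and function names, the naive whitespace tokenization, and the `VOID_TAGS` list are assumptions for illustration. How those tag tuples are then fed into a model as features is left open, since that is the custom part.

```python
from html.parser import HTMLParser

# Assumption: a small subset of void elements that never get a closing tag.
VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}


class TokenFormatCollector(HTMLParser):
    """Collects (token, enclosing_tags) pairs while discarding the markup."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.stack = []   # currently open tags, outermost first
        self.tokens = []  # list of (token, tuple_of_enclosing_tags)

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop up to and including the matching open tag; ignore stray closers.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        # Naive whitespace tokenization, purely for illustration.
        for word in data.split():
            self.tokens.append((word, tuple(self.stack)))


def tokens_with_formatting(html):
    """Hypothetical helper: returns (clean_text, [(token, tags), ...])."""
    collector = TokenFormatCollector()
    collector.feed(html)
    collector.close()
    clean_text = " ".join(token for token, _ in collector.tokens)
    return clean_text, collector.tokens


text, tokens = tokens_with_formatting(
    "<h1>Acme Corp</h1><p>Founded by <strong>Jane Doe</strong>.</p>"
)
print(text)
for token, tags in tokens:
    print(token, tags)
```

The clean text is what you would annotate, while the per-token tag tuples stay available as side information instead of sitting inside the text itself.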