About HTML NER extraction

info2000 · June 9, 2021, 11:30pm

Hi Everyone.
I saw the topic about NER on HTML, "Using ner.manual on HTML Input", and want to know whether there is any update on how to do NER on raw HTML.

If not, would it be an option to replace the < > characters with some token like @@ and annotate the result as raw (and ugly) text?

Thanks

SofieVL · June 11, 2021, 2:53pm

I guess Ines' original questions from that thread are still valid:

If you pass in "<strong>hello</strong>", there’s no clear solution for how this should be handled. How should it be tokenized, and what are you really labelling here? The underlying markup or just the text, and what should the character offsets point to? And how should other markup be handled, e.g. images or complex, nested tags?

With all the tags and weird characters in between, it seems like any ML model will have a hard time recognizing anything.

Is it not an option to preprocess your files to clean text first?
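As a rough illustration of that kind of preprocessing, here is a minimal sketch that strips markup using only Python's standard library `html.parser`. The names `TextExtractor`, `html_to_text` and the `BLOCK_TAGS` set are made up for this example, and the sketch does not skip `<script>` or `<style>` contents, so real pages will likely need more care.

```python
from html.parser import HTMLParser

# Assumption: a small set of block-level tags that should act as word boundaries
BLOCK_TAGS = {"p", "div", "br", "li", "h1", "h2", "h3", "h4", "h5", "h6"}


class TextExtractor(HTMLParser):
    """Collects text nodes and drops all tags."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_starttag(self, tag, attrs):
        # Insert a space so text from adjacent blocks does not get glued together
        if tag in BLOCK_TAGS:
            self.parts.append(" ")

    def handle_endtag(self, tag):
        if tag in BLOCK_TAGS:
            self.parts.append(" ")

    def handle_data(self, data):
        self.parts.append(data)


def html_to_text(html: str) -> str:
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    # Collapse the whitespace runs left behind by removed markup
    return " ".join("".join(parser.parts).split())


print(html_to_text("<h1>Title</h1><p><strong>hello</strong> world</p>"))
```

The cleaned string can then be annotated like any other plain text, with character offsets that refer to the cleaned text rather than to the original HTML.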

info2000 · June 14, 2021, 3:27pm

The tagging should include the tags, e.g. "<strong>text</strong>".

The reason not to strip the HTML tags is that the tags matter: an h4 gives a different result than an h1.

It is OK to clean the < and > characters.

Are the needs clearer now?

ines · June 15, 2021, 3:12am

You can try and include that, or strip out all HTML except for some tags you care about, so your headlines end up looking like h4 Some subheadline if you want to reflect the headline weight in your input data – however, I'm really not sure how useful this will be, especially not for NER. The model has a pretty narrow context window on either side, so it mostly won't be able to take the signal from those tags into account when making its predictions.
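For what it's worth, the "strip everything except the tags you care about" idea could look roughly like the sketch below. It uses only Python's standard library, and `KEEP`, `HeadingMarker` and `mark_headings` are hypothetical names for this example, not part of Prodigy or spaCy. Each kept heading ends up on its own line, prefixed with its tag name.

```python
from html.parser import HTMLParser

# Assumption: only heading tags carry a signal worth keeping
KEEP = {"h1", "h2", "h3", "h4", "h5", "h6"}
# Assumption: these tags should end the current line of output
LINE_BREAKING = KEEP | {"p", "div", "li"}


class HeadingMarker(HTMLParser):
    """Drops all markup but prefixes kept headings with their tag name."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in KEEP:
            # Start a new line and emit the tag name as a plain word
            self.parts.append(f"\n{tag} ")

    def handle_endtag(self, tag):
        if tag in LINE_BREAKING:
            self.parts.append("\n")

    def handle_data(self, data):
        self.parts.append(data)


def mark_headings(html: str) -> str:
    parser = HeadingMarker()
    parser.feed(html)
    parser.close()
    lines = "".join(parser.parts).split("\n")
    # Normalise whitespace per line and drop empty lines
    cleaned = [" ".join(line.split()) for line in lines]
    return "\n".join(line for line in cleaned if line)


print(mark_headings("<h4>Some subheadline</h4><p>Body <b>text</b>.</p>"))
```

Note that the tag name becomes an ordinary token in the text, so the caveat about the model's narrow context window still applies.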

It sounds like what you're really looking for is a way to include information about the formatting as features in your model. But this needs some experimentation and a custom implementation – and you probably want to strip this information out of the text and attach it to the tokens, instead of including the markup directly.
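To make the "attach it to the tokens" part more concrete, here is a minimal sketch of the extraction step only. It splits on whitespace instead of using a real tokenizer, and `TokenFormatCollector`, `tokens_with_formatting` and `VOID_TAGS` are hypothetical names for this example. How the resulting tag information is then fed to a model as features is the part that needs experimentation and a custom implementation, and it is not shown here.

```python
from html.parser import HTMLParser

# Void elements never get a closing tag, so they must not go on the stack
VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}


class TokenFormatCollector(HTMLParser):
    """Records, for every word, the stack of tags that enclose it."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.stack = []
        self.tokens = []

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag; ignore stray closing tags
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        # Assumption: naive whitespace tokenization, for illustration only
        for word in data.split():
            self.tokens.append({"text": word, "tags": tuple(self.stack)})


def tokens_with_formatting(html: str) -> list:
    parser = TokenFormatCollector()
    parser.feed(html)
    parser.close()
    return parser.tokens


tokens = tokens_with_formatting(
    "<h1>Big news</h1><p>Visit <strong>Berlin</strong> today</p>"
)
for token in tokens:
    print(token["text"], token["tags"])

# The clean text to annotate no longer contains any markup
print(" ".join(token["text"] for token in tokens))
```

With this split, the annotated text stays clean and the offsets point at real words, while the formatting lives alongside each token as metadata.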
