Best practice to persist info from HTML attributes

nix411 · December 12, 2018, 2:05pm

I want to use spaCy and Prodigy for fact extraction. One subproblem is identifying subheadlines in my HTML documents.

I will probably apply a mix of rule-based and statistical approaches for this. Sometimes the headlines are bold, sometimes not, etc. My question is: what is the best way to transfer the HTML attributes to the Doc? I would like a clean Doc with no HTML, except as metadata attached to Tokens or Spans.

honnibal · December 12, 2018, 3:43pm

I think extension attributes are probably what you’re looking for: https://spacy.io/usage/processing-pipelines#section-custom-components-attributes
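As a rough sketch of how that could look for the bold case: strip the markup with Python's standard-library `html.parser`, remember which character ranges of the clean text were bold, and then copy that onto tokens through a custom attribute. The names `BoldSpanExtractor`, `strip_html`, `to_doc` and the attribute name `is_bold` are made up for illustration, and treating only `<b>` and `<strong>` as bold is an assumption (CSS-based bold would need extra handling).

```python
from html.parser import HTMLParser

BOLD_TAGS = {"b", "strong"}  # assumption: bold is expressed with these tags only


class BoldSpanExtractor(HTMLParser):
    """Collect plain text plus the character ranges that sat inside bold tags."""

    def __init__(self):
        super().__init__()
        self.parts = []       # text fragments in document order
        self.length = 0       # running length of the clean text
        self.bold_depth = 0   # > 0 while inside <b> or <strong>
        self.bold_spans = []  # (start_char, end_char) offsets into the clean text

    def handle_starttag(self, tag, attrs):
        if tag in BOLD_TAGS:
            self.bold_depth += 1

    def handle_endtag(self, tag):
        if tag in BOLD_TAGS and self.bold_depth > 0:
            self.bold_depth -= 1

    def handle_data(self, data):
        if self.bold_depth > 0 and data.strip():
            self.bold_spans.append((self.length, self.length + len(data)))
        self.parts.append(data)
        self.length += len(data)


def strip_html(html):
    """Return (clean_text, bold_spans) for an HTML string."""
    parser = BoldSpanExtractor()
    parser.feed(html)
    parser.close()
    return "".join(parser.parts), parser.bold_spans


def to_doc(text, bold_spans):
    """Attach the bold ranges to tokens via a custom attribute (needs spaCy)."""
    import spacy
    from spacy.tokens import Token

    if not Token.has_extension("is_bold"):
        Token.set_extension("is_bold", default=False)
    nlp = spacy.blank("en")  # tokenizer only, no trained model required
    doc = nlp(text)
    for start, end in bold_spans:
        span = doc.char_span(start, end)
        if span is None:
            # Offsets do not line up with token boundaries (for example
            # stray whitespace inside the bold tag); skipped in this sketch.
            continue
        for token in span:
            token._.is_bold = True
    return doc


text, bold_spans = strip_html("<p><b>Revenue</b> grew in 2018.</p>")
```

Calling `to_doc(text, bold_spans)` then gives a Doc with no markup in it, where `token._.is_bold` can be read from rule-based matchers or custom pipeline components.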

Note that out of the box, these extension attributes won’t be used as features in the model. We don’t really have a good strategy for incorporating arbitrary features into the default models at the moment, but if you really need the information in the model, I could offer some hacky suggestions.

nix411 · December 12, 2018, 5:02pm

I’m listening for hacky suggestions, thank you. I don’t think I can train a good model without using additional features.

My overall challenge is outlined here. For training, I am considering chunking the HTML content so that each chunk becomes one JSON line. For the final model, though, I will keep the whole document in a single Doc. Does that sound reasonable?
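To make the chunking idea concrete, here is a minimal sketch of what I have in mind, using only the standard library. It assumes each block-level element (`p`, `h1` to `h6`, `li`) becomes one record, and that the records follow the `{"text": ..., "meta": {...}}` layout that Prodigy reads from JSONL. The class name `BlockChunker`, the function `html_to_jsonl`, and the `tag` and `has_bold` meta keys are my own invention for illustration.

```python
import json
from html.parser import HTMLParser

BLOCK_TAGS = {"p", "h1", "h2", "h3", "h4", "h5", "h6", "li"}  # assumed chunk boundaries
BOLD_TAGS = {"b", "strong"}


class BlockChunker(HTMLParser):
    """Turn each block-level element into one {"text", "meta"} record."""

    def __init__(self):
        super().__init__()
        self.records = []
        self.current_tag = None  # block element currently being collected
        self.text_parts = []
        self.bold_chars = 0      # non-whitespace-trimmed characters seen in bold
        self.bold_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in BLOCK_TAGS:
            self._flush()
            self.current_tag = tag
        elif tag in BOLD_TAGS:
            self.bold_depth += 1

    def handle_endtag(self, tag):
        if tag in BLOCK_TAGS:
            self._flush()
        elif tag in BOLD_TAGS and self.bold_depth > 0:
            self.bold_depth -= 1

    def handle_data(self, data):
        if self.current_tag is None:
            return  # text outside the assumed block tags is ignored in this sketch
        self.text_parts.append(data)
        if self.bold_depth > 0:
            self.bold_chars += len(data.strip())

    def _flush(self):
        text = "".join(self.text_parts).strip()
        if text:
            self.records.append({
                "text": text,
                "meta": {"tag": self.current_tag, "has_bold": self.bold_chars > 0},
            })
        self.current_tag = None
        self.text_parts = []
        self.bold_chars = 0


def html_to_jsonl(html):
    """Return one JSON object per line, one line per block element."""
    chunker = BlockChunker()
    chunker.feed(html)
    chunker.close()
    return "\n".join(json.dumps(record) for record in chunker.records)


jsonl = html_to_jsonl("<h2>Results</h2><p><b>Revenue</b> grew.</p>")
```

The markup never ends up in `text`; whatever formatting information I want to keep travels in `meta`, so it is still visible while annotating.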
