Best practice to persist info from HTML attributes

I want to use spaCy and Prodigy for fact extraction. One subproblem is identifying subheadlines in my HTML documents.

I will probably apply a mix of rule-based and statistical approaches for this. Sometimes the headlines are bold, sometimes not, and so on. My question is: what is the best way to transfer the HTML attributes to the Doc? I would like a clean Doc with no HTML, except as metadata attached to Token or Span objects.

I think extension attributes are probably what you’re looking for: https://spacy.io/usage/processing-pipelines#section-custom-components-attributes
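As a rough illustration, here is a minimal sketch of how that could look for the bold case. It strips the markup with the standard library's `html.parser`, records which character ranges sat inside `<b>` or `<strong>`, and then copies that onto each token through a custom attribute. `Token.set_extension` and the `._` namespace are spaCy's extension API; the attribute name `is_bold` and the helpers `BoldTextExtractor` and `html_to_doc` are made up for this example. It assumes bold is expressed with tags rather than CSS, and it uses a blank English pipeline so no trained model is needed.

```python
from html.parser import HTMLParser

import spacy
from spacy.tokens import Token

# Hypothetical attribute name; force=True makes re-running the script safe.
Token.set_extension("is_bold", default=False, force=True)


class BoldTextExtractor(HTMLParser):
    """Collect plain text plus the character ranges that were inside <b>/<strong>."""

    def __init__(self):
        super().__init__()
        self.parts = []        # text fragments in document order
        self.length = 0        # length of the text collected so far
        self.bold_depth = 0    # > 0 while inside a bold tag
        self.bold_ranges = []  # (start_char, end_char) pairs in the clean text

    def handle_starttag(self, tag, attrs):
        if tag in ("b", "strong"):
            self.bold_depth += 1

    def handle_endtag(self, tag):
        if tag in ("b", "strong") and self.bold_depth > 0:
            self.bold_depth -= 1

    def handle_data(self, data):
        if self.bold_depth > 0:
            self.bold_ranges.append((self.length, self.length + len(data)))
        self.parts.append(data)
        self.length += len(data)


def html_to_doc(nlp, html):
    """Return a Doc built from the markup-free text, with token._.is_bold set."""
    parser = BoldTextExtractor()
    parser.feed(html)
    parser.close()
    doc = nlp("".join(parser.parts))
    for token in doc:
        start, end = token.idx, token.idx + len(token)
        # A token counts as bold if it lies entirely inside one bold range.
        token._.is_bold = any(s <= start and end <= e for s, e in parser.bold_ranges)
    return doc


nlp = spacy.blank("en")
doc = html_to_doc(nlp, "<p><b>Quarterly results</b></p> <p>Revenue rose.</p>")
bold_tokens = [token.text for token in doc if token._.is_bold]
```

The Doc text itself contains no markup, and rule-based code can read `token._.is_bold` directly. The same pattern should extend to other attributes (font size, heading level) by recording more ranges in the parser.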

Note that out of the box, these extension attributes won’t be used as features in the model. We don’t really have a good strategy for incorporating arbitrary features into the default models at the moment, but if you really need the information in the model, I could offer some hacky suggestions.

I’d be glad to hear the hacky suggestions, thank you. I don’t think I can train a good model without using additional features.

My overall challenge is outlined here. For training, I am considering chunking the HTML content of each document into JSON lines. For the final model, though, I will keep each whole document in a single Doc. Does that sound reasonable?
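To make the chunking idea concrete, here is a minimal sketch of one way it could work, using only the Python standard library. Each block-level element becomes one JSON line with a `"text"` key and a `"meta"` dict, which is the general shape Prodigy accepts for text tasks. The meta keys `tag` and `all_bold`, the `BLOCK_TAGS` set and the helper names are assumptions made for this example, not anything prescribed by spaCy or Prodigy.

```python
import json
from html.parser import HTMLParser

# Assumed set of tags that start a new chunk; adjust to the documents at hand.
BLOCK_TAGS = {"p", "h1", "h2", "h3", "h4", "h5", "h6", "li", "div"}


class BlockChunker(HTMLParser):
    """Split HTML into block-level chunks, keeping the tag and bold status."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self.bold_depth = 0
        self._start_chunk(None)

    def _start_chunk(self, tag):
        self.tag = tag
        self.parts = []
        self.total_chars = 0  # non-whitespace characters in this chunk
        self.bold_chars = 0   # of those, how many were inside <b>/<strong>

    def flush(self):
        """Store the current chunk if it contains any visible text."""
        text = "".join(self.parts).strip()
        if text:
            meta = {"tag": self.tag, "all_bold": self.bold_chars == self.total_chars}
            self.chunks.append({"text": text, "meta": meta})

    def handle_starttag(self, tag, attrs):
        if tag in BLOCK_TAGS:
            self.flush()
            self._start_chunk(tag)
        elif tag in ("b", "strong"):
            self.bold_depth += 1

    def handle_endtag(self, tag):
        if tag in BLOCK_TAGS:
            self.flush()
            self._start_chunk(None)
        elif tag in ("b", "strong") and self.bold_depth > 0:
            self.bold_depth -= 1

    def handle_data(self, data):
        visible = len("".join(data.split()))
        self.total_chars += visible
        if self.bold_depth > 0:
            self.bold_chars += visible
        self.parts.append(data)


def html_to_jsonl(html):
    """Return one JSON object per line, one line per non-empty block."""
    chunker = BlockChunker()
    chunker.feed(html)
    chunker.close()
    chunker.flush()  # keep any trailing text that follows the last block
    return "\n".join(json.dumps(chunk) for chunk in chunker.chunks)


html = "<h2>Overview</h2><p><b>Quarterly results</b></p><p>Revenue rose by <b>5%</b>.</p>"
jsonl = html_to_jsonl(html)
```

With this layout, a fully bold paragraph such as "Quarterly results" ends up as its own line with `all_bold` set, so candidate subheadlines can be filtered or annotated separately from body text.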