Best practice to persist info from HTML attributes

I’m listening for hacky suggestions, thank you. I don’t think I can train a good model without using additional features.

My overall challenge are outlined here. For the training part I am considering chunking every html content into a json line. But for the final model I will keep the whole document into Doc. Does that sound reasonable?