I've written a parser that segments HTML into a list of paragraphs (wherever a new line appears on the rendered HTML page, you get a new paragraph). Is that something you could use? However, note what Matthew Honnibal said
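To give a rough idea of what that kind of segmentation can look like, here is a minimal sketch using only Python's standard `html.parser`. It is not the actual parser mentioned above. The names `ParagraphSegmenter`, `html_to_paragraphs`, `BLOCK_TAGS` and `SKIP_TAGS` are made up for illustration, and the list of tags that count as "starting a new line" is an assumption you would probably need to tune for your pages.

```python
from html.parser import HTMLParser

# Assumption: these tags are treated as producing a line break on the page.
BLOCK_TAGS = {
    "p", "div", "br", "li", "tr", "blockquote", "pre", "section", "article",
    "h1", "h2", "h3", "h4", "h5", "h6",
}
# Assumption: the content of these tags is never visible text.
SKIP_TAGS = {"script", "style"}


class ParagraphSegmenter(HTMLParser):
    """Collects visible text and cuts a new paragraph at every block-level tag."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.paragraphs = []
        self._buffer = []
        self._skip_depth = 0

    def flush(self):
        # Join the buffered text, collapse whitespace, and keep it if non-empty.
        text = " ".join("".join(self._buffer).split())
        if text:
            self.paragraphs.append(text)
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self._skip_depth += 1
        elif tag in BLOCK_TAGS:
            self.flush()

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS:
            self._skip_depth = max(0, self._skip_depth - 1)
        elif tag in BLOCK_TAGS:
            self.flush()

    def handle_data(self, data):
        # Ignore text inside <script> and <style>.
        if self._skip_depth == 0:
            self._buffer.append(data)


def html_to_paragraphs(html):
    """Return the visible text of an HTML string as a list of paragraphs."""
    parser = ParagraphSegmenter()
    parser.feed(html)
    parser.close()   # pushes any text still buffered inside HTMLParser
    parser.flush()   # emits the final paragraph, if any
    return parser.paragraphs


if __name__ == "__main__":
    sample = "<h1>Title</h1><p>First <b>bold</b> line.<br>Second line.</p>"
    for paragraph in html_to_paragraphs(sample):
        print(paragraph)
```

Inline tags such as `<b>` or `<a>` do not split a paragraph here, so the text stays together the way it appears on the page. Each resulting string could then be passed to a spaCy pipeline separately, though whether that fits your use case depends on which non-text information you need to keep.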
I have this exact issue. I actually want to use some of the non-text aspects as features, but I don't know how to achieve that within the spaCy framework. I've laid out my challenge in another thread.