Combining Document Layout Analysis with NLP


Imagine I have an HTML report where many of the paragraphs, tables and footnotes are just noise.

I am thinking of a pipeline where I first run document layout analysis and then parse the phrases of interest with spaCy.

Have you had any experience combining these methods in spaCy and using Prodigy to annotate data for both pipelines?

I haven’t personally built a system that did that, but the idea definitely makes sense to me. Document layout always varies in ways that are specific to the text you’re dealing with, so you’ll benefit from doing some custom work to clean your data and exploit its regularities. You might want to customise the Prodigy recipe to accommodate this. You can find custom recipe templates in this repo, if you haven’t seen them already:
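Separately from those templates, here is a rough sketch of what the clean-up step could look like, just to make the idea concrete. It uses only Python's standard-library `html.parser` and hand-written rules rather than a trained layout model, and everything specific in it is an assumption: the `KEEP_TAGS`, `NOISE_TAGS` and `NOISE_CLASSES` sets, the `BlockExtractor` class and the `to_tasks` helper are hypothetical names, and your reports will almost certainly need different rules. The output is a stream of dicts with a `"text"` key, which is the shape Prodigy expects for annotation tasks.

```python
from html.parser import HTMLParser

# Hypothetical rules: adjust these to the regularities of your own reports.
KEEP_TAGS = {"p", "li"}                                   # blocks worth parsing
NOISE_TAGS = {"table", "nav", "footer", "aside", "script", "style"}
NOISE_CLASSES = {"footnote"}                              # e.g. <p class="footnote">


class BlockExtractor(HTMLParser):
    """Collect the text of kept blocks, skipping anything inside a noise tag."""

    def __init__(self):
        super().__init__()
        self.blocks = []
        self._noise_depth = 0      # > 0 while inside any noise element
        self._current_tag = None   # tag of the block being collected, if any
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag in NOISE_TAGS:
            self._noise_depth += 1
        elif tag in KEEP_TAGS and self._noise_depth == 0:
            classes = set((dict(attrs).get("class") or "").split())
            if not classes & NOISE_CLASSES:
                self._current_tag = tag
                self._buffer = []

    def handle_endtag(self, tag):
        if tag in NOISE_TAGS and self._noise_depth > 0:
            self._noise_depth -= 1
        elif tag == self._current_tag:
            # Collapse runs of whitespace left over from the markup.
            text = " ".join("".join(self._buffer).split())
            if text:
                self.blocks.append({"tag": tag, "text": text})
            self._current_tag = None

    def handle_data(self, data):
        if self._current_tag is not None and self._noise_depth == 0:
            self._buffer.append(data)


def to_tasks(html):
    """Yield one task dict per kept block, with its origin stored in "meta"."""
    parser = BlockExtractor()
    parser.feed(html)
    parser.close()
    for i, block in enumerate(parser.blocks):
        yield {"text": block["text"], "meta": {"tag": block["tag"], "block": i}}
```

From there, a custom recipe could return `to_tasks(html)` as its stream, and the same texts could be run through spaCy with something like `nlp.pipe(task["text"] for task in to_tasks(html))`. Keeping the tag and block index in `"meta"` means you can see where each example came from while annotating, which helps when deciding whether a filtering rule is too aggressive.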