I haven’t personally built a system that does that, but the idea definitely makes sense to me. Document layout always varies in ways that are specific to the text you’re dealing with, so you’ll benefit from doing some custom work to clean your data and exploit its regularities. You might want to customise the Prodigy recipe to accommodate this. You can find custom recipe templates in this repo, if you haven’t seen them already: https://github.com/explosion/prodigy-recipes
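
To make the "custom cleaning" idea a bit more concrete, here is a minimal sketch in plain Python of the kind of preprocessing I mean. It is only an illustration under assumptions: the function name `clean_pages`, the `min_repeats` threshold, and the specific layout rules (repeated header lines, bare page numbers, words hyphenated across line breaks) are all hypothetical, and your documents will likely need different rules. It yields dicts with a `"text"` key, which as far as I know is the shape Prodigy expects for text tasks, so a generator like this could feed the stream of a custom recipe.

```python
import re
from collections import Counter


def clean_pages(pages, min_repeats=3):
    """Yield {"text": ...} dicts from a list of raw page strings.

    Hypothetical cleaning rules (adjust to your own documents):
    - drop lines that repeat on at least `min_repeats` pages (headers/footers)
    - drop lines that contain only digits (page numbers)
    - rejoin words hyphenated across a line break
    `pages` must be a list (it is iterated twice).
    """
    # Count on how many pages each non-empty stripped line appears.
    counts = Counter()
    for page in pages:
        counts.update({line.strip() for line in page.splitlines() if line.strip()})
    repeated = {line for line, n in counts.items() if n >= min_repeats}

    for page in pages:
        kept = []
        for line in page.splitlines():
            stripped = line.strip()
            # Skip blank lines, repeated header/footer lines, and page numbers.
            if not stripped or stripped in repeated or re.fullmatch(r"\d+", stripped):
                continue
            kept.append(stripped)
        text = "\n".join(kept)
        # Rejoin words split by a hyphen at the end of a line.
        text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)
        # Collapse the remaining line breaks into single spaces.
        text = re.sub(r"\s*\n\s*", " ", text)
        if text:
            yield {"text": text}
```

You would call it as `list(clean_pages(pages))` on your extracted pages. Note that the de-hyphenation rule will also merge genuinely hyphenated words that happen to break at a line end, so it is worth spot-checking the output on your own data before annotating.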