Combining Document Layout Analysis with NLP

nix411 · February 21, 2019, 4:51pm

Hi,

Imagine I have an HTML report where a lot of the paragraphs, tables and footnotes are just noise.

I am thinking of a pipeline where I first do a document layout analysis and then parse the phrases of interest with spaCy.

Have you had any experience combining these methods in spaCy and using Prodigy to annotate data for both pipelines?

honnibal · February 26, 2019, 10:09am

I haven’t personally built a system which did that, but the idea definitely makes good sense to me. Document layout always varies in ways that are specific to the text you’re dealing with, so you’ll benefit from doing some custom work to clean your data, and exploit the regularities. You might want to customise the Prodigy recipe to accommodate this. You can find custom recipe templates in this repo, if you haven’t seen them already: https://github.com/explosion/prodigy-recipes
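To make the cleaning step concrete, here is a minimal sketch of a layout filter that runs before any NLP. It uses only Python's standard-library `html.parser` and assumes that the noise lives in `table`, `footer`, `nav`, `script` and `style` elements. That tag list, the `ParagraphExtractor` class and the `extract_paragraphs` helper are all hypothetical names chosen for illustration; a real report will need rules specific to its own markup.

```python
from html.parser import HTMLParser

# Assumption: these tags mark noise in the report. Adjust per document type.
NOISE_TAGS = {"table", "footer", "nav", "script", "style"}


class ParagraphExtractor(HTMLParser):
    """Collects the text of <p> elements that are not nested in noise tags."""

    def __init__(self):
        super().__init__()
        self.noise_depth = 0      # how many noise elements we are inside
        self.in_paragraph = False
        self.buffer = []          # text fragments of the current paragraph
        self.paragraphs = []      # cleaned paragraphs, in document order

    def handle_starttag(self, tag, attrs):
        if tag in NOISE_TAGS:
            self.noise_depth += 1
        elif tag == "p" and self.noise_depth == 0:
            self.in_paragraph = True
            self.buffer = []

    def handle_endtag(self, tag):
        if tag in NOISE_TAGS and self.noise_depth > 0:
            self.noise_depth -= 1
        elif tag == "p" and self.in_paragraph:
            # Join fragments and collapse runs of whitespace.
            text = " ".join("".join(self.buffer).split())
            if text:
                self.paragraphs.append(text)
            self.in_paragraph = False

    def handle_data(self, data):
        if self.in_paragraph and self.noise_depth == 0:
            self.buffer.append(data)


def extract_paragraphs(html):
    """Return the text of each non-noise paragraph in the HTML string."""
    parser = ParagraphExtractor()
    parser.feed(html)
    parser.close()
    return parser.paragraphs


sample = (
    "<html><body>"
    "<p>Revenue rose 4% in Q3.</p>"
    "<table><tr><td><p>cell noise</p></td></tr></table>"
    "<footer><p>Page 1</p></footer>"
    "<p>Costs were   flat.</p>"
    "</body></html>"
)
print(extract_paragraphs(sample))
```

The resulting list of strings is what you would then hand to spaCy (for example via `nlp.pipe`) or stream into a custom Prodigy recipe, so both the layout rules and the NLP model only ever see the paragraphs of interest.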
