I am working with PDFs. I know how to extract text from PDFs, train an NER or span-categorizer model, and use it to extract entities or spans.
I now want to train a model to recognise areas of my PDF for downstream processing. The downstream processing will vary.
I know how to annotate a PDF. I have gone through the ljvmiranda approach.
For me, that worked reasonably well for finding words but did not identify areas at all (even though his example does show areas being found).
I have also used pdf.ocr.correct, so once I have found an area I can OCR the text.
I am struggling with the workflow to train a model to identify different areas of a PDF, e.g. Header, Footer, Table, and then identify them on new PDFs.
Using the ljvmiranda approach, the bounding boxes were all tight around the words: even when I annotated only biggish areas for header, footer and table, the model still only identified individual words. (I realise one option could be a different HF model.)
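As a stop-gap while a model only predicts word-level boxes, one option is to merge nearby word boxes into larger regions in plain Python. This is a minimal sketch, not part of any Prodigy recipe; the `(x0, y0, x1, y1)` box format and the `gap` threshold are my assumptions and would need tuning for your pages:

```python
def boxes_touch(a, b, gap=10):
    """True if two (x0, y0, x1, y1) boxes overlap once grown by `gap` pixels."""
    ax0, ay0, ax1, ay1 = a
    bx0, by0, bx1, by1 = b
    return not (ax1 + gap < bx0 or bx1 + gap < ax0 or
                ay1 + gap < by0 or by1 + gap < ay0)

def merge_word_boxes(boxes, gap=10):
    """Greedily merge word-level boxes into larger region boxes."""
    regions = []
    for box in boxes:
        for i, r in enumerate(regions):
            if boxes_touch(box, r, gap):
                regions[i] = (min(r[0], box[0]), min(r[1], box[1]),
                              max(r[2], box[2]), max(r[3], box[3]))
                break
        else:
            regions.append(box)
    # Repeat until stable, since one merge can make two regions touch
    changed = True
    while changed:
        changed = False
        for i in range(len(regions)):
            for j in range(i + 1, len(regions)):
                if boxes_touch(regions[i], regions[j], gap):
                    r, s = regions[i], regions[j]
                    regions[i] = (min(r[0], s[0]), min(r[1], s[1]),
                                  max(r[2], s[2]), max(r[3], s[3]))
                    del regions[j]
                    changed = True
                    break
            if changed:
                break
    return regions
```

The merged regions are unlabelled, so this only replaces the "find the area" step, not the "classify the area" step.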
I am wondering whether, before ljvmiranda created this, there was a more "vanilla" workflow for PDFs, rather like finding paragraphs or figures as shown in https://www.youtube.com/watch?v=rwyze49ne8I, but before doing the OCR.
Once I have annotations for areas of a PDF, I am looking for a workflow which will:
A. train the model. The NER equivalent is:

```
prodigy train ./myNERmodel --ner dataset_name_ner
```
B. get the model to predict those areas on a new PDF. The NER equivalent is something like:

```python
import spacy

# First extract the text from the new PDF, then:
nlp = spacy.load("./myNERmodel")
doc = nlp(text_from_pdf)
for ent in doc.ents:
    print(ent.text, ent.label_)
```
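For areas, the annotations look different from NER data. If the areas were drawn with Prodigy's image.manual recipe, each task's "spans" hold a label and polygon "points"; a small converter can turn those into rectangles for whatever detection model you end up training. A sketch under that assumption (the `(image, label, bbox)` row shape is my own choice, not a Prodigy format):

```python
import json

def span_to_bbox(span):
    """Convert an image.manual polygon span to an (x0, y0, x1, y1) box."""
    xs = [p[0] for p in span["points"]]
    ys = [p[1] for p in span["points"]]
    return (min(xs), min(ys), max(xs), max(ys))

def load_area_annotations(jsonl_lines):
    """Parse Prodigy image-annotation JSONL lines into (image, label, bbox) rows."""
    rows = []
    for line in jsonl_lines:
        task = json.loads(line)
        for span in task.get("spans", []):
            rows.append((task["image"], span["label"], span_to_bbox(span)))
    return rows
```

These rows could then feed a layout-detection trainer of your choosing; Prodigy's `prodigy train` itself only covers spaCy components, which is part of why the area step feels like a gap.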
I am hoping that this would be part of a pipeline whereby we:
- have a model to identify the relevant areas of the PDF, e.g. head, foot, table;
- route each area onwards: the head goes to a model that processes heads, the foot goes to a different model that processes foots, etc.
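The routing step of that pipeline can stay as plain Python: once an area model returns labelled regions, dispatch each one to a label-specific handler. A minimal sketch, where the handler names and the region dict format are placeholders of mine rather than any library's API:

```python
def process_header(region):
    # Placeholder: a real handler would OCR and parse the header region
    return f"header: {region['bbox']}"

def process_footer(region):
    return f"footer: {region['bbox']}"

def process_table(region):
    return f"table: {region['bbox']}"

HANDLERS = {
    "Header": process_header,
    "Footer": process_footer,
    "Table": process_table,
}

def route_regions(regions):
    """Send each predicted region to the handler registered for its label."""
    results = []
    for region in regions:
        handler = HANDLERS.get(region["label"])
        if handler is not None:
            results.append(handler(region))
    return results
```

Keeping the dispatch table separate from the area model means new downstream processors can be added without retraining anything.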
This is a great forum, and I really appreciate the answers and code snippets you provide.