Project validation

Wassupkenny · April 18, 2022, 8:25am

I have some insurance files in PDF format.

I want to leverage AWS Comprehend.

I cannot annotate PDFs in Prodigy since PDF is not supported. I need to convert them to HTML.

Once I do this, I want to use Prodigy with the NER and relations workflow.

Once the annotation is done, I intend to export the output as CSV because that’s what AWS Comprehend supports.

My questions:

Does what I described above make sense, or is there a better way to achieve it?
If my PDF has, let’s say, four pages, how many HTML files should I create: 1 or 4?
Any guidance on how to use these annotation outputs in spaCy?
Any suggestions for converting PDFs to HTML?

Thanks

ljvmiranda921 · April 19, 2022, 11:05am

Converting them to HTML may be a good idea because you get structured text right off the bat. You can also consider converting the PDFs into images and then using something like Amazon Textract (since you mentioned AWS) to extract the text.
Usually it is fine to have 1 HTML (or other intermediate) file per document, but if you need information such as which page a piece of text belongs to, or whether a passage was split across two pages, then it might be better to split it into 4.
I am not sure what the insurance files contain or what your actual task is. I'm going to assume that this is NER because you mentioned AWS Comprehend. Ideally, in spaCy, you'd get decent results if you pass in the text itself (remove all the HTML tags, extraneous information, etc.).
Your mileage may vary, but personally I find tools such as doctr and layoutparser interesting. Some of them don't convert directly to HTML, but to a more structured format (like JSON).
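
To make the "pass the text itself" and "one file per page" points a bit more concrete, here is a rough sketch using only the Python standard library. It assumes you already have one HTML string per page from whatever converter you pick. The names `TextExtractor`, `html_to_text`, `pages_to_jsonl` and the `policy-001` document id are made up for illustration. The output is one JSON line per page with `text` and `meta` keys, which is the kind of JSONL that Prodigy can load, and the page number is kept in `meta` so you can trace an annotation back to its page.

```python
import json
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Hypothetical helper: collects visible text, skipping <script>/<style>."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self._chunks.append(data)

    def text(self):
        # Join chunks with a space and collapse whitespace left by the markup.
        # Caveat: a word split by inline tags (e.g. <b>J</b>ane) gets a space.
        return " ".join(" ".join(self._chunks).split())


def html_to_text(html):
    """Strip tags from one HTML string and return plain text."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()


def pages_to_jsonl(doc_id, pages):
    """Build one JSON line per page, recording the page number in "meta"."""
    lines = []
    for number, html in enumerate(pages, start=1):
        record = {
            "text": html_to_text(html),
            "meta": {"doc": doc_id, "page": number},
        }
        lines.append(json.dumps(record))
    return "\n".join(lines)


if __name__ == "__main__":
    # Placeholder pages standing in for the converter output.
    pages = [
        "<html><body><h1>Policy</h1><p>Holder: <b>Jane Doe</b></p></body></html>",
        "<html><body><p>Coverage starts in May.</p></body></html>",
    ]
    print(pages_to_jsonl("policy-001", pages))
```

If you don't care about page boundaries, you could join the per-page texts into a single record instead. Real converter output is usually messier than this (headers, footers, tables), so treat the tag stripping as a starting point only.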
