Project validation

I have some insurance files in PDF format.

I want to leverage AWS Comprehend.

I cannot annotate PDFs in Prodigy since PDF is not supported. I need to convert them to HTML.

Once I do this, I want to use Prodigy to annotate entities and relations (NER plus relations).

Once done with annotation, I intend to export the output as CSV because that’s what AWS supports.

My questions:

  1. Does what I described above make sense, or is there a better way to achieve it?

  2. If my pdf has, let’s say, four pages, how many HTML files should I create? 1 or 4?

  3. Any guidance on how to use these annotation outputs in spaCy?

  4. Suggestions for converting PDFs to HTML.


Hi @Wassupkenny ,

  1. Converting them to HTML may be a good idea because you get structured text right off the bat. You can also consider converting the PDFs into images, and then using something like Amazon Textract (since you mentioned AWS) to do the job.
  2. Usually it is fine to have 1 HTML / intermediate file per document, but if you need information such as which page a piece of text belongs to, or whether a passage was split across two pages, then it might be better to split it into 4.
  3. I am not sure what the insurance files contain and what your actual task is. I'm going to assume that this is NER because you mentioned AWS Comprehend. Ideally, in spaCy, you'd get decent results if you pass the text itself (remove all the HTML tags, extraneous information, etc.).
  4. Your mileage may vary, but personally I find tools such as doctr and layoutparser interesting. Some of them don't convert directly to HTML, but to a more structured format (like JSON).
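
To illustrate point 3, here is a minimal sketch of stripping HTML down to plain text before handing it to an NER pipeline. It uses only Python's standard library; the names `TextExtractor` and `html_to_text` are made up for this example, and the sample markup is invented rather than taken from a real insurance file. Treat it as a starting point and not as the recommended way to clean your documents.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects text nodes, skipping the contents of <script> and <style>."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep text only when we are not inside a skipped element.
        if self._skip_depth == 0:
            self.parts.append(data)


def html_to_text(html):
    """Return the text of an HTML string with whitespace collapsed."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()  # flush any buffered text
    # Join fragments with a space, then collapse runs of whitespace.
    return " ".join(" ".join(parser.parts).split())


if __name__ == "__main__":
    sample = (
        "<html><head><style>p {color: red}</style></head>"
        "<body><p>Policy number: <b>12345</b></p>"
        "<p>Holder: Jane Doe</p></body></html>"
    )
    print(html_to_text(sample))
```

The resulting string is what you would pass to a spaCy pipeline, for example `nlp(html_to_text(page_html))`, assuming you have loaded a pipeline as `nlp`. One caveat: because fragments are joined with a space, a word that is split by an inline tag in the middle (such as `un<b>der</b>`) would come out as two tokens, so check the output on a few of your own files.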