I have some insurance files in PDF format.
I want to leverage AWS Comprehend.
Prodigy does not support PDF input, so I cannot annotate the files directly and need to convert them to HTML first.
Once that is done, I want to annotate them in Prodigy using its NER and relation annotation workflow.
Once annotation is done, I intend to export the output as CSV, because that is the format AWS Comprehend supports.
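To make the export step concrete, here is a minimal sketch of what I have in mind. It assumes the annotations come out of Prodigy as JSONL records with `text`, `spans` (`start`, `end`, `label`) and `answer` keys, and that the target CSV uses the columns `File`, `Line`, `Begin Offset`, `End Offset`, `Type`. That column layout is my assumption and should be checked against the Comprehend documentation. The helper names and the file name `documents.txt` are hypothetical.

```python
import csv
import io

# Assumed column layout for an entity-annotations CSV; verify it against the
# Amazon Comprehend documentation before relying on it.
HEADER = ["File", "Line", "Begin Offset", "End Offset", "Type"]


def prodigy_to_rows(examples, file_name):
    """Turn Prodigy-style examples into annotation rows.

    Assumes example i is stored as line i (0-based) of `file_name`
    and that no example text contains a newline character.
    """
    rows = []
    for line_no, example in enumerate(examples):
        if example.get("answer", "accept") != "accept":
            continue  # skip rejected or ignored examples
        for span in example.get("spans", []):
            rows.append(
                [file_name, line_no, span["start"], span["end"], span["label"]]
            )
    return rows


def rows_to_csv(rows):
    """Serialize the rows, with the header first, into CSV text."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(HEADER)
    writer.writerows(rows)
    return buffer.getvalue()


# Two made-up examples in the shape of Prodigy's JSONL output.
examples = [
    {
        "text": "Policy holder: Jane Doe",
        "spans": [{"start": 15, "end": 23, "label": "PERSON"}],
        "answer": "accept",
    },
    {
        "text": "Premium due 2021-03-01",
        "spans": [{"start": 12, "end": 22, "label": "DATE"}],
        "answer": "accept",
    },
]

csv_text = rows_to_csv(prodigy_to_rows(examples, "documents.txt"))
print(csv_text)
```

The sketch only covers entity spans. Relation annotations would need a separate mapping, and I am not sure the CSV format has a place for them.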
My questions:
- Does what I described above make sense, or is there a better way to achieve it?
- If my PDF has, let’s say, four pages, how many HTML files should I create: 1 or 4?
- Any guidance on how to use these annotation outputs in spaCy?
- Any suggestions for converting PDFs to HTML?
Thanks