I have a lot of documents to recognize: RG, CPF, CNH and others. The first thing that came to mind was to OCR the PDFs and train the AI on the extracted text using NLP, instead of computer vision or something like that.
Is it better to use NLP or computer vision for this task?
I saw that Prodigy has the image.manual recipe. Does it work for that?
And if the best approach is image classification/computer vision, how many images of those documents will I need?
Hi! This is really difficult to answer and depends on your exact problem, documents etc. I've seen some use cases where framing the problem as a computer vision task and doing OCR as the last step worked better, since a lot of the important clues were in the visual formatting (e.g. invoice parsing).
However, for other cases, there's a big advantage in having the raw text and analysing it at the token level. It also lets you use transfer learning and pretrained representations (even just simple word vectors), and take advantage of linguistic features in your information extraction pipeline.
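To illustrate the text-based route: once OCR gives you raw text, even a simple rule-based pass can pull out structured fields before any model is involved. Here's a minimal sketch for CPF numbers, using the standard CPF check-digit rule to filter out OCR noise (the function names and regex are just illustrative, not part of any library):

```python
import re

def cpf_check_digits(digits):
    """Compute the two CPF check digits for the first 9 digits."""
    # Standard CPF rule: weighted sum mod 11; remainder < 2 -> 0, else 11 - remainder.
    def dv(ds):
        total = sum(d * w for d, w in zip(ds, range(len(ds) + 1, 1, -1)))
        r = total % 11
        return 0 if r < 2 else 11 - r
    d1 = dv(digits)
    d2 = dv(digits + [d1])
    return d1, d2

def extract_cpfs(text):
    """Find CPF-shaped strings in OCR output and keep only valid ones."""
    candidates = re.findall(r"\b\d{3}\.\d{3}\.\d{3}-\d{2}\b", text)
    valid = []
    for cand in candidates:
        digits = [int(c) for c in cand if c.isdigit()]
        if tuple(digits[9:]) == cpf_check_digits(digits[:9]):
            valid.append(cand)
    return valid
```

A rule like this can also serve as a pre-annotation step in Prodigy, so you only correct the model's or pattern's suggestions instead of labelling from scratch.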
Maybe experiment with both and see what works best? There's no easy answer for how many examples you'll need, because this depends on many factors. But you should be prepared to create at least a few hundred, if not a lot more. Don't forget the evaluation data – you'll always want enough evaluation data so you can test your approaches in a stable and reliable way.
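On the evaluation point: it helps to set aside a fixed held-out set once, before you start experimenting, and reuse it for every approach so the results stay comparable. A minimal sketch (the 20% fraction and fixed seed are arbitrary assumptions, not a recommendation):

```python
import random

def train_eval_split(examples, eval_fraction=0.2, seed=0):
    """Deterministically shuffle and hold out a fixed evaluation set."""
    rng = random.Random(seed)   # fixed seed -> identical split on every run
    shuffled = list(examples)
    rng.shuffle(shuffled)
    n_eval = int(len(shuffled) * eval_fraction)
    return shuffled[n_eval:], shuffled[:n_eval]
```

The key detail is the fixed seed: whether you end up comparing a text model against an image model, both see exactly the same evaluation examples.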