What is the best way to recognize document types?

Hello! Hope you are doing well!

What is the best way to recognize document types?

I have a lot of documents to recognize: RG, CPF, CNH and others. The first thing that came to mind was to OCR the PDFs and train a model using NLP rather than computer vision.

Is it better to use NLP or computer vision to do that task?

I saw that Prodigy has the image.manual recipe. Would it work for that?

And if the best way is image classification/computer vision, how many images of those documents will I need?

Thanks in advance!

Hi! This is really difficult to answer and depends on your exact problem, documents etc. I've seen some use cases where framing the problem as a computer vision task and doing OCR as the last step worked better, since a lot of important clues were in the visual formatting (e.g. invoice parsing).

However, for other cases, there's a big advantage in having the raw text and analysing it on the token level. It also lets you use transfer learning and pretrained representations (even just simple word vectors) and use linguistic features in your information extraction pipeline.
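To make the text route a bit more concrete, here is a minimal sketch of a token-level baseline over OCR output: a tiny Naive Bayes classifier in plain Python. The sample strings are made-up stand-ins for OCR text (not real documents), and the function names are just for illustration. In practice you'd probably reach for a proper library and pretrained representations, but something like this can be a quick sanity check of whether the text alone carries enough signal.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    # Lowercase and keep runs of word characters; OCR noise is left as-is.
    return re.findall(r"\w+", text.lower())


def train(examples):
    """examples: list of (ocr_text, label) pairs. Returns a simple model dict."""
    token_counts = defaultdict(Counter)
    label_counts = Counter()
    vocab = set()
    for text, label in examples:
        tokens = tokenize(text)
        token_counts[label].update(tokens)
        label_counts[label] += 1
        vocab.update(tokens)
    return {"token_counts": token_counts, "label_counts": label_counts, "vocab": vocab}


def predict(model, text):
    """Return the label with the highest Naive Bayes log-probability."""
    total_docs = sum(model["label_counts"].values())
    vocab_size = len(model["vocab"])
    best_label, best_score = None, -math.inf
    for label, n_docs in model["label_counts"].items():
        counts = model["token_counts"][label]
        total_tokens = sum(counts.values())
        score = math.log(n_docs / total_docs)
        for tok in tokenize(text):
            # Add-one smoothing so unseen tokens don't zero out the score.
            score += math.log((counts[tok] + 1) / (total_tokens + vocab_size))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical OCR snippets, one per document type (assumed, for illustration).
samples = [
    ("REPUBLICA FEDERATIVA DO BRASIL REGISTRO GERAL CARTEIRA DE IDENTIDADE", "RG"),
    ("MINISTERIO DA FAZENDA CADASTRO DE PESSOAS FISICAS CPF", "CPF"),
    ("CARTEIRA NACIONAL DE HABILITACAO DEPARTAMENTO NACIONAL DE TRANSITO", "CNH"),
]
model = train(samples)
print(predict(model, "cadastro de pessoas fisicas"))
```

With real data you'd train on many OCR'd pages per type, and the same `predict` function can then be scored on held-out documents.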

Maybe experiment with both and see what works best? There's no easy answer for how many examples you'll need, because this depends on many factors. But you should be prepared to create at least a few hundred, if not a lot more. Don't forget the evaluation data – you'll always want enough evaluation data so you can test your approaches in a stable and reliable way.
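If it helps, here is one rough way to set aside evaluation data so both approaches are scored on the same held-out documents. It's a sketch in plain Python under some assumptions: the file names and the 20% fraction are hypothetical, and `stratified_split` and `accuracy` are illustrative helpers, not part of any library.

```python
import random
from collections import defaultdict


def stratified_split(examples, eval_fraction=0.2, seed=0):
    """Split (item, label) pairs so each label with 2+ examples lands in both sets."""
    by_label = defaultdict(list)
    for item, label in examples:
        by_label[label].append((item, label))
    rng = random.Random(seed)  # fixed seed keeps the split reproducible
    train, evaluation = [], []
    for group in by_label.values():
        rng.shuffle(group)
        # Hold out at least one example per label, but never the whole group.
        n_eval = max(1, round(len(group) * eval_fraction))
        n_eval = min(n_eval, len(group) - 1)
        evaluation.extend(group[:n_eval])
        train.extend(group[n_eval:])
    return train, evaluation


def accuracy(predict_fn, evaluation):
    """Fraction of held-out items whose predicted label matches the gold label."""
    if not evaluation:
        raise ValueError("evaluation set is empty")
    correct = sum(1 for item, label in evaluation if predict_fn(item) == label)
    return correct / len(evaluation)


# Hypothetical labelled files: 10 RG, 10 CPF and 5 CNH documents.
examples = (
    [(f"rg_{i}.pdf", "RG") for i in range(10)]
    + [(f"cpf_{i}.pdf", "CPF") for i in range(10)]
    + [(f"cnh_{i}.pdf", "CNH") for i in range(5)]
)
train_set, eval_set = stratified_split(examples, eval_fraction=0.2)
print(len(train_set), len(eval_set))
```

Any classifier that maps an item to a label, whether vision-based or text-based, can be passed to `accuracy` as `predict_fn`, so the two approaches are compared on exactly the same documents.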

Okay. Thanks, Ines! I'll try both approaches, computer vision and NLP, and compare the results.