I have around 200 customers, and the datasheet from each customer is different (either a scanned PDF or a text PDF). I am trying to create a pre-trained model for extracting data from these datasheets, but there are challenges: the data is tabular and annotating it is difficult. Are there any known models for this task?
Hi! It's difficult to give good advice here because it really depends on the documents, the types of data you want to extract and the features you want to use in your model. This will also inform how you set up the annotation task in the end. If you're working with regular text, a pipeline of OCR / text extraction plus a regular NLP model, e.g. a text classifier, may work fine. If you're working with PDFs with more complex layouts, framing it as a computer vision task may be a better option.
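Since you mentioned a mix of scanned and text PDFs, here is a rough sketch of how the "OCR / text extraction" step could be routed per page before any model sees the text. This is only an illustration under assumptions: `extract_page_text`, `PageResult`, `read_text_layer`, `run_ocr` and the `min_chars` threshold are all hypothetical names and values, and the two callables stand in for whatever PDF text extractor and OCR engine you end up using.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class PageResult:
    text: str
    source: str  # "text_layer" or "ocr"


def extract_page_text(
    page: object,
    read_text_layer: Callable[[object], str],
    run_ocr: Callable[[object], str],
    min_chars: int = 20,  # assumed threshold, tune it on your own documents
) -> PageResult:
    """Use the embedded text layer if it has enough content, else fall back to OCR."""
    text = (read_text_layer(page) or "").strip()
    if len(text) >= min_chars:
        return PageResult(text=text, source="text_layer")
    # Scanned pages usually have an empty (or nearly empty) text layer.
    return PageResult(text=(run_ocr(page) or "").strip(), source="ocr")


if __name__ == "__main__":
    # Stand-in "pages": real code would pass page objects from a PDF library.
    digital = {"layer": "Part No. ABC-123, rated voltage 24 V DC", "scan": ""}
    scanned = {"layer": "", "scan": "Part No. XYZ-9, rated voltage 12 V DC"}
    for page in (digital, scanned):
        result = extract_page_text(
            page,
            read_text_layer=lambda p: p["layer"],
            run_ocr=lambda p: p["scan"],
        )
        print(result.source, "|", result.text)
```

The resulting text can then go to a regular NLP model such as a text classifier. For datasheets where the table layout itself carries the meaning, this text-only route loses that structure, which is where the computer vision framing becomes more attractive.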
For some general discussion around working with text-based images, you might also find this thread interesting: