Datasheet training on different customer specification

Gemma · April 12, 2021, 11:44am

I have around 200 customers, and each customer's datasheet is different (either a scanned PDF or a text PDF). I am trying to create a pre-trained model for this task, but there are challenges: the data is tabular and annotating it is difficult. Are there any known models for this?

ines · April 14, 2021, 12:19am

Hi! It's difficult to give good advice here because it really depends on the documents, the types of data you want to extract and the features you want to use in your model. This will also inform how you set up the annotation task in the end. If you're working with regular text, a pipeline of OCR / text extraction plus a regular NLP model, e.g. a text classifier, may work fine. If you're working with PDFs with more complex layouts, framing it as a computer vision task may be a better option.
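
As a rough illustration of that decision, here is a minimal sketch of how each page could be routed to one of the two approaches. Everything in it is an assumption made for the example: the `PageInfo` fields, the `choose_pipeline` helper, the route names and the `min_chars` threshold are hypothetical and not part of any library. How you actually detect a text layer or a complex layout depends on your PDF tooling.

```python
from dataclasses import dataclass


@dataclass
class PageInfo:
    """Hypothetical summary of one PDF page, filled in by your own tooling."""
    text: str             # text layer extracted from the page ("" for a pure scan)
    complex_layout: bool  # e.g. tables or multi-column forms, however you detect them


def choose_pipeline(page: PageInfo, min_chars: int = 50) -> str:
    """Pick a processing route for a page.

    The route names and the min_chars threshold are illustrative
    assumptions, not recommended values.
    """
    if page.complex_layout:
        # Layout carries meaning, so treat the page as an image
        # and frame the problem as a computer vision task.
        return "vision"
    if len(page.text.strip()) >= min_chars:
        # A usable text layer exists: extract the text and feed it to an NLP model.
        return "text"
    # Little or no text layer (likely a scan): run OCR first, then the NLP model.
    return "ocr_then_text"


if __name__ == "__main__":
    pages = [
        PageInfo(text="Rated voltage 230 V, rated current 16 A. " * 3, complex_layout=False),
        PageInfo(text="", complex_layout=False),
        PageInfo(text="Part no. | Qty | Tolerance", complex_layout=True),
    ]
    for page in pages:
        print(choose_pipeline(page))
```

A router like this mainly helps you split the 200 customer formats into groups, so that each group gets an annotation task that fits it instead of forcing one setup on every document.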

For some general discussion around working with text-based images, you might also find this thread interesting:

