Datasheet training on different customer specification

ines · April 14, 2021, 12:19am

Hi! It's difficult to give good advice here because it really depends on the documents, the types of data you want to extract and the features you want to use in your model. These choices will also determine how you set up the annotation task. If you're working with regular text, a pipeline of OCR / text extraction followed by a regular NLP model, e.g. a text classifier, may work fine. If you're working with PDFs that have more complex layouts, framing it as a computer vision task may be the better option.
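To make the first option a bit more concrete, here's a rough sketch of the step between text extraction and annotation: turning already-extracted page text into JSONL tasks with a `"text"` and a `"meta"` field. It assumes the OCR / text extraction has already happened elsewhere. The `pages_to_jsonl` helper, the file name and the sample strings are all made up for illustration, and splitting on blank lines is just one possible way to chunk the text.

```python
import json


def pages_to_jsonl(doc_name, pages):
    """Turn extracted page texts into JSONL lines, one task per paragraph.

    Hypothetical helper: `pages` is a list of strings, one per page,
    produced by whatever OCR / text extraction step you use.
    """
    lines = []
    for page_number, page_text in enumerate(pages, start=1):
        # Assumption: paragraphs are separated by blank lines.
        for paragraph in page_text.split("\n\n"):
            # Collapse line breaks and repeated spaces left over from extraction.
            text = " ".join(paragraph.split())
            if not text:
                continue
            task = {"text": text, "meta": {"source": doc_name, "page": page_number}}
            lines.append(json.dumps(task))
    return lines


# Placeholder strings standing in for real extraction output.
pages = [
    "Supply voltage: 3.3 V\n\nOperating temperature: -40 to 85 C",
    "Package: QFN-32",
]
jsonl_lines = pages_to_jsonl("datasheet_a.pdf", pages)
print("\n".join(jsonl_lines))
```

Keeping the source file and page number in `"meta"` makes it easier to trace an annotated example back to the original document later. If the layout carries most of the information (tables, positioned fields), this kind of flattening will likely lose too much, which is where the computer vision framing comes in.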

For some general discussion around working with text-based images, you might also find this thread interesting:
