Size of the raw text in the source file

aoliveirahen · November 4, 2019, 4:27pm

Hello!
I am a complete beginner in A.I. and I have what I think is a basic question:

What size should the raw text be that I pass to Prodigy?

I am trying to create an A.I. that reads a PDF and then identifies the company that created the document.

Should I pass the entire PDF text as a single "raw text", or should I collect only some pieces of raw text from the PDF?

I am asking because when I run ner.manual, it takes each line as a separate raw text, which made me wonder: is it better to pass the entire document text as a single line, or should I extract some pieces of raw text from the document?

ines · November 4, 2019, 6:48pm

Hi! I just answered a similar question here:

The main thing that's important is that your training and runtime inputs should match. So if you're training on single paragraphs, your model should also be run on single paragraphs. But if you have control over the preprocessing, that's usually no problem.
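As a rough illustration of keeping the two sides consistent, here is a minimal sketch that splits extracted PDF text into paragraphs and writes one JSON object with a "text" key per line, which is the JSONL shape Prodigy reads. The helper names `split_paragraphs` and `to_jsonl`, the blank-line splitting rule, and the sample text are all assumptions made for this example, not part of Prodigy; the idea is only that the same splitting function would be reused before calling the trained model.

```python
import json
import re


def split_paragraphs(document_text):
    """Split raw document text into paragraphs on blank lines (assumed rule)."""
    parts = re.split(r"\n\s*\n", document_text)
    # Collapse internal whitespace and drop empty chunks.
    return [" ".join(part.split()) for part in parts if part.strip()]


def to_jsonl(paragraphs):
    """Serialize paragraphs as JSONL: one {"text": ...} object per line."""
    return "\n".join(json.dumps({"text": p}) for p in paragraphs)


# Hypothetical text, standing in for whatever a PDF extractor returns.
sample = (
    "ACME Corp.\nAnnual Report 2019\n\n"
    "This document was prepared by ACME Corp.\n\n\n"
    "Page 2"
)

paragraphs = split_paragraphs(sample)
print(to_jsonl(paragraphs))
```

At runtime, the same `split_paragraphs` function would be applied to each new document, and the model would be run on each resulting paragraph rather than on the whole text at once.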

aoliveirahen · November 4, 2019, 8:46pm

Thanks a lot, Ines!
