I would like to know how experts approach this kind of NLP workflow.
I have a project with PDF files. In each document, I would like to run NER to extract the person's name and the reason for resignation. To build the training set, I have used my own code to split each PDF into sentences (using spaCy) and loaded each sentence into Prodigy for labeling and training.
My questions are:
1.) Should I use a long paragraph/page instead of a sentence for labeling? Some of the sentences are not complete sentences.
2.) Should I run the model on long paragraphs/pages, given that it was mostly trained on sentences rather than long paragraphs/pages?
Hi! It's important that the examples your model sees during training are similar to the examples the model will see at runtime. So if you want to run your model over sentences, you should train it on sentences and then it also makes more sense to annotate sentences.
So as long as it's consistent, it doesn't matter that much whether you're using longer paragraphs or shorter sentences. We typically recommend annotating shorter texts because they're quicker to read/scan and you collect more datapoints overall. If you're annotating data for NER, this also makes it more obvious when a narrow context window makes it difficult to make the annotation decision. If the annotator struggles with this, the model is also less likely to make the distinction. (You can read more on this here.) So if you can, I'd say going with shorter segments is better.
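To make the consistency point concrete, here is a minimal sketch of routing both the annotation data and the runtime input through the same segmentation step. The helper names (`split_sentences`, `make_segments`) and the sample texts are made up for illustration, and the naive regex splitter is only a stand-in for whatever spaCy-based sentence splitting you already use; the point is just that one function produces the segments in both places.

```python
import re
from typing import Callable, Iterable, List


def split_sentences(text: str) -> List[str]:
    """Naive stand-in splitter (hypothetical): break after ., ! or ?
    when followed by whitespace. Swap in your own spaCy-based splitting."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [part for part in parts if part]


def make_segments(
    documents: Iterable[str],
    splitter: Callable[[str], List[str]] = split_sentences,
) -> List[str]:
    """Turn raw document texts into the segments the model will see."""
    segments: List[str] = []
    for doc_text in documents:
        segments.extend(splitter(doc_text))
    return segments


# Invented sample texts, not real data.
training_docs = ["John Smith resigned on Monday. He cited health reasons."]
runtime_docs = ["Jane Doe has stepped down. She wants to pursue other work."]

# The same function prepares the examples you annotate ...
training_segments = make_segments(training_docs)
# ... and the texts you later pass to the trained model.
runtime_segments = make_segments(runtime_docs)
```

If you decide to work with paragraphs instead, you would change the splitter once and both the training and runtime segments would change together.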