Splitting bigger documents for NER

Hi,

My team is using Prodigy to label data and train a Named Entity Recognizer, and we have a couple of questions about it. Our documents are 5-20 pages long, and we would use around 8 new entity types plus the LOC entity from de_core_news_md. We are wondering:

  1. Should we split the documents? If yes, what is the best practice for doing so? There are around 1-3 occurrences of each label in the whole document. Should we split documents by page or even by paragraph?

  2. How should we then prepare (split) documents when we use doc = nlp(text) to make predictions? The same way as in training?

  3. If we are going to use a pretrained model, what is the best recipe to start with?

  4. There is no recipe in Prodigy for evaluating the quality of a model against an evaluation data set, am I right? So should we use spacy evaluate, with the evaluation data in the binary .spacy format? We could export the data with the prodigy db-out command and then convert it to the binary format with spacy convert. Is that right, or is there another, better way to do it?

Thank you!

Yes, in general, I'd recommend splitting the documents. There's not really an advantage to annotating whole documents as one: it's less efficient for the annotator, and the model will only take the local context into account anyway, so you might as well be annotating sentences or paragraphs.

How you do it kinda depends on your data and what makes the most sense. You could use a simple preprocessing script that splits on, say, \n\n to create the paragraphs, or use spaCy's doc.sents (either via the dependency parse, a trained senter component or a simpler, rule-based sentencizer component) to split into sentences. You might just want to experiment here to see what works best.
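As a rough illustration of the simple preprocessing script mentioned above, here is a minimal sketch that splits a document on blank lines and produces one annotation task per paragraph. The function names, the doc_id value and the "meta" fields are made up for this example; the only thing Prodigy needs is a "text" key per record, and you'd write each record as one line of a JSONL file.

```python
import json
import re


def split_paragraphs(text):
    """Split on blank lines and drop empty chunks."""
    return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]


def to_prodigy_tasks(doc_id, text):
    """Create one task dict per paragraph.

    The "meta" entries are optional and only there so each paragraph
    can be traced back to its source document (hypothetical field names).
    """
    return [
        {"text": para, "meta": {"doc_id": doc_id, "paragraph": i}}
        for i, para in enumerate(split_paragraphs(text))
    ]


sample = "Erster Absatz über Berlin.\n\nZweiter Absatz.\n\n\nDritter Absatz."
tasks = to_prodigy_tasks("doc-001", sample)

# One JSON object per line, ready to be written to a .jsonl file
jsonl = "\n".join(json.dumps(task, ensure_ascii=False) for task in tasks)
print(jsonl)
```

If paragraphs turn out to be too long or too inconsistent in your data, the same structure works with sentences from doc.sents instead of the blank-line split.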

Yes, with any preprocessing you're doing, it's usually a good idea to run the same preprocessing during annotation and at runtime.
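To make that a bit more concrete, here is one way it could look at runtime, assuming you split on blank lines during annotation. This is only a sketch: split_with_offsets and predict_entities are hypothetical helper names, and nlp stands for whatever pipeline you've loaded. The paragraphs are processed with nlp.pipe and the entity offsets are shifted back so they refer to the original, unsplit document.

```python
import re


def split_with_offsets(text):
    """Split on blank lines, keeping each paragraph's start offset in `text`."""
    raw = []
    start = 0
    for sep in re.finditer(r"\n\s*\n", text):
        raw.append((start, text[start:sep.start()]))
        start = sep.end()
    raw.append((start, text[start:]))

    chunks = []
    for offset, para in raw:
        stripped = para.strip()
        if stripped:
            # Account for leading whitespace removed by strip()
            lead = len(para) - len(para.lstrip())
            chunks.append((offset + lead, stripped))
    return chunks


def predict_entities(nlp, text):
    """Run `nlp` on each paragraph and return (start, end, label) tuples
    with character offsets relative to the full document."""
    chunks = split_with_offsets(text)
    docs = nlp.pipe(para for _, para in chunks)
    entities = []
    for (offset, _), doc in zip(chunks, docs):
        for ent in doc.ents:
            entities.append(
                (ent.start_char + offset, ent.end_char + offset, ent.label_)
            )
    return entities


# Usage, assuming spaCy and your trained pipeline are available:
# import spacy
# nlp = spacy.load("path/to/your/pipeline")  # hypothetical path
# print(predict_entities(nlp, open("document.txt", encoding="utf8").read()))
```

The important part is that the splitting logic is the same function (or at least the same rule) you used to create the annotation tasks, so the model sees inputs of the same shape it was trained on.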

Yes, spacy evaluate already does exactly what you want here so there's no need for Prodigy to provide its own evaluation command that's just a copy of that.

In Prodigy, you want to be using data-to-spacy to export your annotations (training and evaluation data) as .spacy files. You can then train and evaluate with spaCy directly if you want.
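For reference, the export and evaluation steps could be scripted roughly like this. It's a sketch that only builds and prints the two commands; it assumes Prodigy v1.11 or later (where data-to-spacy takes an output directory and a --ner argument with an optional eval: prefix) and spaCy v3. The dataset names and paths are placeholders, so double-check the arguments against prodigy data-to-spacy --help and python -m spacy evaluate --help for your versions.

```python
import shlex


def data_to_spacy_cmd(output_dir, train_dataset, eval_dataset):
    """Command to export NER annotations to a corpus directory.

    Assumes the Prodigy v1.11+ syntax, where a dataset prefixed with
    "eval:" is used as the evaluation set.
    """
    return [
        "prodigy", "data-to-spacy", output_dir,
        "--ner", f"{train_dataset},eval:{eval_dataset}",
    ]


def evaluate_cmd(model_path, dev_path, metrics_path):
    """Command to score a trained pipeline on the exported evaluation data."""
    return [
        "python", "-m", "spacy", "evaluate",
        model_path, dev_path, "--output", metrics_path,
    ]


# Placeholder dataset names and paths, replace with your own
export = data_to_spacy_cmd("./corpus", "ner_train", "ner_eval")
evaluate = evaluate_cmd("./output/model-best", "./corpus/dev.spacy", "metrics.json")

print(shlex.join(export))
print(shlex.join(evaluate))

# To actually run them:
# import subprocess
# subprocess.run(export, check=True)
# subprocess.run(evaluate, check=True)
```

In between the two commands you'd train with spacy train, which is what produces the pipeline directory that spacy evaluate loads.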