Ideal input length for spaCy model

Hello, I was under the assumption that input for NER models should be about one sentence in length (reinforced by some of the use cases demonstrated in Prodigy demos, as well as comments that annotators should see less text rather than more), but this tweet by @honnibal is making me think otherwise. Additionally, the spaCy documentation implies that input should be paragraph-length. Can anyone explain the significance of this tweet in relation to creating an ideal NER model with Prodigy/spaCy?

I definitely see the potential for confusion there! This is a good question, and something I’ve been thinking about how to improve.

The tricky thing is that there’s a difference between what’s ideal for the spaCy model, and what’s best for the annotation workflow. In Prodigy, we recommend that inputs be kept quite short, because it’s much faster to annotate short pieces of text. It saves you from scrolling, and you can stay in a better flow. Raising the annotation rate and keeping quality high (by avoiding information overload) is useful enough that it’s worth making the pre-processing pipeline a bit more complicated.

In general though, it's good if the text the model sees at runtime matches how the data was pre-processed during training. It's also good if the pre-processing doesn't introduce any errors, especially sentence segmentation errors. Segmentation errors can easily cause NER mistakes, because named entities often have periods in them, and sometimes other punctuation (e.g. "Yahoo!") that confuses rule-based segmentation strategies.
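To make that concrete, here's a minimal sketch of the kind of failure I mean, using spaCy's rule-based `sentencizer` (this assumes spaCy v3's string-based `add_pipe` API, and the exact splits can vary by version):

```python
import spacy

# Minimal sketch: the rule-based sentencizer splits on sentence-final
# punctuation, so the "!" inside "Yahoo!" can be treated as a boundary.
nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")

doc = nlp("Yahoo! Inc. was founded in 1994. It grew quickly.")
for sent in doc.sents:
    print(repr(sent.text))
# Expect (depending on version) a spurious split after "Yahoo!",
# cutting the company name in half before the model ever sees it.
```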

If you can divide the text into short paragraphs reliably, that's a good annotation unit for Prodigy. For instance, if you're working with tweets, you should just annotate whole tweets. On the other hand, if you're working with whole articles, you should probably segment them into either sentences or paragraphs for annotation. You can then decide whether you want to put the data back together before training.
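As a rough sketch of what that pre-processing can look like (the blank-line splitting rule and the `source`/`position` meta keys here are just illustrative, not anything built into Prodigy):

```python
import json

def split_paragraphs(text, source_id):
    # Naive paragraph splitting on blank lines; real articles may need
    # something smarter, but the task format is the important part.
    for i, para in enumerate(text.split("\n\n")):
        para = para.strip()
        if para:
            yield {"text": para, "meta": {"source": source_id, "position": i}}

article = "First paragraph of the article.\n\nSecond paragraph."
for task in split_paragraphs(article, "article-42"):
    print(json.dumps(task))  # one JSONL task per paragraph for Prodigy
```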

Unfortunately the workflow for dividing documents into small pieces for annotation, and then putting them back together for export as a training corpus, is currently a bit lacking in Prodigy. This is something we’re very keen to improve. If you write some scripts for this, we’d be eager to consider adding them to the Prodigy recipes repo.

Got it, thanks for clarifying! Right now I'm using the annotation meta field to keep track of each sentence's source and its position within that source.

Yeah, that’s definitely the best way to do it at the moment; I’m glad you came to the right solution. It’s just that we’d rather have something built-in to do that book-keeping, and to make sure the right approach is the default, or at least the most obvious one.
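For anyone following along, the re-grouping side of that book-keeping looks roughly like this. The `source` and `position` meta keys are the scheme described in this thread, not a built-in Prodigy convention, and exporting with `prodigy db-out` is just one way to get the annotations as JSONL:

```python
import json
from collections import defaultdict

def regroup(examples):
    # Collect annotated sentences per source document, then restore
    # their original order using the position stored at split time.
    docs = defaultdict(list)
    for eg in examples:
        docs[eg["meta"]["source"]].append(eg)
    for source, egs in docs.items():
        egs.sort(key=lambda eg: eg["meta"]["position"])
        yield source, egs

# Assumes annotations were exported to JSONL, e.g. with `prodigy db-out`.
with open("annotations.jsonl", encoding="utf8") as f:
    examples = [json.loads(line) for line in f]
for source, egs in regroup(examples):
    print(source, [eg["text"][:30] for eg in egs])
```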