Ideal input length for spaCy model

Hello, I was under the assumption that input for NER models should be about one sentence in length (reinforced by some of the use cases demonstrated in Prodigy demos, as well as comments that annotators should see less text rather than more), but this tweet by @honnibal is making me think otherwise. Additionally, the spaCy documentation implies that input should be paragraph-length. Can anyone explain the significance of this tweet in relation to creating an ideal NER model with Prodigy/spaCy?

I definitely see the potential for confusion there! This is a good question, and something I’ve been thinking about how to improve.

The tricky thing is that there’s a difference between what’s ideal for the spaCy model, and what’s best for the annotation workflow. In Prodigy, we recommend that inputs be kept quite short, because it’s much faster to annotate short pieces of text. It saves you from scrolling, and you can stay in a better flow. Raising the annotation rate and keeping quality high (by avoiding information overload) is useful enough that it’s worth making the pre-processing pipeline a bit more complicated.

In general though, it's good if the text the model sees at runtime matches how the data was pre-processed during training. It's also good if the pre-processing doesn't introduce any errors, especially sentence segmentation errors. Segmentation errors can easily cause NER mistakes, because named entities often have periods in them, and sometimes other punctuation (e.g. "Yahoo!") that confuses rule-based segmentation strategies.
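To make that concrete, here's a minimal sketch of the kind of failure I mean, using spaCy's rule-based `sentencizer` (this assumes spaCy v3's string-based `add_pipe` API, and the exact splits can vary by version):

```python
import spacy

# Minimal sketch: the rule-based sentencizer splits on sentence-final
# punctuation, so the "!" inside "Yahoo!" can be treated as a boundary.
nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")

doc = nlp("Yahoo! Inc. was founded in 1994. It grew quickly.")
for sent in doc.sents:
    print(repr(sent.text))
# Expect (depending on version) a spurious split after "Yahoo!",
# cutting the company name in half before the model ever sees it.
```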

If you can divide the text into short paragraphs reliably, that's a good annotation unit for Prodigy. For instance, if you're working with tweets, you should just annotate whole tweets. On the other hand, if you're working with whole articles, you should probably segment them into either sentences or paragraphs for annotation. You can then decide whether you want to put the data back together before training.
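As a rough sketch of what that pre-processing can look like (the blank-line splitting rule and the `source`/`position` meta keys here are just illustrative, not anything built into Prodigy):

```python
import json

def split_paragraphs(text, source_id):
    # Naive paragraph splitting on blank lines; real articles may need
    # something smarter, but the task format is the important part.
    for i, para in enumerate(text.split("\n\n")):
        para = para.strip()
        if para:
            yield {"text": para, "meta": {"source": source_id, "position": i}}

article = "First paragraph of the article.\n\nSecond paragraph."
for task in split_paragraphs(article, "article-42"):
    print(json.dumps(task))  # one JSONL task per paragraph for Prodigy
```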

Unfortunately the workflow for dividing documents into small pieces for annotation, and then putting them back together for export as a training corpus, is currently a bit lacking in Prodigy. This is something we’re very keen to improve. If you write some scripts for this, we’d be eager to consider adding them to the Prodigy recipes repo.

Got it, thanks for clarifying! Right now I'm using the annotation meta field to keep track of each sentence's source and its position within that source.

Yeah, that’s definitely the best way to do it at the moment; I’m glad you came to the right solution. It’s just that we’d rather have something built-in to do that book-keeping, and to make sure the right approach is the default, or at least the most obvious one.
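For anyone following along, the re-grouping side of that book-keeping looks roughly like this. The `source` and `position` meta keys are the scheme described in this thread, not a built-in Prodigy convention, and exporting with `prodigy db-out` is just one way to get the annotations as JSONL:

```python
import json
from collections import defaultdict

def regroup(examples):
    # Collect annotated sentences per source document, then restore
    # their original order using the position stored at split time.
    docs = defaultdict(list)
    for eg in examples:
        docs[eg["meta"]["source"]].append(eg)
    for source, egs in docs.items():
        egs.sort(key=lambda eg: eg["meta"]["position"])
        yield source, egs

# Assumes annotations were exported to JSONL, e.g. with `prodigy db-out`.
with open("annotations.jsonl", encoding="utf8") as f:
    examples = [json.loads(line) for line in f]
for source, egs in regroup(examples):
    print(source, [eg["text"][:30] for eg in egs])
```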