prodigy splitting sentences for annotation

Hi! The ner.manual recipe doesn't split sentences – that's why the split_sents_threshold has no effect here. There are two options to split your texts:

  1. Update the recipe in recipes/ner.py (or use this template to write your own custom version of ner.manual). The split_sentences preprocessor takes an nlp object that can split sentences (either with a parser or the sentencizer) and your stream. For example:
from prodigy.components.preprocess import split_sentences

# inside the recipe function, once nlp and your stream have been loaded
stream = split_sentences(nlp, stream)
  2. Preprocess your JSONL file in Python and use spaCy to split sentences. Then save it to a new file and use that in Prodigy. For example:
import spacy
import srsly  # to easily read/write JSONL etc.

nlp = spacy.load("en_core_web_sm")  # or whatever you need
examples = srsly.read_jsonl("./Lease-7.jsonl")
texts = (eg["text"] for eg in examples)

new_examples = []
for doc in nlp.pipe(texts):
    for sent in doc.sents:
        new_examples.append({"text": sent.text})
srsly.write_jsonl("./Lease-7-with-sentences.jsonl", new_examples)
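One thing to be aware of with the preprocessing script: it only keeps the "text" of each sentence, so anything else stored on the original examples is dropped. If your JSONL has a "meta" field you'd like to keep, here is a rough sketch of one way to carry it over. It relies on nlp.pipe with as_tuples=True, which passes a context object through alongside each text. The split_examples helper is a made-up name for illustration, and the sketch assumes each input line is a dict with a "text" key and an optional "meta" dict.

```python
def split_examples(nlp, examples):
    """Yield one task per sentence, copying over the source example's "meta".

    Hypothetical helper: `nlp` is any spaCy pipeline that sets sentence
    boundaries (parser or sentencizer), `examples` is an iterable of dicts
    with a "text" key and, optionally, a "meta" dict.
    """
    # as_tuples=True makes nlp.pipe yield (doc, context) pairs, so each
    # sentence can be traced back to the example it came from.
    pairs = ((eg["text"], eg) for eg in examples)
    for doc, eg in nlp.pipe(pairs, as_tuples=True):
        for sent in doc.sents:
            text = sent.text.strip()
            if not text:
                # Skip spans that contain only whitespace.
                continue
            task = {"text": text}
            if "meta" in eg:
                # Copy so that tasks don't share one mutable dict.
                task["meta"] = dict(eg["meta"])
            yield task
```

With nlp loaded as in the script above, the generator can be passed straight to srsly.write_jsonl, e.g. srsly.write_jsonl(out_path, split_examples(nlp, srsly.read_jsonl(in_path))), where in_path and out_path stand in for your own file names.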