Hi! The ner.manual recipe doesn't split sentences, which is why the split_sents_threshold setting has no effect here. There are two options to split your texts:
- Update the recipe in recipes/ner.py (or use this template to write your own custom version of ner.manual). The split_sentences preprocessor takes an nlp object that can split sentences (either with a parser or the sentencizer) and your stream. For example:
from prodigy.components.preprocess import split_sentences
# after your stream is loaded etc.
stream = split_sentences(nlp, stream)
- Preprocess your JSONL file in Python and use spaCy to split sentences. Then save it to a new file and use that in Prodigy. For example:
import spacy
import srsly # to easily read/write JSONL etc.
nlp = spacy.load("en_core_web_sm") # or whatever you need
examples = srsly.read_jsonl("./Lease-7.jsonl")
texts = (eg["text"] for eg in examples)
new_examples = []
for doc in nlp.pipe(texts):
    for sent in doc.sents:
        new_examples.append({"text": sent.text})
srsly.write_jsonl("./Lease-7-with-sentences.jsonl", new_examples)
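If you don't want to load a full model just for sentence boundaries, here is a rough sketch of the same preprocessing idea using a blank pipeline with the rule-based sentencizer. It assumes spaCy v3, and split_examples is a hypothetical helper written for this illustration, not part of Prodigy or spaCy. The "meta" key in the demo data is also made up. Unlike the snippet above, it copies any other keys on the task over to each sentence:

```python
import spacy

# Assumption: rule-based sentence splitting is good enough for your texts.
# Swap in spacy.load("en_core_web_sm") if you want parser-based boundaries.
nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")  # spaCy v3 syntax


def split_examples(nlp, examples):
    """Yield one task per sentence, keeping the other keys of the source task.

    Hypothetical helper for illustration only.
    """
    # as_tuples=True lets us pass each original task along with its text
    pairs = ((eg["text"], eg) for eg in examples)
    for doc, eg in nlp.pipe(pairs, as_tuples=True):
        for sent in doc.sents:
            text = sent.text.strip()
            if not text:
                continue  # skip whitespace-only sentences
            new_eg = {key: value for key, value in eg.items() if key != "text"}
            new_eg["text"] = text
            yield new_eg


# Made-up demo data standing in for the tasks read from your JSONL file
examples = [
    {"text": "This is the first sentence. Here is another one.", "meta": {"source": "demo"}}
]
result = list(split_examples(nlp, examples))
for eg in result:
    print(eg)
```

Since the function takes and yields plain dicts, you could feed it srsly.read_jsonl(...) and pass the output to srsly.write_jsonl(...) just like in the second option.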