prodigy splitting sentences for annotation

ines · December 9, 2019, 5:04pm

Hi! The ner.manual recipe doesn't split sentences – that's why the split_sents_threshold has no effect here. There are two options to split your texts:

1. Update the recipe in recipes/ner.py (or use this template to write your own custom version of ner.manual). The split_sentences preprocessor takes your stream and an nlp object that can split sentences (either with a parser or the sentencizer). For example:

from prodigy.components.preprocess import split_sentences

# after your stream is loaded etc.
stream = split_sentences(nlp, stream)

2. Preprocess your JSONL file in Python and use spaCy to split sentences. Then save it to a new file and use that in Prodigy. For example:

import spacy
import srsly  # to easily read/write JSONL etc.

nlp = spacy.load("en_core_web_sm")  # or whatever you need
examples = srsly.read_jsonl("./Lease-7.jsonl")
texts = (eg["text"] for eg in examples)

new_examples = []
for doc in nlp.pipe(texts):
    for sent in doc.sents:
        new_examples.append({"text": sent.text})
srsly.write_jsonl("./Lease-7-with-sentences.jsonl", new_examples)
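Note that the snippet above only keeps the "text" of each example, so any other fields (like "meta") are dropped. If you need those, here's a rough sketch of one way to carry them over. To keep it runnable without a model, it uses a naive regex splitter as a stand-in. The naive_split and split_examples helpers and the "sent_index" key are made up for illustration and are not part of Prodigy or spaCy. In practice you'd probably pass in a function that returns the text of each sentence in doc.sents instead.

```python
import json
import re


def naive_split(text):
    """Hypothetical stand-in splitter: break after ., ! or ? followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [part for part in parts if part]


def split_examples(examples, split_fn=naive_split):
    """Yield one example per sentence, copying the other fields of the source example."""
    # Character offsets in "spans"/"tokens" refer to the full text, so they
    # would be wrong for a single sentence and are not copied.
    skip = {"text", "spans", "tokens"}
    for eg in examples:
        for i, sent in enumerate(split_fn(eg["text"])):
            new_eg = {k: v for k, v in eg.items() if k not in skip}
            new_eg["text"] = sent
            # "sent_index" is an assumed key name that records the sentence position
            new_eg["meta"] = {**eg.get("meta", {}), "sent_index": i}
            yield new_eg


examples = [
    {"text": "The lease starts in May. Rent is due monthly.", "meta": {"source": "Lease-7"}}
]
sentences = list(split_examples(examples))
for eg in sentences:
    print(json.dumps(eg))
```

The regex splitter will break on abbreviations and similar cases, so it's really only there to make the example self-contained. The part that matters is copying the remaining fields onto each new sentence-level example.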
