Sentence-based classification: Automated sentence splitting?

I know there is a built-in sentence splitter for NER tasks. We would like to classify individual sentences (binary). As text input we have large texts (10-20 sentences). Now we would like to automatically display only one sentence from the text at a time and apply classification. Is this possible? Or do we need to apply preprocessing in order to create the sentences as text input in the input CSV?

Yes, this should be easy to implement. Prodigy actually comes with a split_sentences function that wraps and preprocesses your stream. You can see an example of this in the source of the NER recipes, e.g. ner.teach. So in your text classification recipe, you can do the following:

from prodigy.components.preprocess import split_sentences

# before you return the recipe components
stream = split_sentences(nlp, stream)

This will take the incoming stream of examples, use the spaCy model and its dependency parser to split the texts into sentences, and yield them as individual annotation examples. Using the dependency parse is often more accurate than a rule-based approach – but depending on your data, you can also customise the sentence segmentation strategy.
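To illustrate what "rule-based" means here, a deliberately naive splitter could break on sentence-final punctuation followed by whitespace. This is only a sketch (the `naive_split` helper is made up, not part of Prodigy or spaCy) – it shows the kind of strategy the dependency parse usually beats on abbreviations, quotes etc.:

```python
import re

def naive_split(text):
    # Hypothetical rule-based splitter: break after ".", "!" or "?"
    # when followed by whitespace. Much cruder than spaCy's parser,
    # which will not split on e.g. "z.B." or "Dr.".
    return [s.strip() for s in re.split(r'(?<=[.!?])\s+', text.strip()) if s]

sentences = naive_split("Das ist der erste Satz. Das ist der zweite Satz!")
# two separate sentence strings
```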


Thank you. Thanks to your hint I managed to create ugly but so-far working code, which splits my input into sentences.

import csv
import prodigy
from prodigy.components.preprocess import split_sentences
from prodigy.recipes.textcat import teach
import spacy

@prodigy.recipe('sentencer',
    dataset=('Dataset ID', 'positional', None, str),
    label=('Category label', 'option', 'l', str),
    exclude=('Datasets to exclude', 'option', 'e', str),
    view_id=('Annotation interface', 'option', 'v', str))
def sentencer(dataset, label, exclude, view_id='text'):
    # load your own stream from anywhere you want
    stream = custom_csv_loader("/uni/XY.csv")

    def update(examples):
        # this function is triggered when Prodigy receives annotations
        # (not wired into the returned components here)
        print("Received {} annotations!".format(len(examples)))

    # split the incoming texts into individual sentences
    nlp = spacy.load('de_core_news_sm')
    stream = split_sentences(nlp, stream)

    # reuse the built-in textcat.teach recipe with the preprocessed stream
    components = teach(dataset=dataset, spacy_model='de_core_news_sm',
                       source=stream, label=label, exclude=exclude)
    return components


def custom_csv_loader(file_path):
    with open(file_path) as csvfile:
        reader = csv.DictReader(csvfile)
        for row in reader:
            text = row.get('Text')
            yield {'text': text, 'meta': {'original': text}}
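To sanity-check the loader's output format without starting Prodigy, the same logic can be written against any open file object (the `StringIO` sample and the `custom_csv_loader_from_file` name below are mine, purely for illustration):

```python
import csv
import io

def custom_csv_loader_from_file(csvfile):
    # Same logic as the recipe's custom_csv_loader, but reading from an
    # already-open file object so it can be tested without a fixed path.
    reader = csv.DictReader(csvfile)
    for row in reader:
        text = row.get('Text')
        yield {'text': text, 'meta': {'original': text}}

sample = io.StringIO("Text\nErster Satz. Zweiter Satz.\n")
examples = list(custom_csv_loader_from_file(sample))
# each example is a dict with 'text' and a 'meta' copy of the original
```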

I start prodigy via:
python3.5 -m prodigy sentencer XY --label breach -F --exclude XY

This actually looks pretty neat :+1:

Do you mean within the same session, or when you exit the server and start again?

It was a bug in the old version I posted (and edited after a few minutes – you are just too fast :D). It is working fine now: I had forgotten to pass the exclude parameter through to the components I return.

Thank you!


Ahh sorry :stuck_out_tongue: Glad it’s working now!