Sentence-based classification: Automated sentence splitting?

baeumer · June 14, 2018, 11:34am

I know there is a built-in sentence splitter for NER tasks. We would like to classify individual sentences (binary). Our input consists of long texts (10-20 sentences each). We would now like to automatically display only one sentence of a text at a time and classify it. Is this possible? Or do we need to preprocess the data so that each sentence is its own Text entry in the input CSV?

ines · June 14, 2018, 11:42am

Yes, this should be easy to implement. Prodigy actually comes with a split_sentences function that wraps and preprocesses your stream. You can see an example of this in the source of the NER recipes, e.g. ner.teach. So in your text classification recipe, you can do the following:

from prodigy.components.preprocess import split_sentences

# before you return the recipe components
stream = split_sentences(nlp, stream)

This will take the incoming stream of examples, use the spaCy model and its dependency parser to split the texts into sentences, and yield them as individual annotation examples. Using the dependency parse is often more accurate than a rule-based approach, but depending on your data, you can also customise the sentence segmentation strategy.
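To make the "one example in, several sentence examples out" idea concrete, here is a minimal sketch in plain Python. It is not Prodigy's implementation: naive_split_sentences is a hypothetical stand-in that uses a simple punctuation rule instead of the dependency parse, and it only assumes that each incoming example is a dict with a 'text' key.

```python
import re


def naive_split_sentences(stream):
    """Hypothetical stand-in for split_sentences: yield one task per sentence."""
    # Naive rule (an assumption for this sketch): a sentence ends at ., ! or ?
    # followed by whitespace. A parser-based splitter is usually more accurate.
    boundary = re.compile(r'(?<=[.!?])\s+')
    for eg in stream:
        for sent in boundary.split(eg['text'].strip()):
            if sent:
                task = dict(eg)  # shallow copy so other keys (e.g. meta) carry over
                task['text'] = sent
                yield task


stream = [{'text': 'Das ist ein Satz. Hier ist noch einer! Und ein dritter?',
           'meta': {'source': 'demo'}}]
for task in naive_split_sentences(stream):
    print(task['text'])
```

Because the function is a generator that wraps another stream, it can be chained in the same place as the split_sentences call above.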

baeumer · June 14, 2018, 9:33pm

Thank you. Thanks to your hint I managed to write ugly but so far working code that splits my input into sentences.

import csv
import prodigy
from prodigy.components.preprocess import split_sentences
from prodigy.recipes.textcat import teach
import spacy


@prodigy.recipe('sentencer',
    dataset=('Dataset ID', 'positional', None, str),
    label=prodigy.recipe_args['label'],
    exclude=prodigy.recipe_args['exclude'],
    view_id=('Annotation interface', 'option', 'v', str))
def sentencer(dataset, label, exclude, view_id='text'):
    # note: view_id is accepted on the command line but not used below
    # load your own streams from anywhere you want
    stream = custom_csv_loader("/uni/XY.csv")
    print(label)  # debug output

    # split each text into sentences so one sentence is shown at a time
    nlp = spacy.load('de_core_news_sm')
    stream = split_sentences(nlp, stream)

    # reuse the built-in textcat.teach recipe and pass exclude through to it
    components = teach(dataset=dataset, spacy_model='de_core_news_sm',
                       source=stream, label=label, exclude=exclude)
    return components


def custom_csv_loader(file_path):
    # expects a CSV file with a 'Text' column; yields one task per row
    with open(file_path, newline='') as csvfile:
        reader = csv.DictReader(csvfile)
        for row in reader:
            text = row.get('Text')
            yield {'text': text, 'meta': {'original': text}}

I start Prodigy via:
python3.5 -m prodigy sentencer XY --label breach -F sentencer.py --exclude XY

ines · June 14, 2018, 9:40pm

This actually looks pretty neat

Do you mean within the same session, or when you exit the server and start again?

baeumer · June 14, 2018, 9:51pm

It was a bug in the old version I posted (and edited after a few minutes - you are just toooo fast :D). It is working fine now. I had forgotten to pass the exclude parameter to the components I return.
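For anyone hitting the same issue, here is a rough sketch of what an exclude step amounts to: dropping examples from the stream that have already been annotated. This is only an illustration in plain Python. The helper names text_hash and exclude_seen are hypothetical, and hashing the raw text with SHA-256 is an assumption made for the sketch; Prodigy uses its own hashing and filtering internally.

```python
import hashlib


def text_hash(text):
    # Hypothetical hash of the raw text, used only for this sketch.
    return hashlib.sha256(text.encode('utf8')).hexdigest()


def exclude_seen(stream, seen_hashes):
    """Yield only the tasks whose text hash is not in seen_hashes."""
    for eg in stream:
        if text_hash(eg['text']) not in seen_hashes:
            yield eg


# Pretend the first sentence was annotated in an earlier session.
already_annotated = {text_hash('Erster Satz.')}
stream = [{'text': 'Erster Satz.'}, {'text': 'Zweiter Satz.'}]
remaining = list(exclude_seen(stream, already_annotated))
print([eg['text'] for eg in remaining])
```

If the list of datasets to exclude never reaches the code that does this filtering, nothing is filtered out, which is the kind of mistake described above.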

Thank you!

ines · June 14, 2018, 10:01pm

Ahh, sorry! Glad it’s working now!
