breaking down texts into sentences for textcat

Hello, a short design question. I have text data in which each text consists of possibly several sentences, and each text has a unique ID. I want to annotate / train / predict labels on the sentences of these texts, preserving the connection of each sentence to the ID of the text it came from. I can think of two ways of doing this:

  1. start spaCy, add the "sentencizer", make a new ID for each sentence (consisting of the original ID and the sentence number), write a new database, and start Prodigy for the textcat tasks on that database, or
  2. use a Prodigy recipe, based on the specific textcat task, that *somehow* includes the sentencizer and the new-ID steps in one pipeline.

If you say 2 is possible, then I'll try to figure that out. But I'm skeptical because I have no idea how to include the renaming of the ID in the pipeline. What do you think?

hi @rwst!

Thanks for your question and welcome to the Prodigy community :wave:

Both of your proposed methods are possible, but they would each have their own advantages and disadvantages.

  1. The first method you proposed, using spaCy's sentencizer to split the text into sentences and then creating a new ID for each sentence, is a straightforward and effective approach. This method would allow you to easily track which sentences belong to which original text. However, you would have to manage a separate database for these sentence IDs, which might add some complexity to your workflow. If it's helpful, you may want to preserve the document ID as metadata (e.g., structure each sentence with "meta": {"doc_id": 1}), which Prodigy can display in the UI. There's a sketch of this preprocessing step right after this list.
  2. The second method, creating a custom Prodigy recipe that includes the sentencizer and ID renaming steps, is a more complex but potentially more streamlined approach. You could use a custom recipe to automate the entire process, which would save you some manual work. However, creating a custom recipe would require more initial setup, and you would need to be comfortable with Python programming.
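
To make the first method concrete, here's a minimal sketch of that preprocessing step. It assumes your raw texts live in a hypothetical texts.jsonl file with one {"text": ..., "meta": {"doc_id": ...}} record per line, and that the texts are English; adjust the file names and language code to your data:

import json
import spacy

nlp = spacy.blank("en")  # assumption: English; use your own language code
nlp.add_pipe("sentencizer")

# hypothetical input: one {"text": ..., "meta": {"doc_id": ...}} record per line
with open("texts.jsonl") as fin, open("sentences.jsonl", "w") as fout:
    for line in fin:
        record = json.loads(line)
        doc_id = record["meta"]["doc_id"]
        for i, sent in enumerate(nlp(record["text"]).sents):
            task = {"text": sent.text,
                    "meta": {"doc_id": doc_id, "sentence_id": f"{doc_id}-{i}"}}
            fout.write(json.dumps(task) + "\n")

Any built-in textcat recipe can then read sentences.jsonl directly, and every annotation keeps a pointer back to its source text via the meta fields.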

In a custom recipe, you could include a step that adds a new field to each example containing the original text ID and the sentence number. You can use Prodigy's built-in split_sentences helper from prodigy.components.preprocess to do the splitting. This would allow you to track which sentences belong to which original text. Here's a rough example of what that might look like:

from prodigy.components.preprocess import split_sentences

def add_sentence_ids_to_stream(nlp, stream, text_id):
    # split_sentences yields one task per sentence; number the tasks so
    # each sentence stays linked to the text it came from
    for i, example in enumerate(split_sentences(nlp, stream)):
        example.setdefault("meta", {})["sentence_id"] = f"{text_id}-{i}"
        yield example

You could then use this function in your recipe like so:

import prodigy
import spacy
from prodigy.components.loaders import get_stream

@prodigy.recipe("textcat.sentences")
def textcat_sentences(dataset, source, text_id):
    nlp = spacy.blank("en")  # any pipeline that sets sentence boundaries works
    nlp.add_pipe("sentencizer")
    stream = get_stream(source, rehash=True, dedup=True)
    stream = add_sentence_ids_to_stream(nlp, stream, text_id)
    # rest of your recipe here...

This is a very basic example and you'd likely need to adapt it to fit your specific needs, but hopefully it gives you an idea of what's possible.
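
One thing the sketch glosses over: a recipe function has to return Prodigy's components dictionary (at minimum "dataset", "stream", and "view_id") for the server to start. The body of the recipe would end with something like this, where "classification" is just a placeholder interface that assumes each task carries a "label" field:

    # ... set up labels, callbacks, etc. ...
    return {
        "dataset": dataset,  # where annotations are saved
        "stream": stream,    # the sentence-level tasks
        "view_id": "classification",  # placeholder; pick the interface you need
    }

Assuming you save all of this in a hypothetical recipe.py, you could then start the server with prodigy textcat.sentences my_dataset ./texts.jsonl doc -F recipe.py, where doc fills the text_id argument and -F points Prodigy at your recipe file.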

Another project you should look into -- it may not be a solution, but it'll definitely be enlightening -- is my colleague @koaning's approach in his frontpage project.

He described his workflow, and how he rethought parts of it, in his PyData Amsterdam keynote.

An older post shows how the custom recipe route can also improve the UX for annotators: you still do sentence classification, but you provide the full paragraph so the annotator can see the sentence in context.

[screenshot: textcat_sent_sequence]
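
That post's setup is more involved, but a quick way to get some of the same benefit -- just a sketch, with a hypothetical helper name -- is to carry the surrounding paragraph along in each task's "meta", since Prodigy displays meta fields on the annotation card:

def sentences_with_context(nlp, stream):
    # emit one task per sentence, but keep the full paragraph in "meta"
    # so Prodigy shows it at the bottom of the annotation card
    for example in stream:
        doc = nlp(example["text"])
        for sent in doc.sents:
            meta = dict(example.get("meta", {}))
            meta["context"] = doc.text
            yield {"text": sent.text, "meta": meta}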

Definitely check out the Prodigy documentation on custom recipes if you haven't already. It provides a lot of helpful information and examples that can guide you in creating your own custom recipe.

Hope this helps!


Thanks @ryanwesslen. The answer leaves no immediate questions -- very helpful. I'll follow up when I've put all the pieces together.