breaking down texts into sentences for textcat

Hello, a short design question. I have text data in which each text consists of possibly several sentences, and each text has a unique ID. I want to annotate / train / predict labels on the sentences of these texts, preserving the connection of each sentence to the ID of the text it came from. I can think of two ways of doing this:

  1. start spaCy, add the "sentencizer", make a new ID for each sentence (consisting of the original ID and the sentence number), write a new database, and start Prodigy for the textcat tasks on that database, or
  2. use a Prodigy recipe, based on the specific textcat task, that *somehow* includes the sentencizer and the new-ID steps in one pipeline.

If you say 2 is possible, then I'll try to figure that out. But I'm skeptical because I have no idea how to include the renaming of the ID in the pipeline. What do you think?

hi @rwst!

Thanks for your question and welcome to the Prodigy community :wave:

Both of your proposed methods are possible, but they would each have their own advantages and disadvantages.

  1. The first method you proposed, using spaCy's sentencizer to split the text into sentences and then creating a new ID for each sentence, is a straightforward and effective approach. This method would allow you to easily track which sentences belong to which original text. However, you would have to manage a separate database for these sentence IDs, which might add some complexity to your workflow. If it's helpful, you may want to preserve the document ID as metadata (e.g., structure each sentence with "meta": {"doc_id": 1}), which Prodigy can display in the UI. There's a sketch of this preprocessing step right after this list.
  2. The second method, creating a custom Prodigy recipe that includes the sentencizer and ID renaming steps, is a more complex but potentially more streamlined approach. You could use a custom recipe to automate the entire process, which would save you some manual work. However, creating a custom recipe would require more initial setup, and you would need to be comfortable with Python programming.
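
To make the first method concrete, here's a minimal sketch of that preprocessing step. It assumes your raw texts live in a hypothetical texts.jsonl file with one {"text": ..., "meta": {"doc_id": ...}} record per line, and that the texts are English; adjust the file names and language code to your data:

import json
import spacy

nlp = spacy.blank("en")  # assumption: English; use your own language code
nlp.add_pipe("sentencizer")

# hypothetical input: one {"text": ..., "meta": {"doc_id": ...}} record per line
with open("texts.jsonl") as fin, open("sentences.jsonl", "w") as fout:
    for line in fin:
        record = json.loads(line)
        doc_id = record["meta"]["doc_id"]
        for i, sent in enumerate(nlp(record["text"]).sents):
            task = {"text": sent.text,
                    "meta": {"doc_id": doc_id, "sentence_id": f"{doc_id}-{i}"}}
            fout.write(json.dumps(task) + "\n")

Any built-in textcat recipe can then read sentences.jsonl directly, and every annotation keeps a pointer back to its source text via the meta fields.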

In a custom recipe, you could include a step that adds a new field to each example containing the original text ID and the sentence number. You can use Prodigy's built-in split_sentences helper from prodigy.components.preprocess to do the splitting. This would allow you to track which sentences belong to which original text. Here's a rough example of what that might look like:

from prodigy.components.preprocess import split_sentences

def add_sentence_ids_to_stream(nlp, stream, text_id):
    # split_sentences yields one task per sentence; number the tasks so
    # each sentence stays linked to the text it came from
    for i, example in enumerate(split_sentences(nlp, stream)):
        example.setdefault("meta", {})["sentence_id"] = f"{text_id}-{i}"
        yield example

You could then use this function in your recipe like so:

import prodigy
import spacy
from prodigy.components.loaders import get_stream

@prodigy.recipe("textcat.sentences")
def textcat_sentences(dataset, source, text_id):
    nlp = spacy.blank("en")  # any pipeline that sets sentence boundaries works
    nlp.add_pipe("sentencizer")
    stream = get_stream(source, rehash=True, dedup=True)
    stream = add_sentence_ids_to_stream(nlp, stream, text_id)
    # rest of your recipe here...

This is a very basic example and you'd likely need to adapt it to fit your specific needs, but hopefully it gives you an idea of what's possible.
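
One thing the sketch glosses over: a recipe function has to return Prodigy's components dictionary (at minimum "dataset", "stream", and "view_id") for the server to start. The body of the recipe would end with something like this, where "classification" is just a placeholder interface that assumes each task carries a "label" field:

    # ... set up labels, callbacks, etc. ...
    return {
        "dataset": dataset,  # where annotations are saved
        "stream": stream,    # the sentence-level tasks
        "view_id": "classification",  # placeholder; pick the interface you need
    }

Assuming you save all of this in a hypothetical recipe.py, you could then start the server with prodigy textcat.sentences my_dataset ./texts.jsonl doc -F recipe.py, where doc fills the text_id argument and -F points Prodigy at your recipe file.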

Another project you should look into -- it may not be a solution, but it'll definitely be enlightening -- is my colleague @koaning's approach in his frontpage project.

He described his workflow, and how he rethought parts of it, in his PyData Amsterdam keynote.

An older post shows how the custom recipe route can also improve the UX for annotators: you still do sentence classification, but you provide the full paragraph so the annotator can see the sentence in context.

[screenshot: textcat_sent_sequence]
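
That post's setup is more involved, but a quick way to get some of the same benefit -- just a sketch, with a hypothetical helper name -- is to carry the surrounding paragraph along in each task's "meta", since Prodigy displays meta fields on the annotation card:

def sentences_with_context(nlp, stream):
    # emit one task per sentence, but keep the full paragraph in "meta"
    # so Prodigy shows it at the bottom of the annotation card
    for example in stream:
        doc = nlp(example["text"])
        for sent in doc.sents:
            meta = dict(example.get("meta", {}))
            meta["context"] = doc.text
            yield {"text": sent.text, "meta": meta}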

Definitely check out the Prodigy documentation on custom recipes if you haven't already. It provides a lot of helpful information and examples that can guide you in creating your own custom recipe.

Hope this helps!


Thanks @ryanwesslen. The answer leaves no immediate questions -- very helpful. I'll follow up when I've put all the pieces together.