Classify sentences with paragraph visible

Hi there!

I'm working on a task that classifies text from court opinions into two classes of legal interpretation at the sentence level. The ideal setup would present an annotator with a paragraph (4-5 sentences long) and let them select each consecutive sentence and choose a label for the class it falls within.

The Prodigy setup I'm most familiar with usually presents the annotator with only the single span of text they are labeling, so in my case an annotator would see one sentence rather than the whole paragraph. For a few reasons, it's important for my project that annotators assign a label to a sentence while being able to read the entire paragraph. I've played around with a few recommendations on this support page (like Marking sentences for classification) that reframe this as a span categorization task, including Ines' recommendation below:

Alternatively, a similar use case was posted on the forum a while ago and they ended up using the manual span interface, but feeding in data with a "tokens" property, but with one sentence as a "token". This would let you view the sentences in their natural flow, and you could double-click on them to select them, and even assign them different labels.

By feeding in sentences as tokens and using the spans.manual recipe, I'm able to get close to the setup I'd like, but there are a couple of issues. When I highlight a sentence and assign it a label, all sentences automatically receive that same label, even though I didn't select them (shown in the first screenshot). I haven't been able to assign different labels to different sentences, which is what I need.
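
For reference, each task in my input looks roughly like this, with one sentence per "token" (the text and character offsets here are just an illustration, not my actual data):

{"text": "The court applied the plain meaning rule. It then turned to legislative history.", "tokens": [{"text": "The court applied the plain meaning rule.", "start": 0, "end": 41, "id": 0, "ws": true}, {"text": "It then turned to legislative history.", "start": 42, "end": 80, "id": 1, "ws": false}]}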

If I highlight only part of a sentence, I get the error message below. I don't want labels for parts of sentences, but I'm worried that the error points to a bigger issue with how I've set this up. I'm including this picture in case it helps show what I've done wrong:

My guess is that the root of the issue is how I've structured the input JSON. I'd appreciate tips on how to structure the input file so that the paragraph is visible but labels are assigned to individual sentences. Please also let me know if there's a different approach that might work better.

Thank you!

hi @rosamond!

Thanks for your question and welcome to the Prodigy community :wave:

Thanks for the excellent background! That helped me a lot.

Would something like this work?

textcat_sent_sequence

It's a custom recipe that loops through the sentences of each paragraph, presenting one at a time for binary classification within the context of its paragraph.
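
In rough outline, it segments each paragraph into sentences with spaCy and emits one classification task per sentence, keeping the full paragraph on the task. Here's a simplified sketch, not the exact gist: the HTML highlighting is just one way to show the current sentence in context, and the loader import assumes a Prodigy v1.11-style install:

    import prodigy
    import spacy
    from prodigy.components.loaders import JSONL

    @prodigy.recipe(
        "textcat_sent_sequence",
        dataset=("Dataset to save annotations to", "positional", None, str),
        source=("JSONL file with a 'paragraph' key per line", "positional", None, str),
        spacy_model=("spaCy pipeline used for sentence segmentation", "positional", None, str),
        label=("Label to ask about for each sentence", "positional", None, str),
    )
    def textcat_sent_sequence(dataset, source, spacy_model, label):
        nlp = spacy.load(spacy_model)

        def get_stream():
            for eg in JSONL(source):
                paragraph = eg["paragraph"]
                for sent in nlp(paragraph).sents:
                    # Render the whole paragraph with the current sentence
                    # highlighted, so it is always read in context.
                    html = paragraph.replace(sent.text, f"<mark>{sent.text}</mark>")
                    yield {
                        "paragraph": paragraph,
                        "sentence": sent.text,
                        "html": html,
                        "label": label,
                    }

        return {
            "dataset": dataset,
            "stream": get_stream(),
            "view_id": "classification",
        }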

The recipe makes two user-experience choices to speed up annotation. First, since the task is framed as binary, you can annotate quickly with key bindings (no mouse or span highlighting required):

  • A to accept (positive)
  • X to reject (negative)

Second, if your input data is ordered, annotation can be even faster, since you read the sentences in the order the document was meant to be read. You could even provide more than one paragraph at a time (e.g., two or three).

I wrote up a GitHub gist that documents the steps:

Input data: paragraphs

{"paragraph": "Prodigy is a modern annotation tool for creating training and evaluation data for machine learning models. You can also use Prodigy to help you inspect and clean your data, do error analysis and develop rule-based systems to use in combination with your statistical models."}
{"paragraph": "The Python library includes a range of pre-built workflows and command-line commands for various tasks, and well-documented components for implementing your own workflow scripts. Your scripts can specify how the data is loaded and saved, change which questions are asked in the annotation interface, and can even define custom HTML and JavaScript to change the behavior of the front-end. The web application is optimized for fast, intuitive and efficient annotation."}
{"paragraph": "Prodigy’s mission is to help you do more of all those manual or semi-automatic processes that we all know we don’t do enough of. To most data scientists, the advice to spend more time looking at your data is sort of like the advice to floss or get more sleep: it’s easy to recognize that it’s sound advice, but not always easy to put into practice. Prodigy helps by giving you a practical, flexible tool that fits easily into your workflow. With concrete steps to follow instead of a vague goal, annotation and data inspection will change from something you should do, to something you will do."}

Run Prodigy Recipe

python -m prodigy textcat_sent_sequence sent_dataset input_paragraphs.jsonl en_core_web_sm MY_LABEL -F textcat_sent_sequence.py

Annotation output (first 2 examples)

If you like this recipe, I'd recommend changing the key "sentence" to "text" in the recipe, as that's what Prodigy expects. I left it as "sentence" to make the difference between sentences and paragraphs explicit.

{
  "paragraph": "Prodigy is a modern annotation tool for creating training and evaluation data for machine learning models. You can also use Prodigy to help you inspect and clean your data, do error analysis and develop rule-based systems to use in combination with your statistical models.",
  "sentence": "Prodigy is a modern annotation tool for creating training and evaluation data for machine learning models.",
  "label": "MY_LABEL",
  "_input_hash": -1908851693,
  "_task_hash": -2083698072,
  "_view_id": "classification",
  "answer": "accept",
  "_timestamp": 1674846351
}
{
  "paragraph": "Prodigy is a modern annotation tool for creating training and evaluation data for machine learning models. You can also use Prodigy to help you inspect and clean your data, do error analysis and develop rule-based systems to use in combination with your statistical models.",
  "sentence": "You can also use Prodigy to help you inspect and clean your data, do error analysis and develop rule-based systems to use in combination with your statistical models.",
  "label": "MY_LABEL",
  "_input_hash": -1655152078,
  "_task_hash": 1852110735,
  "_view_id": "classification",
  "answer": "reject",
  "_timestamp": 1674846353
}

With the Prodigy docs on custom recipes or the prodigy-recipes repo, you should be able to generalize it from binary to multi-class or multi-label. For example, this is from textcat_manual:

    # Add labels to each task in the stream: with more than one label,
    # add_label_options_to_stream adds an "options" list for the choice UI;
    # otherwise add_labels_to_stream attaches the single label to each task.
    has_options = len(label) > 1
    if has_options:
        stream = add_label_options_to_stream(stream, label)
    else:
        stream = add_labels_to_stream(stream, label)
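
Applied to the sentence recipe, that generalization could look roughly like the sketch below (the helper name is illustrative); the "choice" interface then renders the labels as selectable options:

    # Sketch: attach label options to each sentence task and switch the
    # interface to "choice".
    def add_options(stream, labels):
        for eg in stream:
            eg["options"] = [{"id": label, "text": label} for label in labels]
            yield eg

    # ...and in the recipe's return dict:
    #     "view_id": "choice",
    #     "config": {"choice_style": "single"},  # or "multiple" for multi-label

With "choice_style": "single", each option also gets a number key binding, so the keyboard-only workflow carries over.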

Let me know if this works or if you're able to make progress!

Hi @ryanwesslen! Thank you so much for your help! This is exactly what I was imagining. I'll add an update to this thread once I've implemented your recommendation, but I really appreciate the quick support and solution.

@ryanwesslen thank you again, this ended up working flawlessly! Your custom recipe was easy to adapt to a multi-class task. I appreciate it!

Thanks to you and the whole Prodigy team!