Classify sentences with paragraph visible

Hi there!

I'm working on a task that classifies text from court opinions into two classes of legal interpretation at the sentence level. The ideal setup would present an annotator with a paragraph (4-5 sentences long) and let them select each consecutive sentence and choose a label for the class it falls within.

The Prodigy setup I'm most familiar with usually presents the annotator with only the single span of text they are labeling, so in my case an annotator would see one sentence rather than the whole paragraph. For a few reasons, it's important for my project that annotators assign a label to a sentence while being able to read the entire paragraph. I've played around with a few recommendations on this support page (like Marking sentences for classification) that reframe this as a span categorization task, including Ines' recommendation below:

Alternatively, a similar use case was posted on the forum a while ago and they ended up using the manual span interface, but feeding in data with a "tokens" property, but with one sentence as a "token". This would let you view the sentences in their natural flow, and you could double-click on them to select them, and even assign them different labels.

By feeding in sentences as tokens and using the spans.manual recipe, I'm able to get close to the setup I'd like, but there are a couple of issues. When I highlight a sentence and assign it a label, all sentences automatically receive that same label, even though I didn't select them (shown in the first screenshot). I haven't been able to assign different labels to different sentences, which is what I need.
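
For reference, each task in my input looks roughly like this, with one sentence per "token" (the text and character offsets here are just an illustration, not my actual data):

{"text": "The court applied the plain meaning rule. It then turned to legislative history.", "tokens": [{"text": "The court applied the plain meaning rule.", "start": 0, "end": 41, "id": 0, "ws": true}, {"text": "It then turned to legislative history.", "start": 42, "end": 80, "id": 1, "ws": false}]}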

If I highlight only part of a sentence, I get the error message below. I don't want labels for parts of sentences, but I'm worried that the error points to a bigger issue with how I've set this up. I'm including this picture in case it helps show what I've done wrong:

My guess is that the root of the issue is how I've structured the input JSON. I'd appreciate tips on how to structure the input file so that the paragraph is visible but labels are assigned to individual sentences. Please also let me know if there's a different approach that might work better.

Thank you!

hi @rosamond!

Thanks for your question and welcome to the Prodigy community :wave:

Thanks for the excellent background! That helped me a lot.

Would something like this work?

textcat_sent_sequence

It's a custom recipe that loops through the sentences of each paragraph, presenting one at a time for binary classification within the context of its paragraph.
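
In rough outline, it segments each paragraph into sentences with spaCy and emits one classification task per sentence, keeping the full paragraph on the task. Here's a simplified sketch, not the exact gist: the HTML highlighting is just one way to show the current sentence in context, and the loader import assumes a Prodigy v1.11-style install:

    import prodigy
    import spacy
    from prodigy.components.loaders import JSONL

    @prodigy.recipe(
        "textcat_sent_sequence",
        dataset=("Dataset to save annotations to", "positional", None, str),
        source=("JSONL file with a 'paragraph' key per line", "positional", None, str),
        spacy_model=("spaCy pipeline used for sentence segmentation", "positional", None, str),
        label=("Label to ask about for each sentence", "positional", None, str),
    )
    def textcat_sent_sequence(dataset, source, spacy_model, label):
        nlp = spacy.load(spacy_model)

        def get_stream():
            for eg in JSONL(source):
                paragraph = eg["paragraph"]
                for sent in nlp(paragraph).sents:
                    # Render the whole paragraph with the current sentence
                    # highlighted, so it is always read in context.
                    html = paragraph.replace(sent.text, f"<mark>{sent.text}</mark>")
                    yield {
                        "paragraph": paragraph,
                        "sentence": sent.text,
                        "html": html,
                        "label": label,
                    }

        return {
            "dataset": dataset,
            "stream": get_stream(),
            "view_id": "classification",
        }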

The recipe makes two user-experience choices to speed up annotation. First, since the task is framed as binary, you can annotate quickly with key bindings (no mouse or span highlighting required):

  • A to accept (positive)
  • X to reject (negative)

Second, if your input data is ordered, annotation can be even faster, since you read the sentences in the order the document was meant to be read. You could even provide more than one paragraph at a time (e.g., two or three).

I wrote up a GitHub gist that documents the steps:

Input data: paragraphs

{"paragraph": "Prodigy is a modern annotation tool for creating training and evaluation data for machine learning models. You can also use Prodigy to help you inspect and clean your data, do error analysis and develop rule-based systems to use in combination with your statistical models."}
{"paragraph": "The Python library includes a range of pre-built workflows and command-line commands for various tasks, and well-documented components for implementing your own workflow scripts. Your scripts can specify how the data is loaded and saved, change which questions are asked in the annotation interface, and can even define custom HTML and JavaScript to change the behavior of the front-end. The web application is optimized for fast, intuitive and efficient annotation."}
{"paragraph": "Prodigy’s mission is to help you do more of all those manual or semi-automatic processes that we all know we don’t do enough of. To most data scientists, the advice to spend more time looking at your data is sort of like the advice to floss or get more sleep: it’s easy to recognize that it’s sound advice, but not always easy to put into practice. Prodigy helps by giving you a practical, flexible tool that fits easily into your workflow. With concrete steps to follow instead of a vague goal, annotation and data inspection will change from something you should do, to something you will do."}

Run Prodigy Recipe

python -m prodigy textcat_sent_sequence sent_dataset input_paragraphs.jsonl en_core_web_sm MY_LABEL -F textcat_sent_sequence.py

Annotation output (first 2 examples)

If you like this recipe, I'd recommend changing the key "sentence" to "text" in the recipe, as that's what Prodigy expects. I left it as "sentence" to make the difference between sentences and paragraphs explicit.

{
  "paragraph": "Prodigy is a modern annotation tool for creating training and evaluation data for machine learning models. You can also use Prodigy to help you inspect and clean your data, do error analysis and develop rule-based systems to use in combination with your statistical models.",
  "sentence": "Prodigy is a modern annotation tool for creating training and evaluation data for machine learning models.",
  "label": "MY_LABEL",
  "_input_hash": -1908851693,
  "_task_hash": -2083698072,
  "_view_id": "classification",
  "answer": "accept",
  "_timestamp": 1674846351
}
{
  "paragraph": "Prodigy is a modern annotation tool for creating training and evaluation data for machine learning models. You can also use Prodigy to help you inspect and clean your data, do error analysis and develop rule-based systems to use in combination with your statistical models.",
  "sentence": "You can also use Prodigy to help you inspect and clean your data, do error analysis and develop rule-based systems to use in combination with your statistical models.",
  "label": "MY_LABEL",
  "_input_hash": -1655152078,
  "_task_hash": 1852110735,
  "_view_id": "classification",
  "answer": "reject",
  "_timestamp": 1674846353
}

With the Prodigy docs on custom recipes or the prodigy-recipes repo, you should be able to generalize it from binary to multi-class or multi-label. For example, this is from textcat_manual:

    # Add labels to each task in the stream: with more than one label,
    # add_label_options_to_stream adds an "options" list for the choice UI;
    # otherwise add_labels_to_stream attaches the single label to each task.
    has_options = len(label) > 1
    if has_options:
        stream = add_label_options_to_stream(stream, label)
    else:
        stream = add_labels_to_stream(stream, label)
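
Applied to the sentence recipe, that generalization could look roughly like the sketch below (the helper name is illustrative); the "choice" interface then renders the labels as selectable options:

    # Sketch: attach label options to each sentence task and switch the
    # interface to "choice".
    def add_options(stream, labels):
        for eg in stream:
            eg["options"] = [{"id": label, "text": label} for label in labels]
            yield eg

    # ...and in the recipe's return dict:
    #     "view_id": "choice",
    #     "config": {"choice_style": "single"},  # or "multiple" for multi-label

With "choice_style": "single", each option also gets a number key binding, so the keyboard-only workflow carries over.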

Let me know if this works or if you're able to make progress!

Hi @ryanwesslen! Thank you so much for your help! This is exactly what I was imagining. I'll add an update to this thread once I've implemented your recommendation, but I really appreciate the quick support and solution.

@ryanwesslen thank you again, this ended up working flawlessly! Your custom recipe was easy to adapt to a multi-class task. I appreciate it!

Thanks to you and the whole Prodigy team!