Classify sentences with paragraph visible

hi @rosamond!

Thanks for your question and welcome to the Prodigy community :wave:

Thanks for the excellent background! That helped me a lot.

Would something like this work?

textcat_sent_sequence

It's a custom recipe that loops through each paragraph, highlighting one sentence at a time for binary classification within its context (the paragraph).

It has two user-experience design choices that enable faster annotation. First, since it's framed as a binary task, you can annotate quickly with key bindings (no mouse or span highlighting required):

  • A to accept (positive)
  • X to reject (negative)

Second, if your input data is ordered, annotation may be even faster, since you work through the sentences in the order the document was intended to be read. You could even show more than one paragraph at a time (e.g., two or three).
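To give a rough idea of the core logic, here's a simplified sketch of the stream the recipe builds (not the actual gist): one task per sentence, with the full paragraph kept alongside for context. The real recipe would use spaCy's sentence segmenter (`nlp(paragraph).sents`); the naive regex splitter here is just a stand-in so the sketch is self-contained.

```python
import re

def naive_split(paragraph):
    # Stand-in for spaCy's sentence segmenter; splits after
    # sentence-final punctuation followed by whitespace
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", paragraph) if s.strip()]

def sentence_stream(paragraphs, label, split=naive_split):
    """Yield one binary classification task per sentence, keeping the
    full paragraph on each task so it can be shown as context."""
    for paragraph in paragraphs:
        for sentence in split(paragraph):
            yield {
                "paragraph": paragraph,
                "sentence": sentence,
                "label": label,
            }

tasks = list(sentence_stream(
    ["Prodigy is an annotation tool. It is scriptable in Python."],
    "MY_LABEL",
))
# Two sentences -> two tasks, each carrying the whole paragraph
```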

I wrote up a GitHub gist that documents the steps:

Input data: paragraphs

{"paragraph": "Prodigy is a modern annotation tool for creating training and evaluation data for machine learning models. You can also use Prodigy to help you inspect and clean your data, do error analysis and develop rule-based systems to use in combination with your statistical models."}
{"paragraph": "The Python library includes a range of pre-built workflows and command-line commands for various tasks, and well-documented components for implementing your own workflow scripts. Your scripts can specify how the data is loaded and saved, change which questions are asked in the annotation interface, and can even define custom HTML and JavaScript to change the behavior of the front-end. The web application is optimized for fast, intuitive and efficient annotation."}
{"paragraph": "Prodigy’s mission is to help you do more of all those manual or semi-automatic processes that we all know we don’t do enough of. To most data scientists, the advice to spend more time looking at your data is sort of like the advice to floss or get more sleep: it’s easy to recognize that it’s sound advice, but not always easy to put into practice. Prodigy helps by giving you a practical, flexible tool that fits easily into your workflow. With concrete steps to follow instead of a vague goal, annotation and data inspection will change from something you should do, to something you will do."}
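If your paragraphs start out in a plain Python list rather than a file, producing this JSONL format is a one-liner per paragraph with `json.dumps` (the sample paragraphs below are just placeholders):

```python
import json

paragraphs = [
    "Prodigy is a modern annotation tool for creating training data.",
    "The Python library includes a range of pre-built workflows.",
]

# One JSON object per line, matching the input format above
lines = [json.dumps({"paragraph": p}) for p in paragraphs]
jsonl = "\n".join(lines)
```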

Run Prodigy Recipe

python -m prodigy textcat_sent_sequence sent_dataset input_paragraphs.jsonl en_core_web_sm MY_LABEL -F textcat_sent_sequence.py

Annotation output (first 2 examples)

If you like this recipe, I'd recommend changing the key "sentence" to "text" in the recipe, as that's what Prodigy expects. I left it as "sentence" to make the difference between sentences and paragraphs explicit.

{
  "paragraph": "Prodigy is a modern annotation tool for creating training and evaluation data for machine learning models. You can also use Prodigy to help you inspect and clean your data, do error analysis and develop rule-based systems to use in combination with your statistical models.",
  "sentence": "Prodigy is a modern annotation tool for creating training and evaluation data for machine learning models.",
  "label": "MY_LABEL",
  "_input_hash": -1908851693,
  "_task_hash": -2083698072,
  "_view_id": "classification",
  "answer": "accept",
  "_timestamp": 1674846351
}
{
  "paragraph": "Prodigy is a modern annotation tool for creating training and evaluation data for machine learning models. You can also use Prodigy to help you inspect and clean your data, do error analysis and develop rule-based systems to use in combination with your statistical models.",
  "sentence": "You can also use Prodigy to help you inspect and clean your data, do error analysis and develop rule-based systems to use in combination with your statistical models.",
  "label": "MY_LABEL",
  "_input_hash": -1655152078,
  "_task_hash": 1852110735,
  "_view_id": "classification",
  "answer": "reject",
  "_timestamp": 1674846353
}
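If you'd rather keep the recipe as-is and rename the key after exporting, a small post-processing step works too. This is a sketch over a trimmed-down example record (the real export also carries the hash, view, and timestamp fields shown above):

```python
exported = [
    {
        "paragraph": "Prodigy is a modern annotation tool.",
        "sentence": "Prodigy is a modern annotation tool.",
        "label": "MY_LABEL",
        "answer": "accept",
    },
]

renamed = []
for eg in exported:
    eg = dict(eg)                    # copy so the original stays intact
    eg["text"] = eg.pop("sentence")  # downstream Prodigy workflows expect "text"
    renamed.append(eg)
```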

With the Prodigy docs on custom recipes or the prodigy-recipes repo, you should be able to generalize it from binary to multi-class or multi-label classification. For example, this is from textcat_manual:

    # Add labels to each task in stream
    has_options = len(label) > 1
    if has_options:
        stream = add_label_options_to_stream(stream, label)
    else:
        stream = add_labels_to_stream(stream, label)
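For intuition, the multi-label branch boils down to attaching an "options" list to each task so it renders in Prodigy's choice interface instead of the binary one. A rough sketch of what such a helper does (the actual helper in prodigy-recipes may differ in details):

```python
def add_label_options_to_stream(stream, labels):
    # Multi-choice variant: attach an "options" list so the task
    # renders in Prodigy's choice interface
    for task in stream:
        task["options"] = [{"id": label, "text": label} for label in labels]
        yield task

tasks = list(add_label_options_to_stream(
    [{"sentence": "Prodigy is an annotation tool.", "paragraph": "..."}],
    ["POSITIVE", "NEGATIVE", "NEUTRAL"],
))
```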

Let me know if this works or if you're able to make progress!
