Classify sentences with paragraph visible

hi @rosamond!

Thanks for your question and welcome to the Prodigy community :wave:

Thanks for the excellent background! That helped me a lot.

Would something like this work?

textcat_sent_sequence

It's a custom recipe that loops through each paragraph, highlighting one sentence at a time for binary classification within its context (the paragraph).

It has two user-experience design choices that enable faster annotation. First, since it's framed as a binary task, you can annotate quickly with key bindings (no mouse or span highlighting required):

  • A to accept (positive)
  • X to reject (negative)

Second, if your input data is ordered, annotation may be even faster, since you work through the sentences in the order the document was intended to be read. You could even show more than one paragraph at a time (e.g., two or three).
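To give a rough idea of the core logic, here's a simplified sketch of the stream the recipe builds (not the actual gist): one task per sentence, with the full paragraph kept alongside for context. The real recipe would use spaCy's sentence segmenter (`nlp(paragraph).sents`); the naive regex splitter here is just a stand-in so the sketch is self-contained.

```python
import re

def naive_split(paragraph):
    # Stand-in for spaCy's sentence segmenter; splits after
    # sentence-final punctuation followed by whitespace
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", paragraph) if s.strip()]

def sentence_stream(paragraphs, label, split=naive_split):
    """Yield one binary classification task per sentence, keeping the
    full paragraph on each task so it can be shown as context."""
    for paragraph in paragraphs:
        for sentence in split(paragraph):
            yield {
                "paragraph": paragraph,
                "sentence": sentence,
                "label": label,
            }

tasks = list(sentence_stream(
    ["Prodigy is an annotation tool. It is scriptable in Python."],
    "MY_LABEL",
))
# Two sentences -> two tasks, each carrying the whole paragraph
```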

I wrote up a GitHub gist that documents the steps:

Input data: paragraphs

{"paragraph": "Prodigy is a modern annotation tool for creating training and evaluation data for machine learning models. You can also use Prodigy to help you inspect and clean your data, do error analysis and develop rule-based systems to use in combination with your statistical models."}
{"paragraph": "The Python library includes a range of pre-built workflows and command-line commands for various tasks, and well-documented components for implementing your own workflow scripts. Your scripts can specify how the data is loaded and saved, change which questions are asked in the annotation interface, and can even define custom HTML and JavaScript to change the behavior of the front-end. The web application is optimized for fast, intuitive and efficient annotation."}
{"paragraph": "Prodigy’s mission is to help you do more of all those manual or semi-automatic processes that we all know we don’t do enough of. To most data scientists, the advice to spend more time looking at your data is sort of like the advice to floss or get more sleep: it’s easy to recognize that it’s sound advice, but not always easy to put into practice. Prodigy helps by giving you a practical, flexible tool that fits easily into your workflow. With concrete steps to follow instead of a vague goal, annotation and data inspection will change from something you should do, to something you will do."}
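If your paragraphs start out in a plain Python list rather than a file, producing this JSONL format is a one-liner per paragraph with `json.dumps` (the sample paragraphs below are just placeholders):

```python
import json

paragraphs = [
    "Prodigy is a modern annotation tool for creating training data.",
    "The Python library includes a range of pre-built workflows.",
]

# One JSON object per line, matching the input format above
lines = [json.dumps({"paragraph": p}) for p in paragraphs]
jsonl = "\n".join(lines)
```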

Run Prodigy Recipe

python -m prodigy textcat_sent_sequence sent_dataset input_paragraphs.jsonl en_core_web_sm MY_LABEL -F textcat_sent_sequence.py

Annotation output (first 2 examples)

If you like this recipe, I'd recommend changing the key "sentence" to "text" in the recipe, as that's what Prodigy expects. I left it as "sentence" to make the difference between sentences and paragraphs explicit.

{
  "paragraph": "Prodigy is a modern annotation tool for creating training and evaluation data for machine learning models. You can also use Prodigy to help you inspect and clean your data, do error analysis and develop rule-based systems to use in combination with your statistical models.",
  "sentence": "Prodigy is a modern annotation tool for creating training and evaluation data for machine learning models.",
  "label": "MY_LABEL",
  "_input_hash": -1908851693,
  "_task_hash": -2083698072,
  "_view_id": "classification",
  "answer": "accept",
  "_timestamp": 1674846351
}
{
  "paragraph": "Prodigy is a modern annotation tool for creating training and evaluation data for machine learning models. You can also use Prodigy to help you inspect and clean your data, do error analysis and develop rule-based systems to use in combination with your statistical models.",
  "sentence": "You can also use Prodigy to help you inspect and clean your data, do error analysis and develop rule-based systems to use in combination with your statistical models.",
  "label": "MY_LABEL",
  "_input_hash": -1655152078,
  "_task_hash": 1852110735,
  "_view_id": "classification",
  "answer": "reject",
  "_timestamp": 1674846353
}
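If you'd rather keep the recipe as-is and rename the key after exporting, a small post-processing step works too. This is a sketch over a trimmed-down example record (the real export also carries the hash, view, and timestamp fields shown above):

```python
exported = [
    {
        "paragraph": "Prodigy is a modern annotation tool.",
        "sentence": "Prodigy is a modern annotation tool.",
        "label": "MY_LABEL",
        "answer": "accept",
    },
]

renamed = []
for eg in exported:
    eg = dict(eg)                    # copy so the original stays intact
    eg["text"] = eg.pop("sentence")  # downstream Prodigy workflows expect "text"
    renamed.append(eg)
```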

With the Prodigy docs on custom recipes or the prodigy-recipes repo, you should be able to generalize it from binary to multi-class or multi-label classification. For example, this is from textcat_manual:

    # Add labels to each task in stream
    has_options = len(label) > 1
    if has_options:
        stream = add_label_options_to_stream(stream, label)
    else:
        stream = add_labels_to_stream(stream, label)
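For intuition, the multi-label branch boils down to attaching an "options" list to each task so it renders in Prodigy's choice interface instead of the binary one. A rough sketch of what such a helper does (the actual helper in prodigy-recipes may differ in details):

```python
def add_label_options_to_stream(stream, labels):
    # Multi-choice variant: attach an "options" list so the task
    # renders in Prodigy's choice interface
    for task in stream:
        task["options"] = [{"id": label, "text": label} for label in labels]
        yield task

tasks = list(add_label_options_to_stream(
    [{"sentence": "Prodigy is an annotation tool.", "paragraph": "..."}],
    ["POSITIVE", "NEGATIVE", "NEUTRAL"],
))
```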

Let me know if this works or if you're able to make progress!
