textcat.correct not streaming data.

Hi, Prodigy Community!

As the title says, I have been trying to get textcat.correct to work, but it is not streaming data into Prodigy.
I had to make some changes to the recipe to fit my needs: I generate the scores from the model before sending the data through the textcat.correct recipe. I made sure the data is in the right format, as you can see in the example below.

Output generated by model:

{
  "id": "xxxxxxx",
  "results": {
    "daca": [
      [
        "xxx",
        {
          "np": 0.014070617966353893,
          "rr": 0.8976043462753296,
          "or": 0.01625574380159378,
          "f": 0.0028465630020946264,
          "na": 0.0010143289109691978,
          "p": 0.007620492484420538
        }
      ]
    ],
    "us": [
      [
        "",
        {
          "np": 0,
          "rr": 0,
          "or": 0,
          "files": 0,
          "na": 0,
          "p": 0
        }
      ],
      [
        "",
        {
          "np": 0,
          "rr": 0,
          "or": 0,
          "f": 0,
          "na": 0,
          "p": 0
        }
      ]
    ]
  }
}

Data sent into textcat.correct recipe:

{
  "paper_uuid": "xxxxxxx",
  "text": "xxx",
  "answer": "reject",
  "accepted": [],
  "options": [
    {
      "id": "or",
      "text": "or",
      "meta": 0
    },
    {
      "id": "rr",
      "text": "rr",
      "meta": 0
    },
    {
      "id": "np",
      "text": "np",
      "meta": 0
    },
    {
      "id": "na",
      "text": "na",
      "meta": 0
    },
    {
      "id": "f",
      "text": "f",
      "meta": 0
    },
    {
      "id": "p",
      "text": "p",
      "meta": 0
    }
  ],
  "accept": []
}
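The reformatting itself is roughly the following sketch (to_choice_task is a hypothetical helper for illustration, and the 0.5 threshold is an assumption):

```python
def to_choice_task(text, scores, threshold=0.5):
    # Turn a {label: score} dict from the model output into a Prodigy
    # "choice" task. Field names match the example task above; the
    # rounding of "meta" is purely cosmetic.
    return {
        "text": text,
        "options": [
            {"id": label, "text": label, "meta": round(score, 3)}
            for label, score in scores.items()
        ],
        # Pre-select labels whose score clears the threshold
        "accept": [label for label, score in scores.items() if score >= threshold],
    }

task = to_choice_task("xxx", {"np": 0.014, "rr": 0.898, "or": 0.016})
```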

The modified add_suggestions function looks like the following:

def add_suggestions(stream):
    for eg in stream:
        task = copy.deepcopy(eg)
    yield task

My dataset has 10,902 records, but only 11 were streamed in. Of those 11 records, I was able to annotate only 1; the other 10 were automatically accepted/rejected. I then commented out the update section in the recipe, but nothing changed.

If anybody could shed some light on why this is happening, it'd be really helpful. Thank you.

hi @PrithaSarkar,

Thanks for your question. Sorry to hear you're having issues.

Can you explain more about this?

So you created a custom recipe that was developed off the textcat.correct recipe, right? Can you share that full recipe? It's a bit tough to debug this without seeing your recipe.

Can you explain why textcat.correct didn't work?

Was it because you were using a non-spaCy model? Your model output looks a little different, so I'm assuming it's in a different format.

Also, did you enable Prodigy logging, especially verbose, when you were running your recipe?

It can provide info on whether records are being skipped (e.g., duplicates or other problems).
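For example, you can enable verbose logging via an environment variable when launching the recipe (the dataset, model, and file names below are placeholders):

```shell
# Names below are placeholders for your own dataset, model, and files
PRODIGY_LOGGING=verbose prodigy textcat.correct my_dataset en_core_web_sm ./data.jsonl -F recipe.py
```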

Last, can you provide your Prodigy version? You can check by running prodigy stats.

Hi, Ryan!
Thanks for getting back to me.

The custom recipe is as follows:

import copy
from typing import List, Optional
import prodigy
from prodigy.components.loaders import JSONL
from prodigy.util import split_string
import spacy
from spacy.tokens import Doc
from spacy.training import Example


# Recipe decorator with argument annotations: (description, argument type,
# shortcut, type / converter function called on value before it's passed to
# the function). Descriptions are also shown when typing --help.
@prodigy.recipe(
    "textcat.correct",
    dataset=("The dataset to use", "positional", None, str),
    spacy_model=("The base model", "positional", None, str),
    source=("The source data as a JSONL file", "positional", None, str),
    label=("One or more comma-separated labels", "option", "l", split_string),
    update=("Whether to update the model during annotation", "flag", "UP", bool),
    exclude=("Names of datasets to exclude", "option", "e", split_string),
    threshold=("Score threshold to pre-select label", "option", "t", float),
    component=("Name of text classifier component in the pipeline (will be guessed from pipeline if not set)", "option", "c", str),
)

def textcat_correct(
    dataset: str,
    spacy_model: str,
    source: str,
    label: Optional[List[str]] = None,
    update: bool = False,
    exclude: Optional[List[str]] = None,
    threshold: float = 0.5,
    component: Optional[str] = None,
):
    """
    Correct the textcat model's predictions manually. Only the predictions
    above the threshold will be pre-selected; by default, all labels with a
    score of 0.5 and above are pre-selected automatically. In the built-in
    "textcat.correct" recipe, Prodigy would infer whether the categories should
    be mutually exclusive based on the component configuration. Here, for demo
    purposes, we show how it can be inferred from the pipeline config.
    """
    # Load the stream from a JSONL file and return a generator that yields a
    # dictionary for each example in the data.
    stream = JSONL(source)

    # Load the spaCy model
    nlp = spacy.load(spacy_model)

    # Get a valid classifier component from pipeline
    if not component:
        component = "textcat" if "textcat" in nlp.pipe_names else "textcat_multilabel"

    # Infer whether the labels are exclusive from pipeline config
    pipe_config = nlp.get_pipe_config(component)
    exclusive = pipe_config.get("model", {}).get("exclusive_classes", True)

    # Get labels from the model in case they are not provided
    labels = label
    if not labels:
        labels = nlp.pipe_labels.get(component, [])
    
    def add_suggestions(stream):
        for eg in stream:
            task = copy.deepcopy(eg)
        yield task

    # Update the model with the corrected examples.
    def make_update(answers):
        examples=[]
        for eg in answers:
            if eg["answer"] == "accept":
                selected = eg.get("accept", [])
                cats = {
                    opt["id"]: 1.0 if opt["id"] in selected else 0.0
                    for opt in eg.get("options", [])
                }
                # Create a doc object to be used as a training example in the model update.
                # If your examples contain tokenization, make sure not to lose this information
                # by initializing a doc object from scratch.
                doc = nlp.make_doc(eg["text"])
                examples.append(Example.from_dict(doc, {"cats": cats}))
        nlp.update(examples)

    # Add model's predictions to the tasks in the stream.
    stream = add_suggestions(stream)

    return {
        "view_id": "choice", # Annotation interface to use
        "dataset": dataset,  # Name of dataset to save annotations
        "stream": stream,  # Incoming stream of examples
        #"update": make_update if update else None,
        "exclude": exclude,  # List of dataset names to exclude
        "config": {  # Additional config settings, mostly for app UI
            # Style of choice interface
            "choice_style": "single" if exclusive and len(labels) > 1 else "multiple",
            "exclude_by": "input", # Hash value to filter out seen examples
            "auto_count_stream": not update, # Whether to recount the stream at initialization 
        },
    }

Prodigy loads the instance; however, I am only shown 1 record to annotate. While troubleshooting, I printed out task from the add_suggestions function to check whether records are streaming in correctly, and they are. But the same is not happening when I try to annotate.

The model is, in fact, a spaCy model; it's just part of a pipeline that produces the output shown above, and I need to reformat that output to go through textcat.correct for annotation.

I did not run verbose logging, and my Prodigy version is 1.11.8.

The issue is resolved when the Prodigy instance is run on its own rather than in Docker.

Thanks for the update!

By the way, I wanted to mention a couple of things too.

I noticed you were loading your input with JSONL. I suspect you found this textcat.correct in our prodigy-recipes repo. That's fine, but Prodigy's built-in recipes use get_stream instead of JSONL. get_stream has several benefits (deduplication, hashing, reading from standard input, etc.).
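To illustrate the dedup idea: conceptually, examples whose input hash has already been seen get filtered out of the stream. Here's a rough pure-Python sketch of that behavior (not Prodigy's actual implementation, which hashes more than just the raw text):

```python
import hashlib

def dedup_by_input(stream, input_key="text"):
    # Drop examples whose input value has been seen before, roughly
    # mirroring the dedup-by-input-hash behavior of get_stream.
    seen = set()
    for eg in stream:
        h = hashlib.md5(str(eg.get(input_key, "")).encode("utf8")).hexdigest()
        if h not in seen:
            seen.add(h)
            yield eg

examples = [{"text": "a"}, {"text": "b"}, {"text": "a"}]
unique = list(dedup_by_input(examples))
```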

The recipes in this repository aren't 100% identical to the built-in recipes shipped with Prodigy. They've been edited to include comments and more information, and some of them have been simplified to make it easier to follow what's going on and to use them as the basis for a custom recipe. We have an open ticket to update this repo soon.

If you're curious what the built-in recipes look like, you can find them in your Prodigy install: look for the Location: in your prodigy stats output, then go to the recipes folder there. For textcat.correct, you'd look at recipes/textcat.py. From there, you'll get a better picture of how to use get_stream in your recipe.

Relatedly, I asked about your Prodigy version because in v1.12 we slightly modified the stream logic, e.g. using stream.apply (see here). Your recipe should work fine on any v1.11 version, but you may have to make a few tweaks if you want to use it with v1.12.

Hope this helps!