get_session_questions takes a long time when using a sorter and always returns the same example

I added prefer_low_scores to my textcat.correct recipe. Since then, fetching new questions takes around 1 minute and returns a batch of 5 items (the right size), all with the same task id.

The dataset has more than 4,000 items that are not annotated yet.
The same issue happens with the prefer_uncertain sorter; I tried both algorithms.

1) What could cause the same task to be returned multiple times?
2) Is there a way to reduce the wait when using sorters?

Could you share the command that you ran? When I look at the docs, I can see a --threshold parameter but no parameter that lets you set the sorting preference. Are you running a custom recipe?

The lag might be explained by the way Prodigy filters data for labelling. The model checks each example in the stream to confirm whether it meets the threshold value. If it doesn't, the next item in the stream is checked. This goes on until we have enough examples to send to the user.
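To illustrate the idea, here is a minimal sketch in plain Python. It is not Prodigy's actual implementation: `count_checked`, `filter_by_threshold`, `take_batch`, the toy scores and the 0.6 threshold are all hypothetical and only show why rare matches make a batch slow to fill.

```python
from itertools import islice


def count_checked(scored_stream, counter):
    """Pass items through unchanged while counting how many were inspected."""
    for item in scored_stream:
        counter["checked"] += 1
        yield item


def filter_by_threshold(scored_stream, threshold):
    """Yield only the examples whose score reaches the threshold."""
    for score, eg in scored_stream:
        if score >= threshold:
            yield eg


def take_batch(stream, batch_size):
    """Pull examples from the stream until the batch is full."""
    return list(islice(stream, batch_size))


# Toy data (assumption): only every 50th example scores above the threshold.
scored = [((0.9 if i % 50 == 0 else 0.1), {"text": f"doc {i}"}) for i in range(1000)]
counter = {"checked": 0}
stream = filter_by_threshold(count_checked(scored, counter), threshold=0.6)
batch = take_batch(stream, batch_size=5)
print(len(batch), counter["checked"])
```

In this toy setup, 201 examples have to be scored to fill a batch of 5. If each of those is a long document going through the model, the wait adds up quickly.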

It could be that the threshold is configured in such a way that matching examples are relatively rare. This might explain the long wait, but it's hard to know for sure without having access to your machine. If you give it a much smaller batch, does it still take that long?

I'm using the teach.correct recipe customizing the interface, and trying the sorters so that I only correct examples where some label has a score above 0.60.

The --threshold option on the correct recipe doesn't discard examples; it only pre-selects the labels shown in the interface.
In any case, I ran experiments with and without it, and the same thing happens.

I'm using the teach.correct recipe customizing the interface

Does that mean that you're using a custom recipe? If so, could you share the code?

One thing that I wonder is, what kind of data are you labelling? Are you labelling very long documents or short texts?

Long documents, 6k+ characters (I should add a token counter to my examples)

import copy
from typing import List, Optional
import prodigy
from prodigy.components.loaders import JSONL
from prodigy.util import split_string
import spacy
from spacy.tokens import Doc
from spacy.training import Example
from prodigy.models.textcat import TextClassifier
from prodigy.components.sorters import prefer_uncertain, prefer_low_scores


@prodigy.recipe(
    "textcat.correct",  # assumed name: the decorator line is missing from the post
    dataset=("The dataset to use", "positional", None, str),
    spacy_model=("The base model", "positional", None, str),
    source=("The source data as a JSONL file", "positional", None, str),
    label=("One or more comma-separated labels", "option", "l", split_string),
    update=("Whether to update the model during annotation", "flag", "UP", bool),
    exclude=("Names of datasets to exclude", "option", "e", split_string),
    threshold=("Score threshold to pre-select label", "option", "t", float),
    component=("Name of text classifier component in the pipeline (will be guessed from pipeline if not set)", "option", "c", str),
)
def textcat_correct(
    dataset: str,
    spacy_model: str,
    source: str,
    label: Optional[List[str]] = None,
    update: bool = False,
    exclude: Optional[List[str]] = None,
    threshold: float = 0.5,
    component: Optional[str] = None,
):
    stream = JSONL(source)
    nlp = spacy.load(spacy_model)
    if not component:
        component = "textcat" if "textcat" in nlp.pipe_names else "textcat_multilabel"
    pipe_config = nlp.get_pipe_config(component)
    exclusive = pipe_config.get("model", {}).get("exclusive_classes", True)

    labels = label
    if not labels:
        labels = nlp.pipe_labels.get(component, [])
    model = TextClassifier(nlp, labels, component)
    predict = model

    # Attach the model's scores as options and pre-select labels above the threshold
    def add_suggestions(stream):
        texts = ((eg["text"], eg) for eg in stream)
        for doc, eg in nlp.pipe(texts, as_tuples=True, batch_size=10):
            task = copy.deepcopy(eg)
            options = []
            selected = []
            for cat, score in doc.cats.items():
                if cat in labels:
                    options.append({"id": cat, "text": cat, "meta": f"{score:.2f}"})
                    if score >= threshold:
                        selected.append(cat)
            task["options"] = options
            task["accept"] = selected
            yield task

    def make_update(answers):
        # Convert accepted answers into spaCy training examples and update the model
        examples = []
        for eg in answers:
            if eg["answer"] == "accept":
                selected = eg.get("accept", [])
                cats = {
                    opt["id"]: 1.0 if opt["id"] in selected else 0.0
                    for opt in eg.get("options", [])
                }
                doc = nlp.make_doc(eg["text"])
                examples.append(Example.from_dict(doc, {"cats": cats}))
        nlp.update(examples)

    stream = add_suggestions(stream)
    stream = prefer_uncertain(model(stream), algorithm="ema")
    return {
        "dataset": dataset,  # Name of dataset to save annotations
        "stream": stream,  # Incoming stream of examples
        "update": make_update if update else None,
        "exclude": exclude,  # List of dataset names to exclude
    }

Could you make sure that the code blocks render nicely? That makes it much easier to inspect the code.

I can imagine that very long documents take a lot longer to parse, so that might explain the behaviour that you're seeing. Could you confirm if you're seeing the same behaviour on shorter texts?
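One quick way to check this without preparing a second dataset is to truncate the texts and time how long the first batch takes. The sketch below is plain Python with hypothetical helpers (`truncate_texts`, `time_first_batch`) and toy data; the 500-character cut-off is an arbitrary assumption.

```python
import time
from itertools import islice


def truncate_texts(stream, max_chars):
    """Yield copies of each example with the text cut to max_chars characters."""
    for eg in stream:
        yield {**eg, "text": eg["text"][:max_chars]}


def time_first_batch(stream, batch_size):
    """Pull one batch from the stream and return it with the elapsed seconds."""
    start = time.perf_counter()
    batch = list(islice(stream, batch_size))
    return batch, time.perf_counter() - start


# Toy stand-in for the real source: two long documents.
examples = [{"text": "a" * 6000}, {"text": "b" * 6000}]
batch, seconds = time_first_batch(truncate_texts(iter(examples), 500), batch_size=2)
print(len(batch), len(batch[0]["text"]), f"{seconds:.4f}s")
```

In a recipe, the same helpers could wrap the stream before it reaches the model, once with the truncation and once without, to compare the two timings.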

Fixed the code rendering.

Test with short texts: I will run it this week.
