Long documents of 6k+ characters (I should add a token counter to my examples)
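Something like this is probably enough for that token counter; a minimal sketch, where the helper name and the "tokens" meta key are my own choices, not a Prodigy convention (Prodigy displays each task's "meta" dict in the corner of the annotation card):

import spacy

def add_token_counts(nlp, stream):
    # Tokenize with the pipeline's tokenizer only (no trained components)
    # and record the count in the task's "meta" dict so it shows up in the UI
    for eg in stream:
        eg.setdefault("meta", {})["tokens"] = len(nlp.make_doc(eg["text"]))
        yield eg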
import copy
from typing import List, Optional

import prodigy
from prodigy.components.loaders import JSONL
from prodigy.util import split_string
import spacy
from spacy.training import Example
from prodigy.models.textcat import TextClassifier
from prodigy.components.sorters import prefer_uncertain
@prodigy.recipe(
    "textcat.correct",
    dataset=("The dataset to use", "positional", None, str),
    spacy_model=("The base model", "positional", None, str),
    source=("The source data as a JSONL file", "positional", None, str),
    label=("One or more comma-separated labels", "option", "l", split_string),
    update=("Whether to update the model during annotation", "flag", "UP", bool),
    exclude=("Names of datasets to exclude", "option", "e", split_string),
    threshold=("Score threshold to pre-select label", "option", "t", float),
    component=("Name of text classifier component in the pipeline (will be guessed from the pipeline if not set)", "option", "c", str),
)
def textcat_correct(
    dataset: str,
    spacy_model: str,
    source: str,
    label: Optional[List[str]] = None,
    update: bool = False,
    exclude: Optional[List[str]] = None,
    threshold: float = 0.5,
    component: Optional[str] = None,
):
    stream = JSONL(source)
    nlp = spacy.load(spacy_model)
    # Guess the text classifier component if it wasn't set explicitly
    if not component:
        component = "textcat" if "textcat" in nlp.pipe_names else "textcat_multilabel"
    pipe_config = nlp.get_pipe_config(component)
    exclusive = pipe_config.get("model", {}).get("exclusive_classes", True)
    labels = label
    if not labels:
        labels = nlp.pipe_labels.get(component, [])
    # Model wrapper that scores incoming tasks so the stream can be sorted
    model = TextClassifier(nlp, labels, component)
    def add_suggestions(stream):
        texts = ((eg["text"], eg) for eg in stream)
        for doc, eg in nlp.pipe(texts, as_tuples=True, batch_size=10):
            task = copy.deepcopy(eg)
            options = []
            selected = []
            # Add one choice option per label and pre-select everything
            # that scores at or above the threshold
            for cat, score in doc.cats.items():
                if cat in labels:
                    options.append({"id": cat, "text": cat, "meta": f"{score:.2f}"})
                    if score >= threshold:
                        selected.append(cat)
            task["options"] = options
            task["accept"] = selected
            yield task
    def make_update(answers):
        examples = []
        for eg in answers:
            if eg["answer"] == "accept":
                # Turn the selected options into a cats dict with 0/1 values
                selected = eg.get("accept", [])
                cats = {
                    opt["id"]: 1.0 if opt["id"] in selected else 0.0
                    for opt in eg.get("options", [])
                }
                doc = nlp.make_doc(eg["text"])
                examples.append(Example.from_dict(doc, {"cats": cats}))
        nlp.update(examples)
    stream = add_suggestions(stream)
    # The model wrapper scores each task a second time and yields
    # (score, example) tuples, which the sorter uses to put the most
    # uncertain predictions first
    stream = prefer_uncertain(model(stream), algorithm="ema")
    return {
        "view_id": "choice",  # Annotation interface to use
        "dataset": dataset,  # Name of dataset to save annotations
        "stream": stream,  # Incoming stream of examples
        "update": make_update if update else None,
        "exclude": exclude,  # List of dataset names to exclude
        "config": {  # Additional config settings
            "labels": labels,
            "choice_style": "single" if exclusive else "multiple",
        },
    }
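For reference, assuming the recipe is saved as recipe.py and the source is a JSONL file with one {"text": "..."} object per line (all names and paths below are hypothetical), it can be started with something like:

prodigy textcat.correct my_dataset ./my_textcat_model ./examples.jsonl --label SPORTS,POLITICS --update -F recipe.py

The base model needs a trained textcat or textcat_multilabel component, since the recipe reads its pipe config and labels. With --update set, make_update runs on every answered batch, so the pre-selected options and the uncertainty sorting gradually reflect the corrections.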