Custom spaCy pipe for Prodigy view

I have a text classification model that was built elsewhere (via sklearn) and would like to view its output in Prodigy using a custom recipe.

import pickle

import spacy
from prodigy import recipe
from prodigy.components.loaders import get_stream
from prodigy.core import recipe_args  # location as of Prodigy v1.x
from prodigy.models.textcat import TextClassifier


@recipe(
    "custom_textcat.demo",
    dataset=recipe_args["dataset"],
    source=recipe_args["source"],
)
def demo(dataset, source):
    nlp = spacy.load("en_core_web_md", disable=["ner", "parser"])
    with open("textcat.pkl", "rb") as file:
        textcat_model = pickle.load(file)

    def textcat_pipe(doc):
        # sklearn's decision_function returns one row of raw margins per input
        scores = textcat_model.decision_function([doc.vector])
        for i, cat in enumerate(textcat_model.classes_):
            doc.cats[cat] = float(scores[0][i])  # cast numpy float for JSON
        return doc

    nlp.add_pipe(textcat_pipe, name="custom_textcat")

    model = TextClassifier(nlp, textcat_model.classes_)
    stream = get_stream(source, rehash=True, dedup=True, input_key="text")
    stream = (eg for score, eg in model(stream))

    return {
        "view_id": "classification",
        "dataset": dataset,
        "stream": stream,
        # classes_ is a numpy array; the config expects a plain list
        "config": {"lang": nlp.lang, "labels": list(textcat_model.classes_)},
    }

The recipe runs, but the predictions presented in the web app seem unrelated to the model's actual predictions. Is there an example of defining a custom TextClassifier?

(As a workaround, I generate a pre-annotated JSONL file from the model and use prodigy mark to accomplish the same thing for now.)
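Roughly like this, where my_dataset and predictions.jsonl are placeholders for my actual names:

prodigy mark my_dataset predictions.jsonl --view-id classification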

Thanks!

Hi! Your recipe looks good so far :slightly_smiling_face: I think what might be happening here is that the TextClassifier (Prodigy's annotation model) returns a scored stream of all predictions, which you can then filter using a sorter like prefer_high_scores, prefer_uncertain etc. If you're not doing that, you're just seeing an unfiltered stream of everything, which isn't that useful.
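For example, instead of unpacking the (score, example) tuples yourself like you do in stream = (eg for score, eg in model(stream)), you could pass the scored stream straight into a sorter:

from prodigy.components.sorters import prefer_high_scores

# model(stream) yields (score, example) tuples; the sorter consumes
# them and yields only the examples it considers worth annotating
stream = prefer_high_scores(model(stream))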

Instead of creating the JSONL file, you could also just do this in the recipe: run the nlp object over the "text" of each incoming example and add a "label" to the task depending on the score. I don't know which metric you used to decide whether a label applies, or whether you have multiple labels. But basically, just do whatever you're doing in the script that generates your JSONL. For instance:

import copy

def get_stream(stream):
    for eg in stream:
        doc = nlp(eg["text"])
        for cat, score in doc.cats.items():
            if score > 0.5:
                # one task per label that clears the threshold
                task = copy.deepcopy(eg)
                task["label"] = cat
                task["meta"] = {"score": float(score)}  # keep JSON-serializable
                yield task

And to make it more efficient, you can use nlp.pipe with as_tuples to process the texts and example dicts as a stream:

eg_tuples = ((eg["text"], eg) for eg in stream)
for doc, eg in nlp.pipe(eg_tuples, as_tuples=True):
    # as_tuples=True yields (Doc, context) pairs, so unpack both
    # etc.
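Putting both together, the whole generator could look like this (a sketch, still assuming the 0.5 threshold from above; swap in whatever metric you actually use):

import copy

def get_stream(stream):
    eg_tuples = ((eg["text"], eg) for eg in stream)
    for doc, eg in nlp.pipe(eg_tuples, as_tuples=True):
        for cat, score in doc.cats.items():
            if score > 0.5:
                task = copy.deepcopy(eg)
                task["label"] = cat
                task["meta"] = {"score": float(score)}
                yield task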

Hi Ines,

Thanks for the response. I was going to use prefer_high_scores (to impress my colleagues :laughing:). But I thought not filtering would be a fairer demonstration of what it would be like in production, should the model be deployed. Am I right?

One problem I saw was that the scores shown in the web interface were not the scores the model should be producing. I thought they would be a softmax of the model's scores, is that correct? (And those scores are the ones used by the filtering functions, right?)
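For instance, since decision_function returns raw margins, I assumed something like this would be needed in the pipe component before the scores end up in doc.cats (using scipy.special.softmax, which adds scipy as a dependency):

from scipy.special import softmax

def textcat_pipe(doc):
    # raw margins, one row per input text
    margins = textcat_model.decision_function([doc.vector])[0]
    # squash the margins into a probability-like distribution over classes
    for cat, prob in zip(textcat_model.classes_, softmax(margins)):
        doc.cats[cat] = float(prob)
    return doc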

Lastly, thanks for the workaround. That is better than making a JSONL file every time the model is updated. :sweat_smile: I will give it a try!