textcat vs textcat_multilabel

Thanks for the background!

I assume that you can run --textcat-multilabel, right?

python -m prodigy train ebf/ --textcat-multilabel ebf_0,ebf_1,ebf_2 --eval-split 0.2 --base-model en_core_web_lg

I definitely understand that this is a bit confusing (why would binary classification use textcat_multilabel?), so maybe a helper function like that could do the trick. As the thread above explains, some of this is due to how spaCy handles text classification and how Prodigy tries to match up to it: spaCy's exclusive textcat component expects every label to be annotated for each example, with exactly one positive label, while textcat_multilabel accepts a single label that is simply present or absent.
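
To make the spaCy side a bit more concrete, here is a rough sketch of the category annotations the two components expect (the label names are just the ones from your project, and doc.cats is how spaCy stores category scores):

import spacy

nlp = spacy.blank("en")
doc = nlp("Example article about interest rates.")

# textcat (exclusive): every example carries a score for each label,
# and exactly one of them is positive
doc.cats = {"economy_business_finance": 1.0, "not_economy_business_finance": 0.0}

# textcat_multilabel (non-exclusive): a single label can be present or
# absent on its own, independently of any other labels
doc.cats = {"economy_business_finance": 0.0}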

A bit of a hack, but another approach would be to modify the original .jsonl / dataset so that every example with "answer": "reject" becomes "answer": "accept" with its label changed to "not_economy_business_finance". So if you want to convert the labels from an existing Prodigy dataset (samp-textcat) into a new Prodigy dataset (samp-textcat-new), you can run this:

from prodigy.components.db import connect

# pull examples from dataset
db = connect()
examples = db.get_dataset("samp-textcat")

# convert rejects into accepts with a "not_" label
new_examples = []
for eg in examples:
    if eg["answer"] == "reject":
        eg["label"] = "not_" + str(eg["label"])
        eg["answer"] = "accept"
    new_examples.append(eg)

# create new Prodigy dataset
db.add_dataset("samp-textcat-new", session=True)
db.add_examples(new_examples, datasets=["samp-textcat-new"])
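
If you want to sanity-check the converted dataset before training, one quick way (using the same DB helpers as above) is to count the labels it now contains:

from collections import Counter
from prodigy.components.db import connect

# count how many examples carry each label after the conversion
db = connect()
converted = db.get_dataset("samp-textcat-new")
print(Counter(eg.get("label") for eg in converted))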

I tried this, and with samp-textcat-new I could run train --textcat, whereas previously I could only run train --textcat-multilabel. Does this work for you?
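
In case it helps, the training command on the converted dataset would look something like this (same output folder, eval split and base model as above, so adjust for your setup):

python -m prodigy train ebf/ --textcat samp-textcat-new --eval-split 0.2 --base-model en_core_web_lg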

If you want to avoid this from the start, you need to have two labels in your annotation, which you can get by simply adding a second label with a not_ prefix:

python -m prodigy textcat.teach ebf_0 en_core_web_sm ./data/articles_0.jsonl --label economy_business_finance,not_economy_business_finance --patterns ./ebf/patterns.jsonl

Thanks again for the question! I think there could be small improvements in the future to avoid this problem.