Thanks for the background!
I assume that you can run `--textcat-multilabel`, right?

```
python -m prodigy train ebf/ --textcat-multilabel ebf_0,ebf_1,ebf_2 --eval-split 0.2 --base-model en_core_web_lg
```
I definitely understand that this is a bit confusing (why would binary classification use `textcat_multilabel`?), so maybe such a helper function could do the trick. As the thread above explains, some of this comes from how spaCy handles text classification, and Prodigy tries to match up with it.
A bit of a hack, but another approach would be to modify the original `.jsonl` / dataset: take every example with `"answer": "reject"`, change its label to `"not_economy_business_finance"`, and switch it to `"answer": "accept"`. So in this case, if you want to convert the original labels from a Prodigy dataset (`samp-textcat`) into a new Prodigy dataset (`samp-textcat-new`), you can run this:
```python
from prodigy.components.db import connect

# pull examples from the existing dataset
db = connect()
examples = db.get_dataset("samp-textcat")

# turn rejects into accepted "not_" labels; accepts pass through unchanged
new_examples = []
for eg in examples:
    if eg["answer"] == "reject":
        eg["label"] = "not_" + str(eg["label"])
        eg["answer"] = "accept"
    new_examples.append(eg)

# create the new Prodigy dataset and add the converted examples
db.add_dataset("samp-textcat-new", session=True)
db.add_examples(new_examples, datasets=["samp-textcat-new"])
```
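If you want to sanity-check what the loop does before touching the database, here's the same transformation applied to an in-memory record (the example dict below is made up for illustration, and the helper name is mine, not part of Prodigy):

```python
def reject_to_not(eg):
    """Turn a rejected binary annotation into an accepted 'not_' label."""
    eg = dict(eg)  # copy so the original record stays untouched
    if eg.get("answer") == "reject":
        eg["label"] = "not_" + str(eg["label"])
        eg["answer"] = "accept"
    return eg

# made-up example record in the shape of a Prodigy textcat annotation
example = {
    "text": "Stocks fell sharply.",
    "label": "economy_business_finance",
    "answer": "reject",
}
converted = reject_to_not(example)
print(converted["label"])   # not_economy_business_finance
print(converted["answer"])  # accept
```

Accepted examples pass through unchanged, so running the conversion twice is harmless.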
I tried this, ran it on `samp-textcat-new`, and I could then use `train --textcat`, where previously I could only use `train --textcat-multilabel`. Does this work for you?
If you wanted to avoid this from the start, you'd need two labels in your annotation, which you can get by adding a second label with a `not_` prefix:

```
python -m prodigy textcat.teach ebf_0 en_core_web_sm ./data/articles_0.jsonl --label economy_business_finance,not_economy_business_finance --patterns ./ebf/patterns.jsonl
```
Thanks again for the question! I think there could be small improvements in the future to avoid this problem.