How can I train a textcat model with thousands of labels?

Astral1020 · June 20, 2019, 7:53am

I am sorry, I am not a native English speaker, so there may be a lot of mistakes. I hope you can forgive me.
In my case, I need to build a model to classify articles. The example only has 1 label, but I need thousands of labels.

Because I have so many labels, here is what I do:

1. Build a JSONL file:

{"text":"article_text1","label":"label1","answer":"reject"}
{"text":"article_text1","label":"label2","answer":"accept"}
{"text":"article_text1","label":"label3","answer":"accept"}
{"text":"article_text2","label":"label1","answer":"reject"}
{"text":"article_text2","label":"label2","answer":"accept"}
{"text":"article_text2","label":"label3","answer":"reject"}
......

2. Import it into a dataset.
3. Run "textcat.batch-train".

The first time, I imported 49,760 rows from the JSONL file into a dataset, covering 80 labels.
The new model looks good, but now there is a new problem: I need to add more labels to the model.
If I keep adding rows to the same dataset, it will become huge, and "textcat.batch-train" will be very slow. With the full set of labels, the amount of data would be a disaster.
If I use a new dataset with the already trained model, it raises an exception:
"ValueError: operands could not be broadcast together with shapes"

How can I iterate on this? I need to keep adding labels.

ines · June 20, 2019, 8:13am

Hi and no problem!

Prodigy's batch-train recipes are better for small and quick experiments. For steps 1 to 3, it's easier to train the model directly in spaCy. Then you don't have to think about the Prodigy data format. See here for a code example: Training Pipelines & Models · spaCy Usage Documentation

Updating a model in spaCy looks like this. This is much easier with 80+ labels:

# `nlp` is an existing spaCy pipeline that already has a textcat component
texts = ["a text"]
annotations = [{"cats": {"LABEL1": True, "LABEL2": False, "LABEL3": True}}]  # and so on...
nlp.update(texts, annotations)

Here's one idea for a solution:

1. Use Prodigy and textcat.manual to create more annotations.
2. Use db-out to export the annotations to a file.
3. Convert them to the same format as your other annotations.
4. Add them to your spaCy training set.
5. Train a new model in spaCy.
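
The conversion step could look roughly like the sketch below. It assumes the exported records carry the "text", "label" and "answer" fields shown in the question; the function name prodigy_to_cats is hypothetical and not part of Prodigy or spaCy. Rows for the same text are merged into one "cats" dictionary, with "accept" mapped to True and "reject" to False. Rows with any other answer are skipped.

```python
import json
from collections import OrderedDict


def prodigy_to_cats(lines):
    """Merge one-label-per-row records into (text, {"cats": {...}}) pairs.

    `lines` is any iterable of JSONL strings (a list or an open file).
    This helper is a hypothetical illustration, not a Prodigy or spaCy API.
    """
    grouped = OrderedDict()  # keeps texts in first-seen order
    for line in lines:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        answer = record.get("answer")
        if answer not in ("accept", "reject"):
            continue  # skip rows that were ignored or never answered
        cats = grouped.setdefault(record["text"], {})
        cats[record["label"]] = answer == "accept"
    return [(text, {"cats": cats}) for text, cats in grouped.items()]


# The sample rows from the question, used here as stand-in data.
sample = [
    '{"text":"article_text1","label":"label1","answer":"reject"}',
    '{"text":"article_text1","label":"label2","answer":"accept"}',
    '{"text":"article_text1","label":"label3","answer":"accept"}',
    '{"text":"article_text2","label":"label1","answer":"reject"}',
    '{"text":"article_text2","label":"label2","answer":"accept"}',
    '{"text":"article_text2","label":"label3","answer":"reject"}',
]
train_data = prodigy_to_cats(sample)

# Split into the two parallel lists used in the update example above.
texts, annotations = zip(*train_data)
```

To read from an exported file instead, pass the open file object to the same function. Labels that were never annotated for a given text are left out of its "cats" dictionary rather than guessed; whether to fill them in as False is a decision that depends on your data.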

Also remember that it's currently not possible to add new labels to a pre-trained text classification model. See this spaCy issue for details. This means that you always need to train your model from scratch with all examples. Maybe this was also the reason for the error you saw.

Astral1020 · June 20, 2019, 9:16am

Thanks for the reply, I will try your idea.
