How can I train a textcat model with thousands of labels?

I'm sorry, I'm not a native English speaker, so there may be lots of mistakes. I hope you'll forgive me.
In my case, I need to build a model to classify articles. The example has only 1 label, but I need thousands of labels.

Because I have so many labels, here is what I do:

1. Build a JSONL file.
2. Import it into a dataset.
3. Run "textcat.batch-train".

The first time, I imported 49,760 rows of data from JSONL into a dataset covering 80 labels.
The new model looks good, but a new problem came up: I need to add more labels to the model.
If I keep adding rows to the same dataset, it will become huge,
"textcat.batch-train" will be very slow, and retraining on the full set of labels every time will be a disaster.
If I use a new dataset with the already trained model, it raises an exception:
"ValueError: operands could not be broadcast together with shapes"

How can I do this iteratively? I need to keep adding labels continuously.

Hi and no problem! :slight_smile:

Prodigy's batch-train recipes are best suited to small and quick experiments. For steps 1 to 3, it's easier to train the model directly in spaCy, so you don't have to think about the Prodigy data format.

Updating a model in spaCy looks like this, which is much more manageable with 80+ labels:

texts = ["a text"]
annotations = [{"cats": {"LABEL1": True, "LABEL2": False, "LABEL3": True}}]  # and so on...
nlp.update(texts, annotations)  # assumes nlp already has a textcat pipe with all labels added

Here's one idea for a solution:

  1. Use Prodigy and textcat.manual to create more annotations.
  2. Use db-out to export the annotations to a file.
  3. Convert them to the same format as your other annotations.
  4. Add them to your spaCy training set.
  5. Train a new model in spaCy.
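For step 3, a minimal sketch of what the conversion could look like, assuming the typical shape of `db-out` output for textcat annotations: one JSON object per line with `"text"`, `"label"` and `"answer"` fields. The `convert` helper and the exact field names are assumptions to illustrate the idea, not part of Prodigy's API:

```python
import json

def convert(lines, all_labels):
    """Turn Prodigy-style JSONL annotations into (text, {"cats": ...}) pairs."""
    examples = []
    for line in lines:
        eg = json.loads(line)
        # Start with every label marked negative, then flip the accepted one.
        cats = {label: 0.0 for label in all_labels}
        if eg.get("answer") == "accept":
            cats[eg["label"]] = 1.0
        examples.append((eg["text"], {"cats": cats}))
    return examples

lines = ['{"text": "some article", "label": "SPORTS", "answer": "accept"}']
print(convert(lines, ["SPORTS", "POLITICS"]))
# → [('some article', {'cats': {'SPORTS': 1.0, 'POLITICS': 0.0}})]
```

You'd adjust this to however your existing training set stores its annotations, so both sources end up in the same format.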

Also remember that it's currently not possible to add new labels to a pre-trained text classification model. See this spaCy issue for details. This means that you always need to train your model from scratch with all examples. Maybe this was also the reason for the error you saw.

Thanks for the reply, I'll try your idea.