textcat.batch-train error "operands could not be broadcast together..."

Hi!

I was trying to use the textcat.batch-train recipe with the model I had used for the annotations and the dataset I was saving my annotations to:
textcat.batch-train data_projects_pretrained_v18 model_projects_pretrained_v18 --output model_projects_pretrained_v18_v2

But there must be something wrong because it is giving me this error:

Loaded model model_projects_pretrained_v18
Using 50% of examples (332) for evaluation
Using 100% of remaining examples (333) for training
Dropout: 0.2  Batch size: 10  Iterations: 10  

#            LOSS         F-SCORE      ACCURACY  
Traceback (most recent call last):                                                                                                                                                                          
  File "/home/ubuntu/miniconda2/envs/prodigyenv/lib/python3.5/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/home/ubuntu/miniconda2/envs/prodigyenv/lib/python3.5/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home/ubuntu/miniconda2/envs/prodigyenv/lib/python3.5/site-packages/prodigy/__main__.py", line 380, in <module>
    controller = recipe(*args, use_plac=True)
  File "cython_src/prodigy/core.pyx", line 212, in prodigy.core.recipe.recipe_decorator.recipe_proxy
  File "/home/ubuntu/.local/lib/python3.5/site-packages/plac_core.py", line 328, in call
    cmd, result = parser.consume(arglist)
  File "/home/ubuntu/.local/lib/python3.5/site-packages/plac_core.py", line 207, in consume
    return cmd, self.func(*(args + varargs + extraopts), **kwargs)
  File "/home/ubuntu/miniconda2/envs/prodigyenv/lib/python3.5/site-packages/prodigy/recipes/textcat.py", line 255, in batch_train
    loss += model.update(batch, revise=False, drop=dropout)
  File "cython_src/prodigy/models/textcat.pyx", line 235, in prodigy.models.textcat.TextClassifier.update
  File "cython_src/prodigy/models/textcat.pyx", line 252, in prodigy.models.textcat.TextClassifier._update
  File "pipes.pyx", line 932, in spacy.pipeline.pipes.TextCategorizer.update
  File "pipes.pyx", line 960, in spacy.pipeline.pipes.TextCategorizer.get_loss
ValueError: operands could not be broadcast together with shapes (8,10) (8,8) 

Do you have an idea of what could be wrong here?

Thanks!

Hi,

It seems to me that you're passing textcat.batch-train an input model that already contains a text classification component, and your dataset has extra labels that aren't in that model, which leads to the error. If this is the case, then I think we could probably do something to detect it and raise a better error.
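If you want to double-check, you can compare the labels the existing model knows against the labels stored in your dataset. Something like the following should work, though it's only a sketch: it assumes spaCy v2's pipe API, that the model directory is the one from your command, and that your annotations carry a top-level "label" field, as textcat.teach tasks do.

# print the labels the existing textcat component was trained with
python -c "import spacy; print(spacy.load('model_projects_pretrained_v18').get_pipe('textcat').labels)"
# print the labels present in the Prodigy dataset
python -c "from prodigy.components.db import connect; print({eg.get('label') for eg in connect().get_dataset('data_projects_pretrained_v18')})"

If the second set contains labels missing from the first, that mismatch is what produces the broadcast error.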

The solution is to train from a model that doesn't already have a text classification component. At the moment you're passing in model_projects_pretrained_v18 as the base model. Instead, if you're working with English, you might use en_vectors_web_lg as the base model, or perhaps download language-specific vectors from https://fasttext.cc and convert them with the spacy init-model command.
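End to end, the commands might look roughly like this. This is only a sketch: the vectors filename, the language code and the output paths are placeholders you'd replace with your own.

# convert downloaded fastText vectors into a fresh spaCy model (skip this step if you use en_vectors_web_lg)
python -m spacy init-model en ./base_model --vectors-loc cc.en.300.vec.gz
# retrain from that vectors-only base model with your full dataset
prodigy textcat.batch-train data_projects_pretrained_v18 ./base_model --output model_projects_pretrained_v18_v2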

You usually don't want to train a model "on top" of an already existing one if it's possible to instead train from a freshly initialised one. Training from an existing model is hard to reason about and hard to replicate, because you're starting from a fairly arbitrary intermediate state. The replication problem is especially important: if you're always training on top of a previous model, then in order to replicate your work you would have to retrain in several steps, each time taking care to use exactly the same data you used before. This will probably be infeasible, which means you're likely to end up with a model you can't recreate.

Thanks for your help!

I was confused because I thought that I had to train the model again and again.

What I am doing is:

  1. Create a first model with textcat.batch-train using existing labelled data and en_vectors_web_lg. Let's say the output model for this recipe is model_pretrained_v18.
  2. I annotate manually using textcat.teach and model_pretrained_v18.
  3. I know model_pretrained_v18 has been updated in the loop during the previous step, but I thought I had to run textcat.batch-train again anyway to get a more accurate model. That's why I am running textcat.batch-train data_projects_pretrained_v18 model_projects_pretrained_v18 --output model_projects_pretrained_v18_v2,
    because I wanted to go back to step 2 and annotate again with the resulting model_projects_pretrained_v18_v2.

If I understood properly, the model is being updated while I am annotating, so I don't need to run textcat.batch-train again, let's say after 1000 annotations, to get a more accurate model to continue annotating with this "new version".

If I go back and train the model using en_vectors_web_lg, it's like I'm losing the updates the model received while I was annotating, isn't it?

Thanks for clarifying!

So long as you're training the model with all of the data, there's no need to worry about "losing the updates". It's fine to train again from the initial state, just with more data. The model will relearn what it had previously, but with the extra examples as well.

You can think of this the same way as learning from a large benchmark corpus like MNIST or the IMDB data. You wouldn't train those models by first learning on 10% of the data, and then on the next 10%, etc. Instead you just start from a random model and make several passes over the whole dataset. The same idea applies when you're running textcat.batch-train.

The only exception comes when you want to use information from a pretrained model and you don't have the original data available. This happens when you're adding an entity type to an NER model, for instance. But if you do have all the data, you'll want to start from a random initialisation.
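In practice the loop could look roughly like this. Again this is just a sketch: the dataset and model names come from this thread, while the source file and the --label values are placeholders you'd replace with your own.

# annotate in the loop with the current best model
prodigy textcat.teach data_projects_pretrained_v18 model_projects_pretrained_v18 my_texts.jsonl --label LABEL_A,LABEL_B
# retrain from the vectors-only base on everything collected so far
prodigy textcat.batch-train data_projects_pretrained_v18 en_vectors_web_lg --output model_projects_pretrained_v19
# use the new output model for the next round of textcat.teach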

Great! It's much clearer now.

Thanks for your help again! :slight_smile: