textcat.batch-train error "operands could not be broadcast together..."

Hi!

I was trying to use the textcat.batch-train recipe with the model I had used for the annotations and the dataset I was saving my annotations to:
textcat.batch-train data_projects_pretrained_v18 model_projects_pretrained_v18 --output model_projects_pretrained_v18_v2

But there must be something wrong because it is giving me this error:

Loaded model model_projects_pretrained_v18
Using 50% of examples (332) for evaluation
Using 100% of remaining examples (333) for training
Dropout: 0.2  Batch size: 10  Iterations: 10  

#            LOSS         F-SCORE      ACCURACY  
Traceback (most recent call last):                                                                                                                                                                          
  File "/home/ubuntu/miniconda2/envs/prodigyenv/lib/python3.5/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/home/ubuntu/miniconda2/envs/prodigyenv/lib/python3.5/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home/ubuntu/miniconda2/envs/prodigyenv/lib/python3.5/site-packages/prodigy/__main__.py", line 380, in <module>
    controller = recipe(*args, use_plac=True)
  File "cython_src/prodigy/core.pyx", line 212, in prodigy.core.recipe.recipe_decorator.recipe_proxy
  File "/home/ubuntu/.local/lib/python3.5/site-packages/plac_core.py", line 328, in call
    cmd, result = parser.consume(arglist)
  File "/home/ubuntu/.local/lib/python3.5/site-packages/plac_core.py", line 207, in consume
    return cmd, self.func(*(args + varargs + extraopts), **kwargs)
  File "/home/ubuntu/miniconda2/envs/prodigyenv/lib/python3.5/site-packages/prodigy/recipes/textcat.py", line 255, in batch_train
    loss += model.update(batch, revise=False, drop=dropout)
  File "cython_src/prodigy/models/textcat.pyx", line 235, in prodigy.models.textcat.TextClassifier.update
  File "cython_src/prodigy/models/textcat.pyx", line 252, in prodigy.models.textcat.TextClassifier._update
  File "pipes.pyx", line 932, in spacy.pipeline.pipes.TextCategorizer.update
  File "pipes.pyx", line 960, in spacy.pipeline.pipes.TextCategorizer.get_loss
ValueError: operands could not be broadcast together with shapes (8,10) (8,8) 

Do you have an idea of what could be wrong here?

Thanks!

Hi,

It seems to me that you're passing textcat.batch-train an input model that already contains a text classification component, and your dataset has extra labels that aren't in that model, which leads to the error. If this is the case, then I think we could probably do something to detect it and raise a better error.
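If you want to double-check, you can compare the labels the existing model knows against the labels stored in your dataset. Something like the following should work, though it's only a sketch: it assumes spaCy v2's pipe API, that the model directory is the one from your command, and that your annotations carry a top-level "label" field, as textcat.teach tasks do.

# print the labels the existing textcat component was trained with
python -c "import spacy; print(spacy.load('model_projects_pretrained_v18').get_pipe('textcat').labels)"
# print the labels present in the Prodigy dataset
python -c "from prodigy.components.db import connect; print({eg.get('label') for eg in connect().get_dataset('data_projects_pretrained_v18')})"

If the second set contains labels missing from the first, that mismatch is what produces the broadcast error.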

The solution is to train from a model that doesn't already have a text classification component. At the moment you're passing in model_projects_pretrained_v18 as the base model. Instead, if you're working with English, you might use en_vectors_web_lg as the base model, or perhaps download language-specific vectors from https://fasttext.cc and convert them with the spacy init-model command.
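End to end, the commands might look roughly like this. This is only a sketch: the vectors filename, the language code and the output paths are placeholders you'd replace with your own.

# convert downloaded fastText vectors into a fresh spaCy model (skip this step if you use en_vectors_web_lg)
python -m spacy init-model en ./base_model --vectors-loc cc.en.300.vec.gz
# retrain from that vectors-only base model with your full dataset
prodigy textcat.batch-train data_projects_pretrained_v18 ./base_model --output model_projects_pretrained_v18_v2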

You usually don't want to train a model "on top" of an already existing one if it's possible to instead train from a freshly initialised one. Training from an existing model is hard to reason about and hard to replicate, because you're starting from a fairly arbitrary intermediate state. The replication problem is especially important: if you're always training on top of a previous model, then in order to replicate your work you would have to retrain in several steps, each time taking care to use exactly the same data you used before. This will probably be infeasible, which means you're likely to end up with a model you can't recreate.

Thanks for your help!

I was confused because I thought that I had to train the model again and again.

What I am doing is:

  1. Create a first model with textcat.batch-train using existing labelled data and en_vectors_web_lg. Let's say the output model for this recipe is model_pretrained_v18.
  2. I annotate manually using textcat.teach and model_pretrained_v18.
  3. I know model_pretrained_v18 has been updated in the loop during the previous step, but I thought I had to run textcat.batch-train again anyway to get a more accurate model. That's why I am running textcat.batch-train data_projects_pretrained_v18 model_projects_pretrained_v18 --output model_projects_pretrained_v18_v2,
    because I wanted to go back to step 2 and annotate again with the resulting model_projects_pretrained_v18_v2.

If I understood properly, the model is being updated while I am annotating, so I don't need to run textcat.batch-train again, let's say after 1000 annotations, to get a more accurate model to continue annotating with this "new version".

If I go back and train the model using en_vectors_web_lg, it's like I'm losing the updates the model received while I was annotating, isn't it?

Thanks for clarifying!

So long as you're training the model with all of the data, there's no need to worry about "losing the updates". It's fine to train again from the initial state, just with more data. The model will relearn what it had previously, but with the extra examples as well.

You can think of this the same way as learning from a large benchmark corpus like MNIST or the IMDB data. You wouldn't train those models by first learning on 10% of the data, and then on the next 10%, etc. Instead you just start from a random model and make several passes over the whole dataset. The same idea applies when you're running textcat.batch-train.

The only exception comes when you want to use information from a pretrained model and you don't have the original data available. This happens when you're adding an entity type to an NER model, for instance. But if you do have all the data, you'll want to start from a random initialisation.
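In practice the loop could look roughly like this. Again this is just a sketch: the dataset and model names come from this thread, while the source file and the --label values are placeholders you'd replace with your own.

# annotate in the loop with the current best model
prodigy textcat.teach data_projects_pretrained_v18 model_projects_pretrained_v18 my_texts.jsonl --label LABEL_A,LABEL_B
# retrain from the vectors-only base on everything collected so far
prodigy textcat.batch-train data_projects_pretrained_v18 en_vectors_web_lg --output model_projects_pretrained_v19
# use the new output model for the next round of textcat.teach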

Great! It's much clearer now.

Thanks for your help again! :slight_smile: