Error while trying to train: 'utf-8' codec can't decode

I am running the following code to train:

!Python -m prodigy train ner --textcat-multilabel food_annotations --base-model en_vectors_web_lg --init-tok2vec ./tok2vec_cd8_model289.bin --output ./tmp_model --eval-split 0.2

food_annotations is the annotations dataset created by Prodigy.

When running it, I am getting this error:

I'm not sure exactly where the error is coming from. I would really appreciate any help!

Hi @saad.moosa,

The UnicodeDecodeError usually happens when there is something unusual in your terminal encoding settings. Under the hood, Python tries to decode text according to UTF-8 rules, and when a particular byte doesn't follow those rules, it throws this error.
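To illustrate the mechanism, here is a minimal sketch in plain Python. It is not what Prodigy does internally, and the sample string is made up for the example. It only shows how bytes written in another encoding trigger the same kind of message when they are decoded as UTF-8.

```python
# "é" is the single byte 0xE9 in Latin-1, which is not a valid UTF-8 sequence on its own.
raw = "café".encode("latin-1")  # illustrative data only

try:
    raw.decode("utf-8")
    error_message = None
except UnicodeDecodeError as err:
    # The message names the codec, the offending byte and its position.
    error_message = str(err)

print(error_message)

# Decoding with the encoding the bytes were actually written in succeeds.
text = raw.decode("latin-1")
print(text)
```

Binary files, such as pretrained weights, contain arbitrary bytes, so reading one as UTF-8 text can fail in the same way.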

Also, you might want to check your command again. If you're training a NER model, you probably want to do something like this:

prodigy train --ner <NER dataset> \
              --textcat-multilabel <TCM dataset> \
              --eval-split 0.2 \
              # your other config ... 
              # <OUTPUT_DIR>

Maybe it errors out because the command is inadvertently passing a non-UTF-8 file? To be sure, you can check prodigy train --help for more information.

Thank you! You were right, the mistake was in the command. I have reformatted the command to:

!Python -m prodigy train ./tmp_model --ner food_annotations --base-model en_core_web_lg --eval-split 0.25

and it worked.
I looked through the documentation to see if I could add pretrained token-to-vector weights (from spacy pretrain), but I could not find any guidance on how to add them to the command. Can you advise me on this? I saw that the old way was to add "init-tok2vec ./tok2vec_cd8_model289.bin" to the command, but this does not seem to work now.

It now goes into your spaCy configuration file, specifically under the [initialize] section. You can then pass that config via the --config parameter of the train command. The benefit is that your initialization step and other parameters all live in one file, so you don't need to pass as much in your CLI command.
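As a rough sketch, assuming a spaCy v3 style config and weights produced by a spacy pretrain run whose tok2vec settings match the pipeline, the relevant part of the file might look like the fragment below. The file name config.cfg is hypothetical, the weights path is the one from your earlier command, and the rest of the config is omitted.

```ini
# Fragment only: a full training config has more sections ([nlp], [components], [training], ...).

[paths]
# Path to the weights written by spacy pretrain (taken from the earlier command).
init_tok2vec = "./tok2vec_cd8_model289.bin"

[initialize]
# Load the pretrained tok2vec weights when the pipeline is initialized.
init_tok2vec = ${paths.init_tok2vec}

# Assumed usage, with config.cfg as a placeholder file name:
#   prodigy train ./tmp_model --ner food_annotations --config ./config.cfg --eval-split 0.25
```

Keeping the path under [paths] and referencing it from [initialize] means you only have to change it in one place if the weights file moves.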

Thank you! This helped a lot!