spacy pretrain TypeError

Hi,

I am trying to train an NER model using Prodigy annotations and spaCy. This is the spacy validate output for your reference:

====================== Installed models (spaCy v2.2.4) ======================
ℹ spaCy installation:
/home/dileep/miniconda3/envs/myenv/lib/python3.7/site-packages/spacy

TYPE      NAME                MODEL               VERSION                            
package   en-vectors-web-lg   en_vectors_web_lg   2.1.0   ✔
package   en-core-web-sm      en_core_web_sm      2.2.5   ✔
package   en-core-web-md      en_core_web_md      2.2.5   ✔

I used the following command for pretraining:

python -m spacy pretrain ./dx_med.jsonl en_vectors_web_lg models/ --use-vectors

But I am getting this strange error

  File "/home/dileep/miniconda3/envs/myenv/lib/python3.7/site-packages/spacy/cli/pretrain.py", line 235, in pretrain
    min_length=min_length,
  File "/home/dileep/miniconda3/envs/myenv/lib/python3.7/site-packages/spacy/cli/pretrain.py", line 286, in make_docs
    doc = Doc(nlp.vocab, words=words)
  File "doc.pyx", line 223, in spacy.tokens.doc.Doc.__init__
TypeError: Expected unicode, got dict

I tried searching the internet and the Prodigy/spaCy support forums, but no one else seems to have had a similar error. I would greatly appreciate it if someone could help troubleshoot this.

Thank you.

I wanted to add this:

prodigy ner.manual dx_med en_core_web_sm ./data.json --label LABEL1, LABEL2, LABEL3, LABEL4

This is the command I used for annotating the dataset. Here I have used

en_core_web_sm

while for pretraining I'm using en_vectors_web_lg

Could this be causing the problem?

################
Follow up:

I tried annotating a few samples using the blank:en model instead of en_core_web_sm. I am still getting the error.

The error message indicates that the pretrain command is trying to create Doc objects from texts that are not strings. So my guess is that the data you're loading for pretraining might have the wrong format? You can see an example of the expected format here: https://spacy.io/api/cli#pretrain-jsonl
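One possible cause, assuming dx_med.jsonl is a raw Prodigy export: the traceback shows pretrain building each Doc from the record's "tokens" value, and Prodigy's manual NER annotations usually store "tokens" as a list of dicts (text, start, end, id) rather than a list of strings. A minimal sketch of a workaround is to write a new file that keeps only the "text" key, so pretrain tokenizes the raw text itself. The function names and the output filename below are made up for illustration.

```python
import json


def to_pretrain_records(lines):
    """Yield {"text": ...} dicts from raw JSONL lines, dropping all other keys.

    Assumes each non-empty line is a JSON object that may carry a "text"
    string plus extra keys such as "tokens" or "spans".
    """
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        text = record.get("text")
        # Keep only records with a non-empty string under "text".
        if isinstance(text, str) and text:
            yield {"text": text}


def convert(src_path, dst_path):
    """Rewrite a JSONL file so that every line is just {"text": "..."}."""
    with open(src_path, encoding="utf8") as src, \
            open(dst_path, "w", encoding="utf8") as dst:
        for record in to_pretrain_records(src):
            dst.write(json.dumps(record) + "\n")
```

Calling convert("dx_med.jsonl", "dx_med_pretrain.jsonl") and then passing the new file to spacy pretrain in place of the original should avoid the "tokens" branch. If the error persists, inspecting the first line of the input file would show which keys and value types it actually contains.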
