Basic question about Prodigy annotations and model training.

@ines,

I am using Prodigy to train models for new entities, text categories, etc. I would like to keep the ‘en’ language model’s existing information and augment it with my domain-specific (networking) annotations and training, primarily because I need the model to retain its original POS tagging capabilities.

When I use Prodigy to annotate, let us say, a new text category (which is what I am doing right now, hence the example), it annotates, trains and saves the model to /tmp/model. In doing so, it loses all the intelligence it had, and I no longer see the POS tagging capability of the ‘en’ model. I know we can disable pipeline components when training in spaCy, but is there a way to prevent this from happening in Prodigy? If possible, I would like to keep whatever intelligence the base model comes with and keep adding to it with my annotations and training.

Thanks in advance. :slight_smile:

Hi! Are you using the latest version of Prodigy? In v1.6.1, the ner.batch-train recipe should restore all disabled pipeline components at the end and save out the full updated model. (During training, all other components are disabled, because we serialize a copy of the model on each epoch – so this should be as small as possible.)

In general, you can always mix and match the components, and it is usually no problem to copy them around if you need to. So you can take the parser directory from the base model, copy it over to your new model, and make sure the meta.json lists "parser" in its pipeline.
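For example, a small script along these lines should do it (the paths are just placeholders for wherever your base model package and your Prodigy output model live):

import json
import shutil
from pathlib import Path

base = Path("/path/to/en_core_web_sm/en_core_web_sm-2.0.0")  # placeholder: installed base model
target = Path("/tmp/model")                                  # placeholder: Prodigy output model

# copy the serialized component directory into the new model
shutil.copytree(base / "parser", target / "parser")

# add the component to the pipeline listed in the new model's meta.json
meta = json.loads((target / "meta.json").read_text())
if "parser" not in meta["pipeline"]:
    meta["pipeline"].append("parser")
(target / "meta.json").write_text(json.dumps(meta, indent=2))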

Hi @ines,
Sorry about the late reply. Yes, I am using v1.6.1, but I was using the textcat.batch-train recipe and it somehow does not seem to save the other components to the folder. I did what you suggested and that worked, so the problem is solved. Not sure why ner.batch-train and textcat.batch-train are behaving differently. Will get back in case of any further problems. Thanks a ton!

Hi @ines,
This is the issue I am running into now after copying over the folders from the “en” model into my model’s directory.

  File "test_code.py", line 3, in <module>
    nlp = spacy.load('/tmp/model')
  File "/Users/Abhishek/Projects/Git-Repositories/spaCy/spacy/__init__.py", line 21, in load
    return util.load_model(name, **overrides)
  File "/Users/Abhishek/Projects/Git-Repositories/spaCy/spacy/util.py", line 116, in load_model
    return load_model_from_path(Path(name), **overrides)
  File "/Users/Abhishek/Projects/Git-Repositories/spaCy/spacy/util.py", line 156, in load_model_from_path
    return nlp.from_disk(model_path)
  File "/Users/Abhishek/Projects/Git-Repositories/spaCy/spacy/language.py", line 647, in from_disk
    util.from_disk(path, deserializers, exclude)
  File "/Users/Abhishek/Projects/Git-Repositories/spaCy/spacy/util.py", line 511, in from_disk
    reader(path / key)
  File "/Users/Abhishek/Projects/Git-Repositories/spaCy/spacy/language.py", line 643, in <lambda>
    deserializers[name] = lambda p, proc=proc: proc.from_disk(p, vocab=False)
  File "pipeline.pyx", line 643, in spacy.pipeline.Tagger.from_disk
  File "/Users/Abhishek/Projects/Git-Repositories/spaCy/spacy/util.py", line 511, in from_disk
    reader(path / key)
  File "pipeline.pyx", line 625, in spacy.pipeline.Tagger.from_disk.load_model
  File "pipeline.pyx", line 534, in spacy.pipeline.Tagger.Model
ValueError: [T008] Bad configuration of Tagger. This is probably a bug within spaCy. We changed the name of an internal attribute for loading pre-trained vectors, and the class has been passed the old name (pretrained_dims) but not the new name (pretrained_vectors).

Not sure what happened, but different errors occur if I remove the Tagger from the pipeline: the parser results in a thinc error, and so on. v1.6.1's textcat.batch-train definitely does not re-enable the pipeline components. I ran the recipe after copying the folders over, and when it overwrote the model's meta.json, it removed all the other components except sbd and textcat, which is what I was training on.

I’m definitely confused here, because the code really does seem to be doing the right thing. I haven’t worked through a reproduction yet, so maybe there’s something I’m missing.

It looks to me like you’re running a local fork of spaCy. Is that intentional? What version do you have there? If it’s not intentional, try setting export PYTHONPATH="" to avoid importing from there, so that you import from the version of spaCy that should be installed alongside Prodigy.
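A quick way to double-check which spaCy actually gets imported:

import spacy

print(spacy.__version__)  # should match the version Prodigy was installed against
print(spacy.__file__)     # should point into your environment's site-packages, not the fork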

@honnibal,
Hi Matthew, yes, I am running a local fork of spaCy since I need some custom tokenization for my work. As far as I could tell, spaCy's regex matching via TOKEN_MATCH is restricted to a single regex. I have a requirement to identify a number of patterns, so I modified tokenizer.pyx in spaCy to accept an iterable being passed as TOKEN_MATCH. I should say that I comment out this change and go back to the original whenever I need to use Prodigy, because it (obviously) does not work with that change. That is the reason behind using a local fork of spaCy.

Coming to the problem: I switched to the spaCy installation in the same Python virtualenv where Prodigy is running, and I get the same error. I have also cleared out PYTHONPATH.

Traceback (most recent call last):
  File "test_code.py", line 3, in <module>
    nlp = spacy.load('/tmp/model')
  File "/Users/Abhishek/Projects/Python-Projects/Python-VEs/NM-NLP/lib/python3.6/site-packages/spacy/__init__.py", line 21, in load
    return util.load_model(name, **overrides)
  File "/Users/Abhishek/Projects/Python-Projects/Python-VEs/NM-NLP/lib/python3.6/site-packages/spacy/util.py", line 116, in load_model
    return load_model_from_path(Path(name), **overrides)
  File "/Users/Abhishek/Projects/Python-Projects/Python-VEs/NM-NLP/lib/python3.6/site-packages/spacy/util.py", line 156, in load_model_from_path
    return nlp.from_disk(model_path)
  File "/Users/Abhishek/Projects/Python-Projects/Python-VEs/NM-NLP/lib/python3.6/site-packages/spacy/language.py", line 647, in from_disk
    util.from_disk(path, deserializers, exclude)
  File "/Users/Abhishek/Projects/Python-Projects/Python-VEs/NM-NLP/lib/python3.6/site-packages/spacy/util.py", line 511, in from_disk
    reader(path / key)
  File "/Users/Abhishek/Projects/Python-Projects/Python-VEs/NM-NLP/lib/python3.6/site-packages/spacy/language.py", line 643, in <lambda>
    deserializers[name] = lambda p, proc=proc: proc.from_disk(p, vocab=False)
  File "pipeline.pyx", line 643, in spacy.pipeline.Tagger.from_disk
  File "/Users/Abhishek/Projects/Python-Projects/Python-VEs/NM-NLP/lib/python3.6/site-packages/spacy/util.py", line 511, in from_disk
    reader(path / key)
  File "pipeline.pyx", line 625, in spacy.pipeline.Tagger.from_disk.load_model
  File "pipeline.pyx", line 534, in spacy.pipeline.Tagger.Model
ValueError: [T008] Bad configuration of Tagger. This is probably a bug within spaCy. We changed the name of an internal attribute for loading pre-trained vectors, and the class has been passed the old name (pretrained_dims) but not the new name (pretrained_vectors).

It would be extremely helpful if you could point me to a place where I can write a custom tokenizer with multiple pattern matching capabilities, so that I can avoid this problem entirely.
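For what it's worth, the workaround I have been considering is to join all my patterns into a single regex with alternation and pass that as token_match when building the tokenizer, roughly like the sketch below (the two patterns are just placeholders for my real ones), but I am not sure it is the recommended approach:

import re
import spacy
from spacy.tokenizer import Tokenizer

# placeholder patterns: swap in the real domain-specific regexes
PATTERNS = [r"\d+\.\d+\.\d+\.\d+", r"[A-Za-z]+\d+/\d+"]

# token_match only accepts a single callable, so combine the patterns with alternation
combined = re.compile("|".join("(?:{})".format(p) for p in PATTERNS))

nlp = spacy.load("en_core_web_sm")
nlp.tokenizer = Tokenizer(
    nlp.vocab,
    rules=nlp.Defaults.tokenizer_exceptions,
    prefix_search=nlp.tokenizer.prefix_search,
    suffix_search=nlp.tokenizer.suffix_search,
    infix_finditer=nlp.tokenizer.infix_finditer,
    token_match=combined.match,
)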

Ah, understood. I’m pretty sure I’ll be able to help you get that working without running a fork. Let’s focus on the textcat problem first though.

Could you paste the contents of the meta.json in the output model? Also, can you run python -m spacy validate and paste the output? I just want to check that the right version of the model is there.

spaCy validate output:

  [Abhishek:~/Projects/Git-Repositories/spaCy] [NM-NLP] master(+25/-6) 4s ± python -m spacy download en
Requirement already satisfied: en_core_web_sm==2.0.0 from https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-2.0.0/en_core_web_sm-2.0.0.tar.gz#egg=en_core_web_sm==2.0.0 in /Users/Abhishek/Projects/Python-Projects/Python-VEs/NM-NLP/lib/python3.6/site-packages (2.0.0)

    Linking successful
    /Users/Abhishek/Projects/Python-Projects/Python-VEs/NM-NLP/lib/python3.6/site-packages/en_core_web_sm
    --> /Users/Abhishek/Projects/Git-Repositories/spaCy/spacy/data/en

    You can now load the model via spacy.load('en')

[Abhishek:~/Projects/Git-Repositories/spaCy] [NM-NLP] master(+25/-6) 4s ± python -m spacy validate

    Installed models (spaCy v2.0.18)
    /Users/Abhishek/Projects/Git-Repositories/spaCy/spacy

    TYPE        NAME                  MODEL                 VERSION
    package     en-core-web-sm        en_core_web_sm        2.0.0    ✔
    package     en-core-web-lg        en_core_web_lg        2.0.0    ✔
    link        en_core_web_lg        en_core_web_lg        2.0.0    ✔
    link        en_core_web_sm        en_core_web_sm        2.0.0    ✔
    link        en                    en_core_web_sm        2.0.0    ✔

meta.json contents:

{
  "lang":"en",
  "name":"model",
  "version":"0.0.0",
  "spacy_version":">=2.0.18",
  "description":"",
  "author":"",
  "email":"",
  "url":"",
  "license":"",
  "vectors":{
    "width":0,
    "vectors":0,
    "keys":0,
    "name":null
  },
  "pipeline":[
    "sbd",
    "textcat"
  ]
}

I added in “tagger”, “parser” and “ner” as @ines directed, but every time I train, the meta.json is rewritten to the above. I have not tried it with ner.batch-train, but this is with textcat.batch-train.

Thanks! I really feel like there must be something I’m missing. Could you paste the command you’re running for textcat.batch-train?

prodigy textcat.batch-train event_labels --output-model /tmp/model --eval-split 0.8

I am actually following most of the commands from the documentation. :slight_smile:

And this is what I am using to annotate a new label:

prodigy textcat.teach event_labels en_core_web_lg /Users/Abhishek/Downloads/training_dataset.txt --label LOGIN_RELATED_FAIL

Also, do you want me to create a new thread for the tokenization problem?

Aha! Okay, you’re missing the second positional argument, which specifies the input model. Try:

prodigy textcat.batch-train event_labels en_core_web_sm --output-model /tmp/model --eval-split 0.8

Without that second argument, textcat.batch-train was defaulting to a blank model, which didn’t have the POS tagger loaded.
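Once that finishes, you can sanity-check the output model with something like this (the path matches the --output-model above, and the example sentence is made up):

import spacy

nlp = spacy.load("/tmp/model")
print(nlp.pipe_names)  # should now include the tagger and parser alongside textcat

doc = nlp("The login attempt failed twice.")
print([(t.text, t.pos_) for t in doc])  # POS tags confirm the tagger was kept
print(doc.cats)                         # scores for your text categories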

Yes, I think a new thread for the tokenization problem will be best. I’ll give you some code for it, so it’s best if it’s easily searchable.

That was such a stupid mistake. I cannot believe I missed something as elementary as that. My apologies. Thank you so much and I will create a new thread for the tokenization question. You folks are amazing!
