Basic question about Prodigy annotations and model training.

@ines,

I am using Prodigy to train models for new entities, text categories, etc. I would like to keep the ‘en’ language model’s existing information and augment it with my domain-specific (networking) annotations and training, primarily because I need the model to retain its original POS tagging capabilities.

When I use Prodigy to annotate, let us say, a new text category (which is what I am doing right now, hence the example), it annotates, trains and saves the model to /tmp/model. In doing so, it loses all the intelligence it had, and I no longer see the POS tagging capability of the ‘en’ model. I know we can disable pipeline components when training in spaCy, but is there a way to prevent this from happening in Prodigy? If possible, I would like to keep whatever intelligence the base model comes with and keep adding to it with my annotations and training.

Thanks in advance. :slight_smile:

Hi! Are you using the latest version of Prodigy? In v1.6.1, the ner.batch-train recipe should restore all disabled pipeline components at the end and save out the full updated model. (During training, all other components are disabled, because we serialize a copy of the model on each epoch – so this should be as small as possible.)

In general, you can always mix and match the components, and it is usually no problem to copy them around if you need to. So you can take the parser directory from the base model, copy it over to your new model, and make sure the meta.json lists "parser" in its pipeline.
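For example, a small script along these lines should do it (the paths are just placeholders for wherever your base model package and your Prodigy output model live):

import json
import shutil
from pathlib import Path

base = Path("/path/to/en_core_web_sm/en_core_web_sm-2.0.0")  # placeholder: installed base model
target = Path("/tmp/model")                                  # placeholder: Prodigy output model

# copy the serialized component directory into the new model
shutil.copytree(base / "parser", target / "parser")

# add the component to the pipeline listed in the new model's meta.json
meta = json.loads((target / "meta.json").read_text())
if "parser" not in meta["pipeline"]:
    meta["pipeline"].append("parser")
(target / "meta.json").write_text(json.dumps(meta, indent=2))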

Hi @ines,
Sorry about the late reply. Yes, I am using v1.6.1, but I was using the textcat.batch-train recipe and it somehow does not seem to save the other components to the folder. I did what you suggested and that worked, so the problem is solved. Not sure why ner.batch-train and textcat.batch-train are behaving differently. Will get back in case of any further problems. Thanks a ton!

Hi @ines,
This is the issue I am running into now after copying over the folders from the “en” model into my model’s directory.

  File "test_code.py", line 3, in <module>
    nlp = spacy.load('/tmp/model')
  File "/Users/Abhishek/Projects/Git-Repositories/spaCy/spacy/__init__.py", line 21, in load
    return util.load_model(name, **overrides)
  File "/Users/Abhishek/Projects/Git-Repositories/spaCy/spacy/util.py", line 116, in load_model
    return load_model_from_path(Path(name), **overrides)
  File "/Users/Abhishek/Projects/Git-Repositories/spaCy/spacy/util.py", line 156, in load_model_from_path
    return nlp.from_disk(model_path)
  File "/Users/Abhishek/Projects/Git-Repositories/spaCy/spacy/language.py", line 647, in from_disk
    util.from_disk(path, deserializers, exclude)
  File "/Users/Abhishek/Projects/Git-Repositories/spaCy/spacy/util.py", line 511, in from_disk
    reader(path / key)
  File "/Users/Abhishek/Projects/Git-Repositories/spaCy/spacy/language.py", line 643, in <lambda>
    deserializers[name] = lambda p, proc=proc: proc.from_disk(p, vocab=False)
  File "pipeline.pyx", line 643, in spacy.pipeline.Tagger.from_disk
  File "/Users/Abhishek/Projects/Git-Repositories/spaCy/spacy/util.py", line 511, in from_disk
    reader(path / key)
  File "pipeline.pyx", line 625, in spacy.pipeline.Tagger.from_disk.load_model
  File "pipeline.pyx", line 534, in spacy.pipeline.Tagger.Model
ValueError: [T008] Bad configuration of Tagger. This is probably a bug within spaCy. We changed the name of an internal attribute for loading pre-trained vectors, and the class has been passed the old name (pretrained_dims) but not the new name (pretrained_vectors).

Not sure what happened, but different errors occur if I remove the Tagger from the pipeline: the parser results in a thinc error, and so on. v1.6.1's textcat.batch-train definitely does not re-enable the pipeline components. I ran the recipe after copying the folders over, and when it overwrote the model's meta.json, it removed all the other components except sbd and textcat, which is what I was training on.

I’m definitely confused here, because the code really does seem to be doing the right thing. I haven’t worked through a reproduction yet, so maybe there’s something I’m missing.

It looks to me like you’re running a local fork of spaCy. Is that intentional? What version do you have there? If it’s not intentional, try setting export PYTHONPATH="" to avoid importing from there, so that you import from the version of spaCy that should be installed alongside Prodigy.
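A quick way to double-check which spaCy actually gets imported:

import spacy

print(spacy.__version__)  # should match the version Prodigy was installed against
print(spacy.__file__)     # should point into your environment's site-packages, not the fork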

@honnibal,
Hi Matthew, yes, I am running a local fork of spaCy since I need some custom tokenization for my work. As far as I could tell, spaCy's regex matching via TOKEN_MATCH is restricted to a single regex. I have a requirement to identify a number of patterns, so I modified tokenizer.pyx in spaCy to accept an iterable being passed as TOKEN_MATCH. I should say that I comment out this change and go back to the original whenever I need to use Prodigy, because it (obviously) does not work with that change. That is the reason behind using a local fork of spaCy.

Coming to the problem: I switched to the spaCy installation in the same Python virtualenv where Prodigy is running, and I get the same error. I have also cleared out PYTHONPATH.

Traceback (most recent call last):
  File "test_code.py", line 3, in <module>
    nlp = spacy.load('/tmp/model')
  File "/Users/Abhishek/Projects/Python-Projects/Python-VEs/NM-NLP/lib/python3.6/site-packages/spacy/__init__.py", line 21, in load
    return util.load_model(name, **overrides)
  File "/Users/Abhishek/Projects/Python-Projects/Python-VEs/NM-NLP/lib/python3.6/site-packages/spacy/util.py", line 116, in load_model
    return load_model_from_path(Path(name), **overrides)
  File "/Users/Abhishek/Projects/Python-Projects/Python-VEs/NM-NLP/lib/python3.6/site-packages/spacy/util.py", line 156, in load_model_from_path
    return nlp.from_disk(model_path)
  File "/Users/Abhishek/Projects/Python-Projects/Python-VEs/NM-NLP/lib/python3.6/site-packages/spacy/language.py", line 647, in from_disk
    util.from_disk(path, deserializers, exclude)
  File "/Users/Abhishek/Projects/Python-Projects/Python-VEs/NM-NLP/lib/python3.6/site-packages/spacy/util.py", line 511, in from_disk
    reader(path / key)
  File "/Users/Abhishek/Projects/Python-Projects/Python-VEs/NM-NLP/lib/python3.6/site-packages/spacy/language.py", line 643, in <lambda>
    deserializers[name] = lambda p, proc=proc: proc.from_disk(p, vocab=False)
  File "pipeline.pyx", line 643, in spacy.pipeline.Tagger.from_disk
  File "/Users/Abhishek/Projects/Python-Projects/Python-VEs/NM-NLP/lib/python3.6/site-packages/spacy/util.py", line 511, in from_disk
    reader(path / key)
  File "pipeline.pyx", line 625, in spacy.pipeline.Tagger.from_disk.load_model
  File "pipeline.pyx", line 534, in spacy.pipeline.Tagger.Model
ValueError: [T008] Bad configuration of Tagger. This is probably a bug within spaCy. We changed the name of an internal attribute for loading pre-trained vectors, and the class has been passed the old name (pretrained_dims) but not the new name (pretrained_vectors).

It would be extremely helpful if you could point me to a place where I can write a custom tokenizer with multiple pattern matching capabilities, so that I can avoid this problem entirely.
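For what it's worth, the workaround I have been considering is to join all my patterns into a single regex with alternation and pass that as token_match when building the tokenizer, roughly like the sketch below (the two patterns are just placeholders for my real ones), but I am not sure it is the recommended approach:

import re
import spacy
from spacy.tokenizer import Tokenizer

# placeholder patterns: swap in the real domain-specific regexes
PATTERNS = [r"\d+\.\d+\.\d+\.\d+", r"[A-Za-z]+\d+/\d+"]

# token_match only accepts a single callable, so combine the patterns with alternation
combined = re.compile("|".join("(?:{})".format(p) for p in PATTERNS))

nlp = spacy.load("en_core_web_sm")
nlp.tokenizer = Tokenizer(
    nlp.vocab,
    rules=nlp.Defaults.tokenizer_exceptions,
    prefix_search=nlp.tokenizer.prefix_search,
    suffix_search=nlp.tokenizer.suffix_search,
    infix_finditer=nlp.tokenizer.infix_finditer,
    token_match=combined.match,
)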

Ah, understood. I’m pretty sure I’ll be able to help you get that working without running a fork. Let’s focus on the textcat problem first though.

Could you paste the contents of the meta.json in the output model? Also, can you run python -m spacy validate and paste the output? I just want to check that the right version of the model is there.

spaCy validate output:

  [Abhishek:~/Projects/Git-Repositories/spaCy] [NM-NLP] master(+25/-6) 4s ± python -m spacy download en
Requirement already satisfied: en_core_web_sm==2.0.0 from https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-2.0.0/en_core_web_sm-2.0.0.tar.gz#egg=en_core_web_sm==2.0.0 in /Users/Abhishek/Projects/Python-Projects/Python-VEs/NM-NLP/lib/python3.6/site-packages (2.0.0)

    Linking successful
    /Users/Abhishek/Projects/Python-Projects/Python-VEs/NM-NLP/lib/python3.6/site-packages/en_core_web_sm
    --> /Users/Abhishek/Projects/Git-Repositories/spaCy/spacy/data/en

    You can now load the model via spacy.load('en')

[Abhishek:~/Projects/Git-Repositories/spaCy] [NM-NLP] master(+25/-6) 4s ± python -m spacy validate

    Installed models (spaCy v2.0.18)
    /Users/Abhishek/Projects/Git-Repositories/spaCy/spacy

    TYPE        NAME                  MODEL                 VERSION
    package     en-core-web-sm        en_core_web_sm        2.0.0    ✔
    package     en-core-web-lg        en_core_web_lg        2.0.0    ✔
    link        en_core_web_lg        en_core_web_lg        2.0.0    ✔
    link        en_core_web_sm        en_core_web_sm        2.0.0    ✔
    link        en                    en_core_web_sm        2.0.0    ✔

meta.json contents:

{
  "lang":"en",
  "name":"model",
  "version":"0.0.0",
  "spacy_version":">=2.0.18",
  "description":"",
  "author":"",
  "email":"",
  "url":"",
  "license":"",
  "vectors":{
    "width":0,
    "vectors":0,
    "keys":0,
    "name":null
  },
  "pipeline":[
    "sbd",
    "textcat"
  ]
}

I added in “tagger”, “parser” and “ner” as @ines directed, but every time I train, the meta.json is rewritten to the above. I have not tried it with ner.batch-train, but this is with textcat.batch-train.

Thanks! I really feel like there must be something I’m missing. Could you paste the command you’re running for textcat.batch-train?

prodigy textcat.batch-train event_labels --output-model /tmp/model --eval-split 0.8

I am actually following most of the commands from the documentation. :slight_smile:

And this is what I am using to annotate a new label:

prodigy textcat.teach event_labels en_core_web_lg /Users/Abhishek/Downloads/training_dataset.txt --label LOGIN_RELATED_FAIL

Also, do you want me to create a new thread for the tokenization problem?

Aha! Okay, you’re missing the second positional argument, which specifies the input model. Try:

prodigy textcat.batch-train event_labels en_core_web_sm --output-model /tmp/model --eval-split 0.8

Without that second argument, textcat.batch-train was defaulting to a blank model, which didn’t have the POS tagger loaded.
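Once that finishes, you can sanity-check the output model with something like this (the path matches the --output-model above, and the example sentence is made up):

import spacy

nlp = spacy.load("/tmp/model")
print(nlp.pipe_names)  # should now include the tagger and parser alongside textcat

doc = nlp("The login attempt failed twice.")
print([(t.text, t.pos_) for t in doc])  # POS tags confirm the tagger was kept
print(doc.cats)                         # scores for your text categories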

Yes, I think a new thread for the tokenization problem will be best. I’ll give you some code for it, so it’s best if it’s easily searchable.

That was such a stupid mistake. I cannot believe I missed something as elementary as that. My apologies. Thank you so much and I will create a new thread for the tokenization question. You folks are amazing!
