Hi, I'm new to spaCy and Prodigy. I want to train a named entity recognizer and followed the tutorial video that Ines uploaded on YouTube. What I'm trying to build is a model that labels full-time or part-time mentions in job advertisements.
But instead of sense2vec, I downloaded the English fastText .zip file (Common Crawl, without subword information).
Then I got an AttributeError when I ran prodigy train ner.
Here are the commands that I used.
- I initialized a vector model using spacy init-model:
spacy init-model en ft_common_crawl --vectors-loc crawl-300d-2M.vec.zip
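(For reference, this is roughly how the result can be sanity-checked; the shape check below assumes the 2M-word Common Crawl vectors loaded correctly.)

```python
import spacy

# Load the model created by `spacy init-model` and inspect its vector table
nlp = spacy.load("./ft_common_crawl")
print(nlp.vocab.vectors.shape)           # should be roughly (2000000, 300) for crawl-300d-2M
print(nlp.vocab["fulltime"].has_vector)  # True if the word is in the vector table
```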
- Then I pretrained the model with a .jsonl file that I created.
The original file was a .txt file containing a number of sentences (one sentence per line).
I converted it to .jsonl format using the function in this thread (jsonl format - #2 by ines); a rough sketch of the conversion is shown after the pretrain command below.
So the data files are 1) job_all.jsonl and 2) jobs_sample.jsonl.
python -m spacy pretrain job_all.jsonl ./ft_common_crawl ./ft_common_crawl/pretrain --dropout 0.3 --batch-size 16 --n-iter 500
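(Roughly what the conversion looked like; the input file name here is just a placeholder.)

```python
import json

# Convert a plain-text file (one sentence per line) into JSONL with a "text" field,
# which is the format `spacy pretrain` and Prodigy expect.
with open("jobs.txt", encoding="utf8") as f_in, open("job_all.jsonl", "w", encoding="utf8") as f_out:
    for line in f_in:
        line = line.strip()
        if line:
            f_out.write(json.dumps({"text": line}) + "\n")
```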
- I ran terms.teach with the custom model path created in the first step:
prodigy terms.teach fulltime_terms ./ft_common_crawl --seeds "fulltime, full time, full-time, whole day, regular job"
prodigy terms.teach parttime_terms ./ft_common_crawl --seeds "parttime, part time, part-time, side job, extra job"
prodigy terms.to-patterns fulltime_terms fulltime_patterns.jsonl --label FULLTIME --spacy-model blank:en
prodigy terms.to-patterns parttime_terms parttime_patterns.jsonl --label PARTTIME --spacy-model blank:en
prodigy db-in fulltime_patterns fulltime_patterns.jsonl
prodigy db-in parttime_patterns parttime_patterns.jsonl
prodigy db-merge fulltime_patterns,parttime_patterns full_part_patterns
prodigy db-out full_part_patterns full_part_patterns.jsonl
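(For reference, the merged patterns file can be eyeballed with a quick sketch like this; the exact shape of each pattern line depends on the Prodigy version.)

```python
import json

# Print the first few entries of the merged patterns file to check labels and patterns
with open("full_part_patterns.jsonl", encoding="utf8") as f:
    for i, line in enumerate(f):
        print(json.loads(line))
        if i >= 4:
            break
```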
- Then I manually annotated 500 sample sentences:
prodigy ner.manual full_part_jobs blank:en jobs_sample.jsonl --label FULLTIME,PARTTIME --patterns full_part_patterns.jsonl
- Then I tried to train a NER model using prodigy train ner, initializing with the pretrained tok2vec weights:
prodigy train ner full_part_jobs ./ft_common_crawl --init-tok2vec ./ft_common_crawl/pretrain/model499.bin --output ./tmp_model --eval-split 0.2
Then it complained as follows:
```
(base) sujoungbaeck@Sujoungs-MacBook-Pro sujoung % ! prodigy train ner full_part_jobs ./ft_common_crawl --init-tok2vec ./ft_common_crawl/pretrain/model499.bin --output ./tmp_model --eval-split 0.2
✔ Loaded model './ft_common_crawl'
Created and merged data for 387 total examples
Using 310 train / 77 eval (split 20%)
Component: ner | Batch size: compounding | Dropout: 0.2 | Iterations: 10
✔ Initializing with tok2vec weights ./ft_common_crawl/pretrain/model499.bin
embed_rows: 2000 | require_vectors: False | cnn_maxout_pieces: 3 | token_vector_width: 96 | conv_depth: 4 | nr_feature_tokens: 3 | pretrained_vectors: en_model.vectors | pretrained_dims: 300
Traceback (most recent call last):
  File "/Users/sujoungbaeck/anaconda3/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/Users/sujoungbaeck/anaconda3/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/Users/sujoungbaeck/anaconda3/lib/python3.7/site-packages/prodigy/__main__.py", line 60, in <module>
    controller = recipe(*args, use_plac=True)
  File "cython_src/prodigy/core.pyx", line 213, in prodigy.core.recipe.recipe_decorator.recipe_proxy
  File "/Users/sujoungbaeck/anaconda3/lib/python3.7/site-packages/plac_core.py", line 367, in call
    cmd, result = parser.consume(arglist)
  File "/Users/sujoungbaeck/anaconda3/lib/python3.7/site-packages/plac_core.py", line 232, in consume
    return cmd, self.func(*(args + varargs + extraopts), **kwargs)
  File "/Users/sujoungbaeck/anaconda3/lib/python3.7/site-packages/prodigy/recipes/train.py", line 139, in train
    load_pretrained_tok2vec(pipe, init_tok2vec, require=True)
  File "cython_src/prodigy/util.pyx", line 520, in prodigy.util.load_pretrained_tok2vec
  File "/Users/sujoungbaeck/anaconda3/lib/python3.7/site-packages/thinc/neural/_classes/model.py", line 395, in from_bytes
    dest = getattr(layer, name)
AttributeError: 'FunctionLayer' object has no attribute 'vectors'
```
My cautious guess is that I didn't pass the JSONL-formatted vocabulary file when I ran spacy init-model.
But although the documentation says the vocab command outputs a ready-to-use file, I don't understand exactly how to get that .jsonl file. I tried to_disk, but it didn't work: it generates a vocab folder containing "key2row", "lexemes.bin", "strings.json" and "vectors", which I already had in the initialized model directory.
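(Roughly what I tried; "vocab_out" is just a placeholder directory name.)

```python
import spacy

# Load the initialized model and write its vocab out with to_disk --
# this produces key2row, lexemes.bin, strings.json and vectors,
# not the JSONL vocabulary file I was hoping to get.
nlp = spacy.load("./ft_common_crawl")
nlp.vocab.to_disk("vocab_out")  # "vocab_out" is a placeholder path
```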
How can I solve this AttributeError?