Hi, I'm new to spaCy and Prodigy. I want to train a named entity recognizer and followed the tutorial video that Ines uploaded on YouTube. What I'm trying to build is a model that labels full-time or part-time mentions in job advertisements.
But instead of sense2vec, I downloaded the English fastText .zip file (Common Crawl, without subword information).
Then I got an AttributeError when I ran prodigy train ner.
Here are the commands that I used.
- I initialized a vector model using spacy init-model:
spacy init-model en ft_common_crawl --vectors-loc crawl-300d-2M.vec.zip
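(For reference, this is roughly how the result can be sanity-checked; the shape check below assumes the 2M-word Common Crawl vectors loaded correctly.)

```python
import spacy

# Load the model created by `spacy init-model` and inspect its vector table
nlp = spacy.load("./ft_common_crawl")
print(nlp.vocab.vectors.shape)           # should be roughly (2000000, 300) for crawl-300d-2M
print(nlp.vocab["fulltime"].has_vector)  # True if the word is in the vector table
```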
- Then I pretrained the model with a .jsonl file that I created.
The original file was a .txt file containing a number of sentences (one sentence per line).
I converted it to .jsonl format using the function in this thread (jsonl format - #2 by ines); a rough sketch of the conversion is shown after the pretrain command below.
So the data files are 1) job_all.jsonl and 2) jobs_sample.jsonl.
python -m spacy pretrain job_all.jsonl ./ft_common_crawl ./ft_common_crawl/pretrain --dropout 0.3 --batch-size 16 --n-iter 500
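(Roughly what the conversion looked like; the input file name here is just a placeholder.)

```python
import json

# Convert a plain-text file (one sentence per line) into JSONL with a "text" field,
# which is the format `spacy pretrain` and Prodigy expect.
with open("jobs.txt", encoding="utf8") as f_in, open("job_all.jsonl", "w", encoding="utf8") as f_out:
    for line in f_in:
        line = line.strip()
        if line:
            f_out.write(json.dumps({"text": line}) + "\n")
```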
- I ran terms.teach with the custom model path created in the first step:
prodigy terms.teach fulltime_terms ./ft_common_crawl --seeds "fulltime, full time, full-time, whole day, regular job"
prodigy terms.teach parttime_terms ./ft_common_crawl --seeds "parttime, part time, part-time, side job, extra job"
prodigy terms.to-patterns fulltime_terms fulltime_patterns.jsonl --label FULLTIME --spacy-model blank:en
prodigy terms.to-patterns parttime_terms parttime_patterns.jsonl --label PARTTIME --spacy-model blank:en
prodigy db-in fulltime_patterns fulltime_patterns.jsonl
prodigy db-in parttime_patterns parttime_patterns.jsonl
prodigy db-merge fulltime_patterns,parttime_patterns full_part_patterns
prodigy db-out full_part_patterns full_part_patterns.jsonl
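(For reference, the merged patterns file can be eyeballed with a quick sketch like this; the exact shape of each pattern line depends on the Prodigy version.)

```python
import json

# Print the first few entries of the merged patterns file to check labels and patterns
with open("full_part_patterns.jsonl", encoding="utf8") as f:
    for i, line in enumerate(f):
        print(json.loads(line))
        if i >= 4:
            break
```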
- Then I manually annotated 500 sample sentences:
prodigy ner.manual full_part_jobs blank:en jobs_sample.jsonl --label FULLTIME,PARTTIME --patterns full_part_patterns.jsonl
- Then I tried to train a NER model using prodigy train ner, initializing with the pretrained tok2vec weights:
prodigy train ner full_part_jobs ./ft_common_crawl --init-tok2vec ./ft_common_crawl/pretrain/model499.bin --output ./tmp_model --eval-split 0.2
Then it complained as follows:
```
(base) sujoungbaeck@Sujoungs-MacBook-Pro sujoung % ! prodigy train ner full_part_jobs ./ft_common_crawl --init-tok2vec ./ft_common_crawl/pretrain/model499.bin --output ./tmp_model --eval-split 0.2
✔ Loaded model './ft_common_crawl'
Created and merged data for 387 total examples
Using 310 train / 77 eval (split 20%)
Component: ner | Batch size: compounding | Dropout: 0.2 | Iterations: 10
✔ Initializing with tok2vec weights ./ft_common_crawl/pretrain/model499.bin
embed_rows: 2000 | require_vectors: False | cnn_maxout_pieces: 3 | token_vector_width: 96 | conv_depth: 4 | nr_feature_tokens: 3 | pretrained_vectors: en_model.vectors | pretrained_dims: 300
Traceback (most recent call last):
  File "/Users/sujoungbaeck/anaconda3/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/Users/sujoungbaeck/anaconda3/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/Users/sujoungbaeck/anaconda3/lib/python3.7/site-packages/prodigy/__main__.py", line 60, in <module>
    controller = recipe(*args, use_plac=True)
  File "cython_src/prodigy/core.pyx", line 213, in prodigy.core.recipe.recipe_decorator.recipe_proxy
  File "/Users/sujoungbaeck/anaconda3/lib/python3.7/site-packages/plac_core.py", line 367, in call
    cmd, result = parser.consume(arglist)
  File "/Users/sujoungbaeck/anaconda3/lib/python3.7/site-packages/plac_core.py", line 232, in consume
    return cmd, self.func(*(args + varargs + extraopts), **kwargs)
  File "/Users/sujoungbaeck/anaconda3/lib/python3.7/site-packages/prodigy/recipes/train.py", line 139, in train
    load_pretrained_tok2vec(pipe, init_tok2vec, require=True)
  File "cython_src/prodigy/util.pyx", line 520, in prodigy.util.load_pretrained_tok2vec
  File "/Users/sujoungbaeck/anaconda3/lib/python3.7/site-packages/thinc/neural/_classes/model.py", line 395, in from_bytes
    dest = getattr(layer, name)
AttributeError: 'FunctionLayer' object has no attribute 'vectors'
```
My cautious guess is that I didn't pass the JSONL-formatted vocabulary file when I ran spacy init-model.
But although the documentation says the vocab command outputs a ready-to-use file, I don't understand exactly how to get that .jsonl file. I tried to_disk, but it didn't work: it generates a vocab folder containing "key2row", "lexemes.bin", "strings.json" and "vectors", which I already had in the initialized model directory.
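(Roughly what I tried; "vocab_out" is just a placeholder directory name.)

```python
import spacy

# Load the initialized model and write its vocab out with to_disk --
# this produces key2row, lexemes.bin, strings.json and vectors,
# not the JSONL vocabulary file I was hoping to get.
nlp = spacy.load("./ft_common_crawl")
nlp.vocab.to_disk("vocab_out")  # "vocab_out" is a placeholder path
```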
How can I solve this AttributeError?