I had a similar issue, but the cause was different: we hit MySQL's BLOB size limit.
The command I ran:
prodigy train ner hc_18052020_GOLD data/ecommerce/2020_04/tmp_model_nc_150520_v2 --eval-split 0.3 --n-iter 50 --output /data/ecommerce/2020_04/hc_18052020_gold
The error I got:
✔ Loaded model
'/home/ubuntu/prodigy/data/ecommerce/2020_04/tmp_model_nonclaims_150520_v2'
Traceback (most recent call last):
File "/home/ubuntu/anaconda3/envs/prodigy/lib/python3.8/runpy.py", line 193, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/home/ubuntu/anaconda3/envs/prodigy/lib/python3.8/runpy.py", line 86, in _run_code
exec(code, run_globals)
File "/home/ubuntu/anaconda3/envs/prodigy/lib/python3.8/site-packages/prodigy/__main__.py", line 60, in <module>
controller = recipe(*args, use_plac=True)
File "cython_src/prodigy/core.pyx", line 213, in prodigy.core.recipe.recipe_decorator.recipe_proxy
File "/home/ubuntu/anaconda3/envs/prodigy/lib/python3.8/site-packages/plac_core.py", line 367, in call
cmd, result = parser.consume(arglist)
File "/home/ubuntu/anaconda3/envs/prodigy/lib/python3.8/site-packages/plac_core.py", line 232, in consume
return cmd, self.func(*(args + varargs + extraopts), **kwargs)
File "/home/ubuntu/anaconda3/envs/prodigy/lib/python3.8/site-packages/prodigy/recipes/train.py", line 103, in train
data, labels = merge_data(nlp, **merge_cfg)
File "/home/ubuntu/anaconda3/envs/prodigy/lib/python3.8/site-packages/prodigy/recipes/train.py", line 359, in merge_data
ner_examples = load_examples(DB, ner_datasets)
File "/home/ubuntu/anaconda3/envs/prodigy/lib/python3.8/site-packages/prodigy/recipes/train.py", line 528, in load_examples
examples = db.get_dataset(set_id) or []
File "/home/ubuntu/anaconda3/envs/prodigy/lib/python3.8/site-packages/prodigy/components/db.py", line 337, in get_dataset
return [eg.load() for eg in examples]
File "/home/ubuntu/anaconda3/envs/prodigy/lib/python3.8/site-packages/prodigy/components/db.py", line 337, in <listcomp>
return [eg.load() for eg in examples]
File "/home/ubuntu/anaconda3/envs/prodigy/lib/python3.8/site-packages/prodigy/components/db.py", line 99, in load
return srsly.json_loads(content)
File "/home/ubuntu/anaconda3/envs/prodigy/lib/python3.8/site-packages/srsly/_json_api.py", line 38, in json_loads
return ujson.loads(data)
ValueError: Unexpected character in found when decoding object value
I was unable to load a dataset that Prodigy had previously saved to the MySQL database. On checking the individual annotations, I found that for some examples the text plus annotations exceeded 65,535 bytes, the maximum size of MySQL's BLOB/TEXT column types, so MySQL truncated the stored JSON and left a corrupt record.
Nothing looked wrong while the annotations were being saved; the problem only surfaced when I tried to load them back to train a model.
This is how the stored JSON looked after annotation:
{"text":"pure nv balancing conditioner: infused with argan oil, keratin, collagen, natural vitamins, and lavender for smoother, more manageable hair- sulfate & sodium chloride free (33.8 oz bottle). deep conditions-pure nv balancing conditioner is formulated to improve the appearance and feel of your hair by moisturizing your dry damaged locks building body, improving luster and ..", "spans":[...], ..., "tokens": [..., {"text":"hair"
The JSON ends abruptly, and when I checked its size it was exactly 65,535 characters.
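For anyone checking their own data, this is roughly how I found the truncated rows by going straight at the raw table. It is a minimal sketch; the example table and content column names are what I believe Prodigy's default MySQL schema uses, so treat them (and the connection details) as assumptions:

import pymysql
import srsly

# Placeholder connection details for my setup
conn = pymysql.connect(host="localhost", user="prodigy",
                       password="***", database="prodigy")

with conn.cursor() as cur:
    # Prodigy stores each annotated example as a JSON blob in the `content`
    # column of the `example` table (names assumed from the default schema)
    cur.execute("SELECT id, content FROM example")
    for row_id, content in cur.fetchall():
        try:
            srsly.json_loads(content)
        except ValueError:
            # Rows sitting at the 65,535-byte BLOB/TEXT ceiling were truncated
            print(f"example {row_id}: {len(content)} bytes, invalid JSON")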
Is there a workaround for this? Should I break up the input documents, or move to a different database?
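If splitting the input documents is the way to go, this is roughly what I have in mind: chunking the raw texts on paragraph boundaries before annotation so no single task can approach the column limit. A rough sketch only; the chunk size and file paths are placeholders:

import srsly

MAX_CHARS = 20000  # conservative cap, well below the 65,535-byte column limit

def split_task(eg, max_chars=MAX_CHARS):
    # Split one input task into smaller tasks on paragraph boundaries
    chunks, current = [], ""
    for para in eg["text"].split("\n"):
        if current and len(current) + len(para) + 1 > max_chars:
            chunks.append(current)
            current = para
        else:
            current = f"{current}\n{para}" if current else para
    if current:
        chunks.append(current)
    return [{**eg, "text": chunk} for chunk in chunks]

examples = srsly.read_jsonl("raw_input.jsonl")  # placeholder path
split_examples = [chunk for eg in examples for chunk in split_task(eg)]
srsly.write_jsonl("raw_input_split.jsonl", split_examples)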