data-to-spacy command error

Jyothi · January 7, 2020, 2:06pm

Hi,
I recently upgraded Prodigy to 1.9.3 and am using spaCy 2.2.3. I used to use the convert command to get my data into spaCy's training format.

python -m spacy convert -l en -t json -c jsonl prodigy-data.jsonl /spacy_dir

After installing the new version of Prodigy, I used the data-to-spacy command to get the spaCy format instead.

python -m prodigy data-to-spacy .\spacy_dataset\train-data.json .\spacy_dataset\eval-data.json --lang en --ner data_shuffled_cleaned

data_shuffled_cleaned is the dataset I trained successfully with the prodigy train CLI, reaching 90% overall accuracy. I am trying to use the same dataset with the spacy train CLI.

So I ran the spacy train command, but I get the error below. May I know why?

python -m spacy train en model train-data.json eval-data.json --pipeline ner -v en_vectors_web_lg --verbose

Traceback (most recent call last):
  File "*************/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "************/python3/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "*******/python3/lib/python3.6/site-packages/spacy/__main__.py", line 33, in <module>
    plac.call(commands[command], sys.argv[1:])
  File "***********/python3/lib/python3.6/site-packages/plac_core.py", line 367, in call
    cmd, result = parser.consume(arglist)
  File "*******/python3/lib/python3.6/site-packages/plac_core.py", line 232, in consume
    return cmd, self.func(*(args + varargs + extraopts), **kwargs)
  File "************/python3/lib/python3.6/site-packages/spacy/cli/train.py", line 230, in train
    corpus = GoldCorpus(train_path, dev_path, limit=n_examples)
  File "gold.pyx", line 224, in spacy.gold.GoldCorpus.__init__
  File "gold.pyx", line 235, in spacy.gold.GoldCorpus.write_msgpack
  File "gold.pyx", line 280, in read_tuples
  File "gold.pyx", line 545, in read_json_file
  File "gold.pyx", line 592, in _json_iterate
OverflowError: value too large to convert to int

adriane · January 8, 2020, 4:44pm

Sorry about that! That's a known bug that occurs when the number of documents is too large for an int. It will be fixed in the next version of spaCy.

The easiest solution with spaCy v2.2.3 is to split your training file into multiple .json files. If you provide a directory instead of a file as the train or dev path, spacy train will find all the .json files within that directory, including those in its subdirectories.
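
As a rough illustration of the split, here is a minimal sketch. It assumes the file written by data-to-spacy is a single top-level JSON list of documents and that it is small enough to load into memory with Python's json module. The function name split_json_file and the docs_per_file parameter are made up for this example and are not part of spaCy or Prodigy.

```python
import json
from pathlib import Path


def split_json_file(src, out_dir, docs_per_file=1000):
    """Split a JSON file holding a top-level list into smaller .json files.

    Hypothetical helper: assumes `src` contains one JSON list and that the
    whole list fits in memory.
    """
    src = Path(src)
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)

    with src.open(encoding="utf8") as f:
        docs = json.load(f)
    if not isinstance(docs, list):
        raise ValueError("expected a top-level JSON list in %s" % src)

    paths = []
    # Write consecutive slices of the list, one output file per slice.
    for start in range(0, len(docs), docs_per_file):
        chunk = docs[start:start + docs_per_file]
        index = start // docs_per_file
        path = out_dir / "{}-{:04d}.json".format(src.stem, index)
        with path.open("w", encoding="utf8") as f:
            json.dump(chunk, f)
        paths.append(path)
    return paths
```

You could call it once for each file, for example split_json_file("train-data.json", "train-split") and split_json_file("eval-data.json", "eval-split"), and then pass the two directories (the names are just examples) in place of the files:

python -m spacy train en model train-split eval-split --pipeline ner -v en_vectors_web_lg --verbose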
