data-to-spacy command error

Hi,
I recently upgraded Prodigy to 1.9.3 and am using spaCy 2.2.3. I used to use the convert command to get my data into spaCy's training format.

python -m spacy convert -l en -t json -c jsonl prodigy-data.jsonl /spacy_dir

After installing the new version of Prodigy, I used the data-to-spacy command to get the spaCy format instead.

python -m prodigy data-to-spacy .\spacy_dataset\train-data.json .\spacy_dataset\eval-data.json --lang en --ner data_shuffled_cleaned

data_shuffled_cleaned is the dataset I trained successfully with the prodigy train CLI, reaching 90% overall accuracy. I am now trying to use the same dataset with the spacy train CLI.

So I ran the spacy train command, but I get the error below. May I know why?

python -m spacy train en model train-data.json eval-data.json --pipeline ner -v en_vectors_web_lg --verbose

Traceback (most recent call last):
  File "*************/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "************/python3/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "*******/python3/lib/python3.6/site-packages/spacy/__main__.py", line 33, in <module>
    plac.call(commands[command], sys.argv[1:])
  File "***********/python3/lib/python3.6/site-packages/plac_core.py", line 367, in call
    cmd, result = parser.consume(arglist)
  File "*******/python3/lib/python3.6/site-packages/plac_core.py", line 232, in consume
    return cmd, self.func(*(args + varargs + extraopts), **kwargs)
  File "************/python3/lib/python3.6/site-packages/spacy/cli/train.py", line 230, in train
    corpus = GoldCorpus(train_path, dev_path, limit=n_examples)
  File "gold.pyx", line 224, in spacy.gold.GoldCorpus.__init__
  File "gold.pyx", line 235, in spacy.gold.GoldCorpus.write_msgpack
  File "gold.pyx", line 280, in read_tuples
  File "gold.pyx", line 545, in read_json_file
  File "gold.pyx", line 592, in _json_iterate
OverflowError: value too large to convert to int

Sorry about that! That's a known bug when the number of documents is too large for an int. It will be fixed in the next version of spacy.

The easiest solution with spacy v2.2.3 is to split your training file into multiple .json files. If you provide a directory instead of a file as the train or dev path, spacy train will find all the .json files within that directory (also in all the subdirectories).
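
If it helps, here is a minimal sketch of one way to do the split. It assumes the file written by data-to-spacy is a single top-level JSON list of documents and that it fits in memory when loaded with the standard json module. The function name split_json_file and the docs_per_file parameter are made up for this example and are not part of spaCy or Prodigy.

```python
import json
from pathlib import Path


def split_json_file(src, out_dir, docs_per_file=1000):
    """Split a JSON file holding a top-level list into smaller .json files.

    Hypothetical helper: assumes `src` contains one JSON list and that the
    whole list can be loaded into memory at once.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)

    with Path(src).open(encoding="utf8") as f:
        docs = json.load(f)

    paths = []
    for start in range(0, len(docs), docs_per_file):
        # Each part is itself a valid JSON list with the same item structure.
        path = out_dir / f"part-{start // docs_per_file:04d}.json"
        with path.open("w", encoding="utf8") as f:
            json.dump(docs[start:start + docs_per_file], f)
        paths.append(path)
    return paths
```

You would call it once for the training file and once for the evaluation file, for example split_json_file("train-data.json", "train_parts") and split_json_file("eval-data.json", "eval_parts"), and then pass the two directories to spacy train in place of the two .json files.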