Error in prodigy data-to-spacy command

First of all thanks for the excellent tool.

I'm new to spaCy, Prodigy, and AI as a whole.

When I finished training in Prodigy, I wanted to export my data to be used in spaCy, but the command

prodigy data-to-spacy ./export_to_spacy/teste.json --lang pt --ner mydataset --base-model pt_core_news_lg

is showing the following error:

Created and merged data for 43 total examples

Type Total Merged
---- ----- ------
NER 47 43

Traceback (most recent call last):
  File "/usr/lib/python3.6/", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/usr/lib/python3.6/", line 85, in _run_code
    exec(code, run_globals)
  File "/home/azureuser/prodigy/lib/python3.6/site-packages/prodigy/", line 53, in <module>
    controller = recipe(*args, use_plac=True)
  File "cython_src/prodigy/core.pyx", line 321, in prodigy.core.recipe.recipe_decorator.recipe_proxy
  File "/home/azureuser/prodigy/lib/python3.6/site-packages/", line 367, in call
    cmd, result = parser.consume(arglist)
  File "/home/azureuser/prodigy/lib/python3.6/site-packages/", line 232, in consume
    return cmd, self.func(*(args + varargs + extraopts), **kwargs)
  File "/home/azureuser/prodigy/lib/python3.6/site-packages/prodigy/recipes/", line 345, in data_to_spacy
    json_data = [docs_to_json([doc], id=i) for i, doc in docs]
  File "/home/azureuser/prodigy/lib/python3.6/site-packages/prodigy/recipes/", line 345, in <listcomp>
    json_data = [docs_to_json([doc], id=i) for i, doc in docs]
  File "gold.pyx", line 881, in
  File "doc.pyx", line 652, in sents
ValueError: [E030] Sentence boundaries unset. You can add the 'sentencizer' component to the pipeline with: nlp.add_pipe(nlp.create_pipe('sentencizer')) Alternatively, add the dependency parser, or set sentence boundaries by setting doc[i].is_sent_start.

I've done a lot of research and still haven't figured out how to fix it.

Hi! Could you try running it without --base-model pt_core_news_lg set? Since you're only exporting the data for spaCy, the base model shouldn't be necessary here; it's mostly relevant if you have a custom model with custom tokenization or sentence segmentation.

I think the underlying problem is that the sentence segmentation normally performed by the parser doesn't run during the export, so spaCy complains when it tries to iterate over the sentences. This should also be resolved in the new Prodigy v1.11 and spaCy v3. (Note that the usage of the data-to-spacy command is slightly different there, and spaCy v3 isn't backwards-compatible with spaCy v2 models. So if you want to upgrade, I'd recommend using a fresh virtual environment.)
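For context, the [E030] error means the Doc objects have no sentence boundaries set when spaCy iterates over doc.sents. As the error message itself suggests, adding the rule-based sentencizer is one way to set those boundaries. A minimal sketch, using a blank Portuguese pipeline and example text of my own (not from the dataset in this thread):

```python
import spacy

# Blank Portuguese pipeline: no parser, so iterating doc.sents
# without a sentence component would raise [E030].
nlp = spacy.blank("pt")

# The rule-based sentencizer sets sentence starts on punctuation.
# (spaCy v3 API shown; in v2 it was nlp.add_pipe(nlp.create_pipe("sentencizer")).)
nlp.add_pipe("sentencizer")

doc = nlp("Primeira frase. Segunda frase.")
print([sent.text for sent in doc.sents])
```

This is only a workaround sketch for the error itself; as noted above, dropping the base model from the export command avoids the problem entirely.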

Understood. I ran it without --base-model pt_core_news_lg and it works, thanks!

Great to hear it, thanks for reporting back!