Error in prodigy data-to-spacy command

First of all, thanks for the excellent tool.

I'm new to spaCy, Prodigy and AI as a whole.

When I finished training in Prodigy, I wanted to export a model to be used in spaCy, but the command

prodigy data-to-spacy ./export_to_spacy/teste.json --lang pt --ner mydataset --base-model pt_core_news_lg

is showing the following error:

Created and merged data for 43 total examples

Type   Total   Merged
----   -----   ------
NER    47      43

Traceback (most recent call last):
  File "/usr/lib/python3.6/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/usr/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home/azureuser/prodigy/lib/python3.6/site-packages/prodigy/__main__.py", line 53, in <module>
    controller = recipe(*args, use_plac=True)
  File "cython_src/prodigy/core.pyx", line 321, in prodigy.core.recipe.recipe_decorator.recipe_proxy
  File "/home/azureuser/prodigy/lib/python3.6/site-packages/plac_core.py", line 367, in call
    cmd, result = parser.consume(arglist)
  File "/home/azureuser/prodigy/lib/python3.6/site-packages/plac_core.py", line 232, in consume
    return cmd, self.func(*(args + varargs + extraopts), **kwargs)
  File "/home/azureuser/prodigy/lib/python3.6/site-packages/prodigy/recipes/train.py", line 345, in data_to_spacy
    json_data = [docs_to_json([doc], id=i) for i, doc in docs]
  File "/home/azureuser/prodigy/lib/python3.6/site-packages/prodigy/recipes/train.py", line 345, in <listcomp>
    json_data = [docs_to_json([doc], id=i) for i, doc in docs]
  File "gold.pyx", line 881, in spacy.gold.docs_to_json
  File "doc.pyx", line 652, in sents
ValueError: [E030] Sentence boundaries unset. You can add the 'sentenizer' component to the pipeline with: nlp.add_pipe(nlp.create_pipe('sentencizer')) Alternatively, add the dependency parser, or set sentence boundaries by setting doc[i].is_sent_start.

I've done a lot of research and still haven't figured out how to fix it.

Hi! Could you try running it without --base-model pt_core_news_lg? Since you're only exporting the data for spaCy, the base model isn't really needed here. It's mostly relevant if you have a custom model with custom tokenization or sentence segmentation.

I think the underlying problem is that the sentence segmentation normally performed by the parser doesn't run during the export, so spaCy complains that sentence boundaries are unset. This should also be resolved in the new Prodigy v1.11 with spaCy v3. (Note that the usage of the data-to-spacy command is slightly different in v1.11, and spaCy v3 isn't compatible with spaCy v2. So if you want to upgrade, I'd recommend using a fresh virtual environment.)
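
In case it helps to see what E030 means outside of Prodigy, here's a minimal sketch using plain spaCy. It assumes a blank Portuguese pipeline (no parser, so nothing sets sentence boundaries) and a made-up example text. It only illustrates the error and the sentencizer suggestion from the message; it's not what the data-to-spacy recipe does internally, and you don't need it if dropping --base-model already works for you.

```python
import spacy

# Blank Portuguese pipeline: tokenizer only, no parser or sentencizer.
nlp = spacy.blank("pt")
text = "Primeira frase. Segunda frase."  # made-up example text

# Without any component that sets sentence boundaries, iterating
# over doc.sents raises the same E030 error as in the traceback.
error_message = None
try:
    list(nlp(text).sents)
except ValueError as err:
    error_message = str(err)
print(error_message)

# Add the rule-based sentencizer. The call differs between
# spaCy v2 and v3, so check the installed major version first.
if int(spacy.__version__.split(".")[0]) >= 3:
    nlp.add_pipe("sentencizer")
else:
    nlp.add_pipe(nlp.create_pipe("sentencizer"))

# Now sentence boundaries are set and doc.sents works.
sentences = [sent.text for sent in nlp(text).sents]
print(sentences)
```

The sentencizer splits on punctuation only, so it's a rough substitute for the parser's sentence segmentation, but it's enough to make doc.sents available.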

Understood. I ran it without --base-model pt_core_news_lg and it works, thanks!

Great to hear it, thanks for reporting back!