Hello, I have updated Prodigy from 1.7.1 to 1.8.0, as well as spaCy to the latest version, 2.1.4. I have downloaded the latest version of en_vectors_web_lg (2.1.0), but when I try to train a model using the ner.batch-train recipe, I get the following error: "ValueError: [E030] Sentence boundaries unset. You can add the 'sentencizer' component to the pipeline with: nlp.add_pipe(nlp.create_pipe('sentencizer')) Alternatively, add the dependency parser, or set sentence boundaries by setting doc[i].is_sent_start."
What is interesting is that I was able to use that recipe successfully with older versions of spaCy/Prodigy.
I would really appreciate any help or suggestions on how to solve this error without rolling back versions.
Hi! That's definitely strange – I just had a look, and the ner.batch-train recipe should add the "sentencizer" component automatically if it's not present in the model's pipeline. Could you post the full traceback of where the error is raised?
And what happens if you create your own version of the base model with the sentencizer pre-added? Like this:
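```python
import spacy

# A minimal sketch: load the vectors model, pre-add the sentencizer and
# save the result to disk as a new base model (the output path is just
# a placeholder)
nlp = spacy.load("en_vectors_web_lg")
nlp.add_pipe(nlp.create_pipe("sentencizer"))
nlp.to_disk("/path/to/en_vectors_with_sentencizer")
```

You can then pass the path to that directory to ner.batch-train instead of the model name.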
Hi Ines, thank you for getting back to me so quickly. So...
The full traceback looks like this:
```
Loaded model en_vectors_web_lg
Using 20% of accept/reject examples (681) for evaluation
Traceback (most recent call last):
  File "/usr/local/Cellar/python/3.7.1/Frameworks/Python.framework/Versions/3.7/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/usr/local/Cellar/python/3.7.1/Frameworks/Python.framework/Versions/3.7/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/Users/d/Projects/Coach/.venv/lib/python3.7/site-packages/prodigy/__main__.py", line 380, in <module>
    controller = recipe(*args, use_plac=True)
  File "cython_src/prodigy/core.pyx", line 212, in prodigy.core.recipe.recipe_decorator.recipe_proxy
  File "/Users/d/Projects/Coach/.venv/lib/python3.7/site-packages/plac_core.py", line 328, in __call__
    cmd, result = parser.consume(arglist)
  File "/Users/d/Projects/Coach/.venv/lib/python3.7/site-packages/plac_core.py", line 207, in consume
    return cmd, self.func(*(args + varargs + extraopts), **kwargs)
  File "/Users/d/Projects/Coach/.venv/lib/python3.7/site-packages/prodigy/recipes/ner.py", line 602, in batch_train
    examples = list(split_sentences(model.orig_nlp, examples))
  File "cython_src/prodigy/components/preprocess.pyx", line 45, in split_sentences
  File "doc.pyx", line 595, in sents
ValueError: [E030] Sentence boundaries unset. You can add the 'sentencizer' component to the pipeline with: nlp.add_pipe(nlp.create_pipe('sentencizer')) Alternatively, add the dependency parser, or set sentence boundaries by setting doc[i].is_sent_start.
```
When I created my own version of the base model with the sentencizer, as you suggested, I still see the same error:
```
Loaded model en_vectors_with_sentencizer
Using 20% of accept/reject examples (681) for evaluation
Traceback (most recent call last):
  File "/usr/local/Cellar/python/3.7.1/Frameworks/Python.framework/Versions/3.7/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/usr/local/Cellar/python/3.7.1/Frameworks/Python.framework/Versions/3.7/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/Users/d/Projects/Coach/.venv/lib/python3.7/site-packages/prodigy/__main__.py", line 380, in <module>
    controller = recipe(*args, use_plac=True)
  File "cython_src/prodigy/core.pyx", line 212, in prodigy.core.recipe.recipe_decorator.recipe_proxy
  File "/Users/d/Projects/Coach/.venv/lib/python3.7/site-packages/plac_core.py", line 328, in __call__
    cmd, result = parser.consume(arglist)
  File "/Users/d/Projects/Coach/.venv/lib/python3.7/site-packages/plac_core.py", line 207, in consume
    return cmd, self.func(*(args + varargs + extraopts), **kwargs)
  File "/Users/d/Projects/Coach/.venv/lib/python3.7/site-packages/prodigy/recipes/ner.py", line 602, in batch_train
    examples = list(split_sentences(model.orig_nlp, examples))
  File "cython_src/prodigy/components/preprocess.pyx", line 45, in split_sentences
  File "doc.pyx", line 595, in sents
ValueError: [E030] Sentence boundaries unset. You can add the 'sentencizer' component to the pipeline with: nlp.add_pipe(nlp.create_pipe('sentencizer')) Alternatively, add the dependency parser, or set sentence boundaries by setting doc[i].is_sent_start.
```
Thanks! Btw, assuming that your training examples are sentences (and not long paragraphs etc.), you could probably work around this issue by setting the --unsegmented flag, which should skip the sentence splitting.
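For example (the dataset name here is just a placeholder):

```
prodigy ner.batch-train your_dataset en_vectors_web_lg --unsegmented
```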
Also, to get to the bottom of this, could you double-check one more thing for me? When you export your dataset, are there any single-token (e.g. one word) examples in there?
I just tested ner.batch-train with the large vectors model and the good news is, the sentence splitting does seem to work with the sentencizer. However, the is_sentenced check in spaCy (whether sentence boundaries have been applied) currently has one limitation: because the first token’s is_sent_start always defaults to True, it can’t tell whether boundaries have been applied if there’s only one token. We want to solve this in the future by rewriting the way sentence boundaries are stored in spaCy – but for now, this might explain why you’re seeing the error here.
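To illustrate the limitation, here's a rough sketch that should reproduce the problem under spaCy 2.1:

```python
import spacy

nlp = spacy.blank("en")
nlp.add_pipe(nlp.create_pipe("sentencizer"))

doc = nlp("hello")  # a single-token example
# With only one token, the is_sentenced check can't tell that the
# sentencizer actually ran, so this raises the same E030 error
sentences = list(doc.sents)
```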
Hi Ines, thank you so much for your quick reply and support around this issue.
So my training data consists mostly of sentences around 20–200 words – is the --unsegmented flag going to work OK for this size?
To answer your question: yes, I think I do have single-token samples. Here is a sample of my training data:
The --unsegmented flag only means that Prodigy won't apply the sentence segmenter to split your texts into sentences. If your examples are already pre-segmented, this is fine – but if your data contains lots of really long texts, you probably want to split them, because otherwise training may be slow and the long texts may throw off the model. So it should be fine in your case.
Ahh, I meant examples that consist of only one token. So basically, where "text" has only one word. Do you find any of those as well?
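If it helps, here's a quick way to check an exported dataset – a sketch, where "annotations.jsonl" is a placeholder for a file exported with db-out:

```python
import json

# Flag examples whose "text" is only a single word (a rough whitespace
# check, not spaCy's tokenization)
with open("annotations.jsonl", encoding="utf8") as f:
    for i, line in enumerate(f):
        eg = json.loads(line)
        if len(eg["text"].split()) <= 1:
            print(i, repr(eg["text"]))
```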
Do any of the entity spans you've annotated start or end on whitespace characters? In spaCy v2.1, it's now "illegal" for the named entity recognizer to predict entities that start or end with whitespace, or consist of only whitespace. For example, "\n", but also "hello\n". This should be a really helpful change, because those entities are pretty much always wrong, and making them "illegal" limits the options and moves the entity recognizer towards correct predictions. But it also means that if your data contains training examples like this, you probably want to remove or fix them.
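A similar sketch can find those, again with a placeholder file name:

```python
import json

# Flag annotated spans that start or end on whitespace, which spaCy
# v2.1's NER treats as illegal
with open("annotations.jsonl", encoding="utf8") as f:
    for i, line in enumerate(f):
        eg = json.loads(line)
        for span in eg.get("spans", []):
            span_text = eg["text"][span["start"]:span["end"]]
            if span_text != span_text.strip():
                print(i, repr(span_text))
```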
And I do have spans which start with "\n", based on the sample above.
So should I clean up the annotations by removing "\n" if present at the start/end position of the spans?
Also, should I remove the annotations where I have a single-word token?
That'd be the easiest solution, yes. I think you should also be able to change it to "answer": "ignore" for those examples, instead of deleting them. You can use the db-out command to export the data as JSONL, edit the file and then re-import it to a fresh dataset using db-in. That way, you'll also always have a copy of the original dataset and won't lose any information.
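The edit step could look something like this – a sketch that assumes you've exported with db-out and uses placeholder file names:

```python
import json

# Set "answer": "ignore" on problematic examples instead of deleting
# them: single-word texts, or spans that start/end on whitespace
with open("annotations.jsonl", encoding="utf8") as f:
    examples = [json.loads(line) for line in f]

for eg in examples:
    spans = [eg["text"][s["start"]:s["end"]] for s in eg.get("spans", [])]
    if len(eg["text"].split()) <= 1 or any(s != s.strip() for s in spans):
        eg["answer"] = "ignore"

with open("annotations_fixed.jsonl", "w", encoding="utf8") as f:
    for eg in examples:
        f.write(json.dumps(eg) + "\n")
```

You can then re-import the fixed file to a new dataset, e.g. prodigy db-in your_new_dataset annotations_fixed.jsonl.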
@ines I’m running into a similar issue, but so far as I can see there are no errant newlines in my data. Are there other characters that are banned in a span?