Recipe ner.batch-train results in ValueError: [E030]

Hello, I have updated Prodigy from 1.7.1 to 1.8.0, as well as spaCy to the latest version 2.1.4. I have also downloaded the latest version of en_vectors_web_lg (2.1.0), but when I try to train a model using the ner.batch-train recipe, I get the following error: "ValueError: [E030] Sentence boundaries unset. You can add the 'sentencizer' component to the pipeline with: nlp.add_pipe(nlp.create_pipe('sentencizer')) Alternatively, add the dependency parser, or set sentence boundaries by setting doc[i].is_sent_start."

Interestingly, I was able to use this recipe successfully with the older versions of spaCy/Prodigy.

I would really appreciate any help or suggestions on how to solve this error without rolling back to the previous versions.

Thank you very much.

Hi! That’s definitely strange – I just had a look and the ner.batch-train recipe should add the "sentencizer" component automatically if it’s not present in the model’s pipeline :thinking: Could you post the full traceback of where the error is raised?

And what happens if you create your own version of the base model with the sentencizer pre-added? Like this:

import spacy

# Load the vectors-only model, add a sentencizer to its pipeline
# and save the result to disk so it can be used as a base model
nlp = spacy.load("en_vectors_web_lg")
sentencizer = nlp.create_pipe("sentencizer")
nlp.add_pipe(sentencizer)
nlp.to_disk("/path/to/en_vectors_with_sentencizer")

Hi Ines, thank you for getting back to me so fast. So…
The full traceback looks like this:

Loaded model en_vectors_web_lg
Using 20% of accept/reject examples (681) for evaluation
Traceback (most recent call last):
  File "/usr/local/Cellar/python/3.7.1/Frameworks/Python.framework/Versions/3.7/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/usr/local/Cellar/python/3.7.1/Frameworks/Python.framework/Versions/3.7/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/Users/d/Projects/Coach/.venv/lib/python3.7/site-packages/prodigy/__main__.py", line 380, in <module>
    controller = recipe(*args, use_plac=True)
  File "cython_src/prodigy/core.pyx", line 212, in prodigy.core.recipe.recipe_decorator.recipe_proxy
  File "/Users/d/Projects/Coach/.venv/lib/python3.7/site-packages/plac_core.py", line 328, in __call__
    cmd, result = parser.consume(arglist)
  File "/Users/d/Projects/Coach/.venv/lib/python3.7/site-packages/plac_core.py", line 207, in consume
    return cmd, self.func(*(args + varargs + extraopts), **kwargs)
  File "/Users/d/Projects/Coach/.venv/lib/python3.7/site-packages/prodigy/recipes/ner.py", line 602, in batch_train
    examples = list(split_sentences(model.orig_nlp, examples))
  File "cython_src/prodigy/components/preprocess.pyx", line 45, in split_sentences
  File "doc.pyx", line 595, in sents
ValueError: [E030] Sentence boundaries unset. You can add the 'sentencizer' component to the pipeline with: nlp.add_pipe(nlp.create_pipe('sentencizer')) Alternatively, add the dependency parser, or set sentence boundaries by setting doc[i].is_sent_start.

When I created my own version of the base model with the sentencizer as you suggested, I still see the same error:

Loaded model en_vectors_with_sentencizer
Using 20% of accept/reject examples (681) for evaluation
Traceback (most recent call last):
  File "/usr/local/Cellar/python/3.7.1/Frameworks/Python.framework/Versions/3.7/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/usr/local/Cellar/python/3.7.1/Frameworks/Python.framework/Versions/3.7/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/Users/d/Projects/Coach/.venv/lib/python3.7/site-packages/prodigy/__main__.py", line 380, in <module>
    controller = recipe(*args, use_plac=True)
  File "cython_src/prodigy/core.pyx", line 212, in prodigy.core.recipe.recipe_decorator.recipe_proxy
  File "/Users/d/Projects/Coach/.venv/lib/python3.7/site-packages/plac_core.py", line 328, in __call__
    cmd, result = parser.consume(arglist)
  File "/Users/d/Projects/Coach/.venv/lib/python3.7/site-packages/plac_core.py", line 207, in consume
    return cmd, self.func(*(args + varargs + extraopts), **kwargs)
  File "/Users/d/Projects/Coach/.venv/lib/python3.7/site-packages/prodigy/recipes/ner.py", line 602, in batch_train
    examples = list(split_sentences(model.orig_nlp, examples))
  File "cython_src/prodigy/components/preprocess.pyx", line 45, in split_sentences
  File "doc.pyx", line 595, in sents
ValueError: [E030] Sentence boundaries unset. You can add the 'sentencizer' component to the pipeline with: nlp.add_pipe(nlp.create_pipe('sentencizer')) Alternatively, add the dependency parser, or set sentence boundaries by setting doc[i].is_sent_start.

Thanks! :+1: Btw, assuming that your training examples are sentences (and not long paragraphs etc.), you could probably work around this issue by setting the --unsegmented flag, which should skip the sentence splitting.

Also, to get to the bottom of this, could you double-check one more thing for me? When you export your dataset, are there any single-token (e.g. one word) examples in there?

I just tested ner.batch-train with the large vectors model and the good news is, the sentence splitting does seem to work with the sentencizer. However, the is_sentenced check in spaCy (whether sentence boundaries have been applied) currently has one limitation: because the first token’s is_sent_start always defaults to True, it can’t tell whether boundaries have been applied if there’s only one token. We want to solve this in the future by rewriting the way sentence boundaries are stored in spaCy – but for now, this might explain why you’re seeing the error here.

Hi Ines, thank you so much for your quick reply and support around this issue.
My training data consists mostly of sentences of around 20-200 words, so will the --unsegmented flag work OK for texts of that size?

To answer your question: yes, I think I do have single-token samples. Here is a sample of my training data:

{
  "text": "Hola everyone! It's big lunch wed. Serving from 11am to 4pm. 3061 Riverside dr. 90027 (under the bridge) Ricky.",
  "_input_hash": -91823319,
  "_task_hash": 1505548957,
  "tokens": [
    {
      "text": "Hola",
      "start": 0,
      "end": 4,
      "id": 0
    },
    {
      "text": "everyone",
      "start": 5,
      "end": 13,
      "id": 1
    },
    {
      "text": "!",
      "start": 13,
      "end": 14,
      "id": 2
    },
    {
      "text": "It",
      "start": 15,
      "end": 17,
      "id": 3
    },
    {
      "text": "'s",
      "start": 17,
      "end": 19,
      "id": 4
    },
    {
      "text": "big",
      "start": 20,
      "end": 23,
      "id": 5
    },
    {
      "text": "lunch",
      "start": 24,
      "end": 29,
      "id": 6
    },
    {
      "text": "we",
      "start": 30,
      "end": 32,
      "id": 7
    },
    {
      "text": "d",
      "start": 32,
      "end": 33,
      "id": 8
    },
    {
      "text": ".",
      "start": 33,
      "end": 34,
      "id": 9
    },
    {
      "text": "Serving",
      "start": 35,
      "end": 42,
      "id": 10
    },
    {
      "text": "from",
      "start": 43,
      "end": 47,
      "id": 11
    },
    {
      "text": "11",
      "start": 48,
      "end": 50,
      "id": 12
    },
    {
      "text": "am",
      "start": 50,
      "end": 52,
      "id": 13
    },
    {
      "text": "to",
      "start": 53,
      "end": 55,
      "id": 14
    },
    {
      "text": "4",
      "start": 56,
      "end": 57,
      "id": 15
    },
    {
      "text": "pm",
      "start": 57,
      "end": 59,
      "id": 16
    },
    {
      "text": ".",
      "start": 59,
      "end": 60,
      "id": 17
    },
    {
      "text": "3061",
      "start": 61,
      "end": 65,
      "id": 18
    },
    {
      "text": "Riverside",
      "start": 66,
      "end": 75,
      "id": 19
    },
    {
      "text": "dr",
      "start": 76,
      "end": 78,
      "id": 20
    },
    {
      "text": ".",
      "start": 78,
      "end": 79,
      "id": 21
    },
    {
      "text": "90027",
      "start": 80,
      "end": 85,
      "id": 22
    },
    {
      "text": "(",
      "start": 86,
      "end": 87,
      "id": 23
    },
    {
      "text": "under",
      "start": 87,
      "end": 92,
      "id": 24
    },
    {
      "text": "the",
      "start": 93,
      "end": 96,
      "id": 25
    },
    {
      "text": "bridge",
      "start": 97,
      "end": 103,
      "id": 26
    },
    {
      "text": ")",
      "start": 103,
      "end": 104,
      "id": 27
    },
    {
      "text": "Ricky",
      "start": 105,
      "end": 110,
      "id": 28
    },
    {
      "text": ".",
      "start": 110,
      "end": 111,
      "id": 29
    }
  ],
  "spans": [
    {
      "start": 48,
      "end": 52,
      "token_start": 12,
      "token_end": 13,
      "label": "start_time"
    },
    {
      "start": 56,
      "end": 59,
      "token_start": 15,
      "token_end": 16,
      "label": "end_time"
    },
    {
      "start": 61,
      "end": 79,
      "token_start": 18,
      "token_end": 21,
      "label": "address"
    },
    {
      "start": 80,
      "end": 85,
      "token_start": 22,
      "token_end": 22,
      "label": "zip"
    }
  ],
  "answer": "accept"
}

So I tried to train with the --unsegmented flag and got the following error:

Loaded model en_vectors_web_lg
Using 20% of accept/reject examples (681) for evaluation
Using 100% of remaining examples (2726) for training
Dropout: 0.2  Batch size: 16  Iterations: 10  


BEFORE      0.000              
Correct     0    
Incorrect   1495
Entities    0                  
Unknown     0                  

#            LOSS         RIGHT        WRONG        ENTS         SKIP         ACCURACY  
 13%|████████████████████████████                                                                                                                                                                                             | 352/2726 [00:02<00:18, 130.20it/s]['O', 'O', 'O', 'O', 'U-city', 'O', 'B-address', 'I-address', 'I-address', 'L-address', 'O', 'U-time_range', 'O', 'O', 'O', 'O', 'O', 'U-location', 'O', 'O', 'O', 'B-address', 'I-address', 'I-address', 'L-address', 'O', 'U-time_range', 'O']
['O', 'O', 'O', 'U-date', 'U-truck', 'O', 'O', 'O', 'U-city', 'B-address', 'L-address', 'O', 'B-city', 'L-city', 'U-zip', 'O', 'U-time_range', 'O']
['O', 'O', 'O', 'O', 'O', 'U-date', 'B-time_range', 'I-time_range', 'I-time_range', 'L-time_range', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'B-address', 'I-address', 'L-address', 'O', 'O', 'B-location', 'I-location', 'L-location', 'O']
['O', 'O', 'U-date', 'O', 'O', 'O', 'O', 'O', 'O', 'B-address', 'I-address', 'I-address', 'L-address', 'O', 'B-city', 'L-city', 'U-state', 'U-zip', 'O', 'O', 'B-time_range', 'I-time_range', 'L-time_range', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']
['U-address', 'U-zip', 'U-zip', 'U-zip', 'U-zip', 'U-zip', 'U-zip', 'U-zip', 'U-zip', 'U-zip', 'U-zip', 'U-zip', 'U-zip', 'U-zip', 'U-zip', 'U-zip', 'U-zip', 'U-zip', 'U-zip', 'U-zip', 'U-zip', 'U-zip', 'U-zip', 'U-zip', 'U-zip', 'U-zip', 'U-zip', 'U-zip', 'U-zip', 'U-zip', 'U-zip', 'U-zip', 'U-zip', 'U-zip', 'U-zip', 'U-zip', 'U-zip', 'U-intersection', 'U-truck']
['U-!time_range', 'U-!time_range', 'U-!start_time', 'U-!location', 'U-!start_time', 'U-!end_time', 'U-!end_time', 'U-!address', 'U-!zip', 'U-!address', 'U-!city', 'U-!location', 'U-!start_time', 'U-!address', 'U-!city', 'U-!end_time', 'U-!end_time', 'U-!location', 'U-!end_time', 'U-!end_time', 'U-!end_time', 'U-!end_time', 'U-!end_time', 'O']
['U-address', 'U-zip', 'U-zip', 'U-zip', 'U-intersection', 'U-truck']
['O', 'O', 'O', 'O', 'O', 'B-address', 'I-address', 'I-address', 'L-address', 'U-date', 'O', 'O', 'U-city', 'O', 'O']
['U-address', 'U-zip', 'U-zip', 'U-zip', 'U-zip', 'U-zip', 'U-zip', 'U-zip', 'U-zip', 'U-zip', 'U-zip', 'U-zip', 'U-zip', 'U-intersection', 'U-truck']
['U-address', 'U-zip', 'U-zip', 'U-zip', 'U-zip', 'U-zip', 'U-intersection', 'U-truck']
['U-address', 'U-zip', 'U-intersection', 'U-truck']
['U-address', 'U-zip', 'U-intersection', 'U-truck']
['O', 'B-time_range', 'I-time_range', 'L-time_range', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'U-truck', 'O', 'O', 'U-city', 'O', 'O', 'B-time_range', 'I-time_range', 'L-time_range', 'O', 'B-address', 'I-address', 'L-address', 'O', 'O']
['O', 'O', 'O', 'U-date', 'O', 'O', 'O', 'O', 'U-location', 'O', 'B-intersection', 'I-intersection', 'I-intersection', 'L-intersection', 'U-zip', 'O', 'B-time_range', 'I-time_range', 'L-time_range', 'O', 'O', 'O']
['O', 'O', 'O', 'B-date', 'L-date', 'O', 'O', 'O', 'U-truck', 'U-truck', 'U-truck', 'U-truck', 'U-truck', 'O', 'O']
['O', 'U-date', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'B-time_range', 'I-time_range', 'I-time_range', 'L-time_range', 'O', 'B-address', 'I-address', 'L-address', 'O', 'O', 'B-location', 'I-location', 'L-location', 'O', 'O', 'O']
Traceback (most recent call last):                                                                                                                                                                                                                                
  File "/usr/local/Cellar/python/3.7.1/Frameworks/Python.framework/Versions/3.7/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/usr/local/Cellar/python/3.7.1/Frameworks/Python.framework/Versions/3.7/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/Users/d/Projects/Coach/.venv/lib/python3.7/site-packages/prodigy/__main__.py", line 380, in <module>
    controller = recipe(*args, use_plac=True)
  File "cython_src/prodigy/core.pyx", line 212, in prodigy.core.recipe.recipe_decorator.recipe_proxy
  File "/Users/d/Projects/Coach/.venv/lib/python3.7/site-packages/plac_core.py", line 328, in __call__
    cmd, result = parser.consume(arglist)
  File "/Users/d/Projects/Coach/.venv/lib/python3.7/site-packages/plac_core.py", line 207, in consume
    return cmd, self.func(*(args + varargs + extraopts), **kwargs)
  File "/Users/d/Projects/Coach/.venv/lib/python3.7/site-packages/prodigy/recipes/ner.py", line 621, in batch_train
    examples, batch_size=batch_size, drop=dropout, beam_width=beam_width
  File "cython_src/prodigy/models/ner.pyx", line 362, in prodigy.models.ner.EntityRecognizer.batch_train
  File "cython_src/prodigy/models/ner.pyx", line 453, in prodigy.models.ner.EntityRecognizer._update
  File "cython_src/prodigy/models/ner.pyx", line 446, in prodigy.models.ner.EntityRecognizer._update
  File "cython_src/prodigy/models/ner.pyx", line 447, in prodigy.models.ner.EntityRecognizer._update
  File "/Users/dlukianenko/Projects/Foodtrucks/Coach/.venv/lib/python3.7/site-packages/spacy/language.py", line 457, in update
    proc.update(docs, golds, sgd=get_grads, losses=losses, **kwargs)
  File "nn_parser.pyx", line 413, in spacy.syntax.nn_parser.Parser.update
  File "nn_parser.pyx", line 519, in spacy.syntax.nn_parser.Parser._init_gold_batch
  File "transition_system.pyx", line 86, in spacy.syntax.transition_system.TransitionSystem.get_oracle_sequence
  File "transition_system.pyx", line 148, in spacy.syntax.transition_system.TransitionSystem.set_costs
ValueError: [E024] Could not find an optimal move to supervise the parser. Usually, this means the GoldParse was not correct. For example, are all labels added to the model?

The --unsegmented flag only means that Prodigy won’t apply the sentence segmenter to split your texts into sentences. If your examples are already pre-segmented, this is fine – but if your data contains lots of really long texts, you probably want to split them, because otherwise training may be slow and the long texts may throw off the model. So it should be fine in your case.

Ahh, I meant examples that consist of only one token. So basically, where "text" has only one word. Do you find any of those as well?
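If you want to scan an exported dataset for those programmatically, something like this should work. This is a minimal sketch, not part of Prodigy: `find_single_token_examples` is a hypothetical helper name, and `dataset.jsonl` stands in for whatever file your `db-out` export produced.

```python
import json

def find_single_token_examples(jsonl_path):
    """Yield (line_number, text) for every example in a Prodigy JSONL
    export whose "tokens" list contains exactly one entry."""
    with open(jsonl_path, encoding="utf8") as f:
        for i, line in enumerate(f, 1):
            line = line.strip()
            if not line:
                continue  # skip blank lines in the export
            eg = json.loads(line)
            if len(eg.get("tokens", [])) == 1:
                yield i, eg.get("text", "")

# Example usage:
# for line_no, text in find_single_token_examples("dataset.jsonl"):
#     print(line_no, repr(text))
```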

Do any of the entity spans you’ve annotated start or end on whitespace characters? In spaCy v2.1, it’s now “illegal” for the named entity recognizer to predict entities that start or end with whitespace, or consist of only whitespace – for example, "\n", but also "hello\n". This should be a really helpful change, because those entities are pretty much always wrong, and making them “illegal” limits the options and moves the entity recognizer towards correct predictions. But it also means that if your data contains training examples like this, you probably want to remove or fix them.
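To find those spans in a single exported example, a small helper like this could work – a sketch only, with `whitespace_boundary_spans` as an illustrative name (it’s not part of Prodigy or spaCy):

```python
def whitespace_boundary_spans(eg):
    """Return the annotated spans in a Prodigy example dict that start
    or end on a whitespace character, or consist only of whitespace --
    the kind of span spaCy v2.1's NER now treats as illegal."""
    text = eg["text"]
    bad = []
    for span in eg.get("spans", []):
        span_text = text[span["start"]:span["end"]]
        # strip() changes the text iff it starts or ends with whitespace
        # (a whitespace-only span strips down to the empty string)
        if span_text != span_text.strip():
            bad.append(span)
    return bad
```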

So yes, I do have annotations where "text" is only one word or a symbol like an emoji, e.g.:

{
  "text": "\n.",
  "_input_hash": -1423933053,
  "_task_hash": -453702377,
  "spans": [
    {
      "token_start": 0,
      "token_end": 0,
      "start": 0,
      "end": 1,
      "text": "\n",
      "label": "location",
      "source": "locations",
      "input_hash": -1423933053
    },
    {
      "token_start": 1,
      "token_end": 1,
      "start": 1,
      "end": 2,
      "text": ".",
      "label": "time_range",
      "source": "locations",
      "input_hash": -1423933053
    }
  ],
  "tokens": [
    {
      "text": "\n",
      "start": 0,
      "end": 1,
      "id": 0
    },
    {
      "text": ".",
      "start": 1,
      "end": 2,
      "id": 1
    }
  ],
  "answer": "reject"
}

and I do have spans that start with "\n", as in the sample above.

So should I clean up the annotations by removing spans that have "\n" at their start/end position?
Also, should I remove the annotations where I have a single-word token?

That’d be the easiest solution, yes. I think you should also be able to change it to "answer": "ignore" for those examples, instead of deleting them. You can use the db-out command to export the data as JSONL, edit the file and then re-import it to a fresh set using db-in. So you’ll also always have a copy of the original dataset and don’t lose any information.
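The editing step in between could look something like this – a minimal sketch that flips the answer to "ignore" for the two problem cases discussed above. `mark_ignored` is just an illustrative name, and you’d adjust the criteria and file paths to your data:

```python
import json

def mark_ignored(in_path, out_path):
    """Copy a Prodigy JSONL export, setting "answer": "ignore" on
    examples that have only a single token, or an annotated span that
    starts or ends on whitespace. All other examples pass through
    unchanged, so no information is lost."""
    with open(in_path, encoding="utf8") as f_in, \
         open(out_path, "w", encoding="utf8") as f_out:
        for line in f_in:
            eg = json.loads(line)
            text = eg.get("text", "")
            whitespace_span = any(
                text[s["start"]:s["end"]] != text[s["start"]:s["end"]].strip()
                for s in eg.get("spans", [])
            )
            if len(eg.get("tokens", [])) == 1 or whitespace_span:
                eg["answer"] = "ignore"
            f_out.write(json.dumps(eg) + "\n")
```

The cleaned file can then be re-imported to a fresh dataset with db-in, as described above.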

@ines I’m running into a similar issue, but so far as I can see there are no errant newlines in my data. Are there other characters that are banned in a span?

@oneextrafact Have you checked for other types of whitespace, like regular spaces?

Yes, that was it. The advice you gave here was very helpful for removing them, and after that everything worked fine. Thanks!!
