Bug with split_sentences/add_tokens in ner.batch-train

I have been working on importing annotations from Watson Knowledge Studio into Prodigy and have encountered a KeyError with batch-train whenever an annotated span ends with a single letter followed by a period, e.g. the span "example sentence A" in the sentence "An annotation that fails can be seen in example sentence A."

Replacing the period with a space or comma makes the error go away, as does swapping the final letter for a number or adding a second letter. Example code reproducing this bug can be found below:

import spacy
from prodigy.components.preprocess import split_sentences

nlp = spacy.load("en_core_web_sm")

# The sentences are identical except for how the annotated span ends
good1 = {"text": "An annotation that fails can be seen in example sentence 1.", "spans": [{"start": 40, "end": 58, "text": "example sentence 1", "label": "BUG_EXAMPLE"}]}   # number + period
good2 = {"text": "An annotation that fails can be seen in example sentence A,", "spans": [{"start": 40, "end": 58, "text": "example sentence A", "label": "BUG_EXAMPLE"}]}   # letter + comma
good3 = {"text": "An annotation that fails can be seen in example sentence AB.", "spans": [{"start": 40, "end": 59, "text": "example sentence AB", "label": "BUG_EXAMPLE"}]}  # two letters + period
bad   = {"text": "An annotation that fails can be seen in example sentence A.", "spans": [{"start": 40, "end": 58, "text": "example sentence A", "label": "BUG_EXAMPLE"}]}   # single letter + period

examples = [good1, good2, good3, bad]
for example in examples:
    list(split_sentences(nlp, [example]))
    print("works")

The output and error message:

works
works
works

---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
<ipython-input-96-b09644e1e993> in <module>()
      6 examples = [good1, good2, good3, bad]
      7 for example in examples:
----> 8     list(split_sentences(nlp, [example]))
      9     print("works")

cython_src/prodigy/components/preprocess.pyx in split_sentences()

cython_src/prodigy/components/preprocess.pyx in prodigy.components.preprocess._add_tokens()

KeyError: 58

Thanks, this is unfortunately a bug in that preprocessing function. The error occurs when you have spans that don’t align with the tokenization spaCy is predicting. Here, spaCy’s tokenizer doesn’t split the period off a single capital letter, so that initials like the "A." in "A. Smith" stay one token (two or more capitals, as in "AB.", do get split, which is why good3 passes). That means the span end at character 58 doesn’t fall on a token boundary, and that’s the 58 in the KeyError. You’re supposed to be able to tell the function to skip examples where the tokenization doesn’t match the annotations, but we’re missing a continue there, so it’s hitting the KeyError all the same.
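
To illustrate (a minimal sketch, not the actual Cython source), the preprocessor builds dicts mapping character offsets to token indices and then looks up each span’s offsets, so a misaligned span end hits the lookup directly:

doc = nlp.make_doc(bad["text"])
ends = {token.idx + len(token): token.i for token in doc}
for span in bad["spans"]:
    # "A." is one token covering characters 57-59, so 58 never appears
    # as a token end, and this lookup raises KeyError: 58
    token_end = ends[span["end"]]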

There are a few possible resolutions for this. One option would be a wrapper around spaCy’s tokenizer that overrules it when these conflicts occur. Another option would be to filter out the spans that don’t align to token boundaries from the data.
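
Here’s a rough sketch of the first option, in case it’s useful. The make_doc_respecting_spans helper is hypothetical (not part of Prodigy or spaCy): it forces token boundaries at the span offsets and lets spaCy tokenize everything in between:

from spacy.tokens import Doc

def make_doc_respecting_spans(nlp, eg):
    # Hypothetical helper: cut the text at every span boundary, tokenize
    # each piece with spaCy, then rebuild the Doc so the annotations
    # always align with token boundaries.
    text = eg["text"]
    cuts = sorted({0, len(text)}
                  | {span["start"] for span in eg["spans"]}
                  | {span["end"] for span in eg["spans"]})
    words = []
    spaces = []
    for start, end in zip(cuts, cuts[1:]):
        for token in nlp.make_doc(text[start:end]):
            words.append(token.text)
            # Caveat: non-space whitespace (e.g. newlines) is flattened
            # to a single space here, so this is a sketch, not a drop-in
            spaces.append(bool(token.whitespace_))
    return Doc(nlp.vocab, words=words, spaces=spaces)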

For the second option, you can detect the misaligned spans like this:

doc = nlp.make_doc(eg["text"])
# Character offsets of every token start and every token end
starts = {token.idx: token.i for token in doc}
ends = {token.idx + len(token): token.i for token in doc}
for span in eg["spans"]:
    if span["start"] not in starts or span["end"] not in ends:
        print("Span/token mismatch", span["text"])

I’ve already pushed a fix for the bug, but in the meantime I hope you can set up a workaround that keeps you moving. Let me know if you need more advice on that.

Thank you for the quick reply! I had some code in place to catch bad spans, since Watson allows spans to cover parts of tokens for some reason, but it failed for the scenario illustrated above. Using the code you provided, I now set a flag when copying spans from Watson’s format to a Prodigy-compatible one so that misaligned spans are skipped, and that seems to have solved the problem.

For anyone looking for a similar solution, I ended up doing the following:

import json

with open('output.jsonl', 'w') as outfile:
    for eg in documents:  # documents: examples already converted to Prodigy's format
        doc = nlp.make_doc(eg["text"])
        starts = {token.idx: token.i for token in doc}
        ends = {token.idx + len(token): token.i for token in doc}
        # Drop any span that doesn't start and end on a token boundary
        eg["spans"] = [span for span in eg["spans"] if span["start"] in starts and span["end"] in ends]
        outfile.write(json.dumps(eg) + "\n")
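
As a quick sanity check (reusing the same nlp object as above), you can re-run the preprocessor over the filtered file to confirm the KeyError is gone:

from prodigy.components.preprocess import split_sentences

with open('output.jsonl') as infile:
    filtered = [json.loads(line) for line in infile]
list(split_sentences(nlp, filtered))  # runs through without the KeyError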

Also, disabling unneeded pipeline components in the nlp model saves a lot of time if you are processing a lot of text:

nlp = spacy.load('en_core_web_sm', disable=['tagger', 'ner'])
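
One thing to keep in mind: split_sentences needs sentence boundaries, which en_core_web_sm gets from the dependency parser, so leave 'parser' enabled if you are still splitting sentences.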