ValueError: Mismatched tokenization. in ner.make-gold

Hi,

I wanted to try the ner.make-gold recipe on a dataset I’ve been building for some time (i.e. through several versions of Prodigy, including the beta…). I’m getting a ValueError: Mismatched tokenization telling me that Prodigy can’t find the span at the provided start/end indices (a very nice and clear error message, by the way :slight_smile:).

The thing is, I don’t want to throw away this dataset :). I guess this problem won’t happen a lot, so maybe you could log a warning saying that the example will be discarded and continue with the next example?

And concerning the error itself, it’s quite strange, because when I process the sentence with spaCy, I get the same start/end indices as those recorded in my dataset. Here is the example:

{
    "text": " The upstart streaming service, which is primarily geared for sports fans, has an uphill climb against deep-pocketed competitors marketing cable alternatives to cord-cutters: YouTube TV, Hulu Live and Sony's PlayStation Vue.",
    "spans": [
        {
            "answer": "accept",
            "end": 185,
            "input_hash": -2121127423,
            "label": "PRODUCT",
            "rank": 1,
            "score": 0.3231683859,
            "source": "core_web_sm",
            "start": 175,
            "text": "YouTube TV"
        },
        ...

Do you do any preprocessing on the sentences, like stripping whitespace?

Thomas

Thanks! And this is strange... Prodigy uses the model's tokenizer, so if you're also using the en_core_web_sm model, the tokenization should definitely match. Could you post more of the error message and the token index Prodigy is complaining about?

Based on the tokenized text, the character offsets are calculated for each token. You can also test this by running the following example, which will produce (text, start_char, end_char) tuples:

[(token.text, token.idx, token.idx + len(token.text)) for token in doc]

Using your example text, this correctly produces the following values for "YouTube TV". So "start": 175, "end": 185 should also correctly map to those two tokens :thinking:

('YouTube', 175, 182), ('TV', 183, 185)
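
For reference, here's a fully self-contained version of that check (a minimal sketch, assuming the en_core_web_sm model is installed):

import en_core_web_sm

nlp = en_core_web_sm.load()
text = " The upstart streaming service, which is primarily geared for sports fans, has an uphill climb against deep-pocketed competitors marketing cable alternatives to cord-cutters: YouTube TV, Hulu Live and Sony's PlayStation Vue."
doc = nlp(text)
# Print the (text, start_char, end_char) tuple for each token
print([(token.text, token.idx, token.idx + len(token.text)) for token in doc])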

I also just tested it manually with this bare-bones example, and it worked for me:

from prodigy.components.preprocess import add_tokens
import en_core_web_sm

nlp = en_core_web_sm.load()
text = " The upstart streaming service, which is primarily geared for sports fans, has an uphill climb against deep-pocketed competitors marketing cable alternatives to cord-cutters: YouTube TV, Hulu Live and Sony's PlayStation Vue."
stream = [{'text': text, 'spans': [{'start': 175, 'end': 185}]}]
new_stream = add_tokens(nlp, stream)  # adds a 'tokens' property and maps the span to token indices
print(list(new_stream))

As a workaround, you could also try setting the "token_start" and "token_end" values manually, for example:

{"start": 175, "end": 185, "token_start": 31, "token_end": 32, ...}

That's a good idea – we could add a skip keyword argument to the add_tokens pre-processor that lets you skip the example instead of raising the error.
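
In the meantime, you could get the same effect on your end by wrapping add_tokens and catching the error per example. Here's a rough sketch (not an official API, just the add_tokens pre-processor used as in the example above):

from prodigy.components.preprocess import add_tokens

def add_tokens_or_skip(nlp, stream):
    # Tokenize one example at a time, so a single mismatched
    # example doesn't take down the whole stream
    for eg in stream:
        try:
            yield from add_tokens(nlp, [eg])
        except ValueError:
            print("Skipping example with mismatched tokenization:", eg.get('text', '')[:50])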

Here is the whole error message:

File "cython_src/prodigy/components/preprocess.pyx", line 90, in add_tokens
ValueError: Mismatched tokenization. Can't resolve span to token index 175. This can happen if your data contains pre-set spans. Make sure that the spans match spaCy's tokenization or add a 'tokens' property to your task. 

{'start': 175, 'end': 185, 'text': 'YouTube TV', 'rank': 1, 'label': 'PRODUCT', 'score': 0.3231683859, 'source': 'core_web_sm', 'input_hash': -2121127423, 'answer': 'accept'}

And yes, your example works fine :confused:

Thanks for the update – I think I found the problem. It looks like there’s a bug in the split_sentences pre-processor that doesn’t overwrite the pre-set spans correctly if there are no spans within one of the sentences. This will be fixed in the next release – sorry about that!

Since spaCy splits your example text into two sentences (on the “:”), the first sentence keeps the stray spans, whose offsets can’t be mapped to any of its tokens, so Prodigy complains. If your dataset doesn’t contain long texts, a simple workaround for now could be to comment out the following line in ner.make-gold:

stream = split_sentences(nlp, stream)
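
To double-check the segmentation on your end, you can inspect spaCy's sentence boundaries directly, e.g.:

import en_core_web_sm

nlp = en_core_web_sm.load()
doc = nlp(text)  # the example text from above
print([sent.text for sent in doc.sents])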

Yes, you’re right, the problem comes from split_sentences.

I’m almost certain that this example comes from a text that was split using the same split_sentences… but maybe it was with another model.

This issue should be fixed in v1.4.0, which we just released! You can now also disable sentence segmentation in all relevant recipes by setting --unsegmented on the command line.
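
For example (dataset and file names here are placeholders):

prodigy ner.make-gold your_dataset en_core_web_sm your_data.jsonl --label PRODUCT --unsegmented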

You can also define a split_sents_threshold in your prodigy.json, which is the minimum character length a text needs to have before it's segmented into sentences (if possible). This lets you implement your own segmentation logic, while at the same time having a fallback in place in case a very long example slips through. (Otherwise, this would have a significant impact on the beam search logic used in the active learning-powered recipes.) For example, "split_sents_threshold": 5000 would tell Prodigy to only try and split texts with more than 5000 characters.
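
In your prodigy.json, that would look like this:

{
    "split_sents_threshold": 5000
}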