ner.manual gives ValueError: Mismatched tokenization.

Hi all,

Before I describe the exception I am getting, I want to give a little bit of context.
I managed to collect some automated NE annotations (diseases, in my case) on a bunch of texts, and I want to use Prodigy to collect feedback on these annotations. For the moment I do not want to do active learning within Prodigy; I plan to do that a little later.

So, I am aware I can load my annotated text either using a custom recipe or by generating a JSONL file. To try it out, I first wrote a script to generate a JSONL file with the tasks (a rough sketch of that step follows the example below).

An example would be the following:

{
  "text": "alecensa as monotherapy is indicated for the first-line treatment of adult patients with anaplastic lymphoma kinase (alk)-positive advanced non-small cell lung cancer (nsclc).alecensa as monotherapy is indicated for the treatment of adult patients with alk‑positive advanced nsclc previously treated with crizotinib.",
  "meta": {
    "first_sentence": "",
    "source": "type",
    "indication_id": "4",
    "annotation_id": "483924"
  },
  "spans": [
    {
      "end": 173,
      "source": "type",
      "text": "nsclc",
      "rank": 0,
      "label": "INDICATION",
      "start": 168,
      "score": 0.5
    }
  ]
}
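
The script that generates these lines is nothing fancy; roughly (with placeholder text and offsets rather than my real data), it does something like this:

import json

# placeholder task mirroring the structure of the example above
task = {
    "text": "example indication text mentioning nsclc.",
    "meta": {"source": "type", "indication_id": "4", "annotation_id": "483924"},
    "spans": [
        {"start": 35, "end": 40, "text": "nsclc", "rank": 0,
         "label": "INDICATION", "score": 0.5, "source": "type"}
    ]
}

with open("custom.jsonl", "w", encoding="utf-8") as f:
    f.write(json.dumps(task) + "\n")  # one JSON object per line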

I tried to use ner.manual instead of ner.teach because I did not want active learning at the moment.
So I ran: prodigy ner.manual condition_terms en_core_web_md custom.jsonl --label INDICATION

The web server starts, but when I hit it, it fails with the following stack trace:

16:49:48 - Exception when serving /get_questions
Traceback (most recent call last):
  File "/home/ubuntu/prodigy/venv/lib/python3.5/site-packages/waitress/channel.py", line 338, in service
    task.service()
  File "/home/ubuntu/prodigy/venv/lib/python3.5/site-packages/waitress/task.py", line 169, in service
    self.execute()
  File "/home/ubuntu/prodigy/venv/lib/python3.5/site-packages/waitress/task.py", line 399, in execute
    app_iter = self.channel.server.application(env, start_response)
  File "/home/ubuntu/prodigy/venv/lib/python3.5/site-packages/hug/api.py", line 424, in api_auto_instantiate
    return module.hug_wsgi(*args, **kwargs)
  File "/home/ubuntu/prodigy/venv/lib/python3.5/site-packages/falcon/api.py", line 244, in __call__
    responder(req, resp, **params)
  File "/home/ubuntu/prodigy/venv/lib/python3.5/site-packages/hug/interface.py", line 734, in __call__
    raise exception
  File "/home/ubuntu/prodigy/venv/lib/python3.5/site-packages/hug/interface.py", line 709, in __call__
    self.render_content(self.call_function(input_parameters), request, response, **kwargs)
  File "/home/ubuntu/prodigy/venv/lib/python3.5/site-packages/hug/interface.py", line 649, in call_function
    return self.interface(**parameters)
  File "/home/ubuntu/prodigy/venv/lib/python3.5/site-packages/hug/interface.py", line 100, in __call__
    return __hug_internal_self._function(*args, **kwargs)
  File "/home/ubuntu/prodigy/venv/lib/python3.5/site-packages/prodigy/app.py", line 84, in get_questions
    tasks = controller.get_questions()
  File "cython_src/prodigy/core.pyx", line 87, in prodigy.core.Controller.get_questions
  File "cython_src/prodigy/core.pyx", line 71, in iter_tasks
  File "cython_src/prodigy/components/preprocess.pyx", line 132, in add_tokens
ValueError: Mismatched tokenization. Can't resolve span to token index 173. This can happen if your data contains pre-set spans. Make sure that the spans match spaCy's tokenization or add a 'tokens' property to your task.

{'text': 'nsclc', 'end': 173, 'label': 'INDICATION', 'start': 168, 'score': 0.5, 'rank': 0, 'token_start': 28, 'source': 'type'}

The start and end keys have the right values, so I am confused about why it is failing. If I use the ner.teach recipe, it doesn't complain when loading the tasks.

I am probably doing something wrong, so it would be great if you could shed some light here.
I also thought of creating my own recipe based on ner.teach by changing prefer_uncertain(predict(stream)) to just stream. It would be great to have your opinion on this.

Many thanks for your great work.

The problem here is related to the token indices, not the character offsets. In the manual NER mode, the text is pre-tokenized to allow token-based highlighting and faster annotation, because your selection can snap to token boundaries.

If the text already has pre-defined spans, Prodigy will try to match them up with the tokenization and will add a token_start and token_end property to each span. You can check out spaCy’s tokenization by running the following:

import spacy

nlp = spacy.load('en_core_web_sm')  # or whichever model you're using
doc = nlp(u"alecensa as monotherapy is indicated for the first-line treatment of adult patients with anaplastic lymphoma kinase (alk)-positive advanced non-small cell lung cancer (nsclc).alecensa as monotherapy is indicated for the treatment of adult patients with alk‑positive advanced nsclc previously treated with crizotinib.")
print([token.text for token in doc])

I suspect that the problem might be this part: (nsclc).alecensa. If the punctuation isn’t split off from “nsclc”, Prodigy isn’t able to find a token that starts at character 168 and ends at character 173.

To solve this, you can either add your own "tokens" property to the task that tells Prodigy how the text should be tokenized (see the "Annotation task formats" section in the docs for an example of this), or you can add another rule to spaCy's tokenizer that forces stricter splitting on punctuation and then save out the model and use that instead (which will serialize your custom rules, too).
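
For the second option, a rough sketch could look like this (assuming spaCy v2.x; the extra infix rules are only meant to illustrate the mechanism, so you'd want to tune them for your data):

import spacy
from spacy.util import compile_infix_regex

nlp = spacy.load("en_core_web_md")
# extra infix rules: also split on ")" and on a "." that follows a letter or
# closing bracket and precedes a letter, so "(nsclc).alecensa" comes apart
custom_infixes = list(nlp.Defaults.infixes) + [r"\)", r"(?<=[\w\)])\.(?=\w)"]
infix_re = compile_infix_regex(custom_infixes)
nlp.tokenizer.infix_finditer = infix_re.finditer

doc = nlp("non-small cell lung cancer (nsclc).alecensa as monotherapy")
print([token.text for token in doc])

nlp.to_disk("/path/to/custom_model")  # pass this path to ner.manual instead of en_core_web_md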

Finally, if the runtime model you’ll be training with the data later on won’t actually have to deal with punctuation like the example above, you could also just edit this text and add a space, so that your entity is split off correctly.

The predict function predicts all possible entities in the text, and the prefer_uncertain function sorts them by score, and focuses on the ones that the model is most uncertain about (the predictions with a score closest to 0.5). So if you remove that, you will see the examples from the stream as they come in.
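
If you do want to write that custom recipe, a stripped-down sketch could look roughly like this (the recipe name is made up, and it assumes your JSONL already contains the spans you want feedback on):

import prodigy
from prodigy.components.loaders import JSONL

@prodigy.recipe("ner.feedback")
def ner_feedback(dataset, source):
    # no model in the loop and no prefer_uncertain sorting: just stream in
    # the pre-annotated tasks in the order they come in
    stream = JSONL(source)
    return {
        "dataset": dataset,
        "stream": stream,
        "view_id": "ner",  # binary interface showing one highlighted span per task
    }

You would then run it with the -F flag pointing to the recipe file, e.g. prodigy ner.feedback condition_terms custom.jsonl -F recipe.py.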

Instead of doing that, you might just want to use the mark recipe with --view-id ner. This lets you stream in your pre-annotated text and asks you for binary feedback.
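
For your use case, that would be something like this (the dataset name is just taken from your command above, any dataset will do):

prodigy mark condition_terms custom.jsonl --view-id ner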

@ines thank you so much for the explanation. In the end, I decided to use the mark recipe. It was just what I was looking for.

I have a question: I still get this error in ner.manual even though I have added the "tokens" property to the input! Can you please help me? I would like my pre-defined tokens to be used instead of en_core_web_sm's tokenizer.

@najmehs Are you able to share an example of a text plus span plus custom tokens that it complains about? And can you double-check that your tokens all have IDs and start/end character offsets, and that any spans you have pre-defined map to the correct token boundaries?

Here is an example, which ner.manual complains about:
{"meta": {"i": 0}, "text": "RBC-123", "tokens": [{"text": "RBC", "start": 0, "end": 3, "id": 0}, {"text": "123", "start": 4, "end": 7, "id": 1}], "spans": [{"start": 0, "end": 3, "label": "Lab"}]}

Thanks for sharing! I just tried it, and it seems like if you add "token_start": 0 and "token_end": 0 to the span, it works as expected! Prodigy should be able to figure this out by itself if the values are not set, though. I'll look into this.

In the meantime, here's a working example:

from prodigy.components.preprocess import add_tokens
import spacy

nlp = spacy.load("en_core_web_sm")  # this won't be used, we just need to pass in an nlp object
stream = [{"meta": {"i": 0}, "text": "RBC-123", "tokens": [{"text": "RBC", "start": 0, "end": 3, "id": 0}, {"text": "123", "start": 4, "end": 7, "id": 1}], "spans": [{"start": 0, "end": 3, "label": "Lab", "token_start": 0, "token_end": 0}]}]  
new_stream = list(add_tokens(nlp, stream))

Thanks for your prompt response. But if I add "token_start" and "token_end", it does not highlight "RBC"; instead, the whole "RBC-123" is highlighted!

@ines I appreciate your help, I have been stuck on this for some time!

Okay, sorry, I'll take another look! For now, just comment out the line that calls add_tokens in the ner.manual recipe. If your incoming data has all required attributes set, you won't need it anyway.
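
If you'd rather not edit the built-in recipe, a minimal custom recipe that skips add_tokens entirely could look roughly like this (the recipe name is made up, and it assumes every incoming task already has "tokens" plus token_start/token_end on its spans):

import prodigy
from prodigy.components.loaders import JSONL

@prodigy.recipe("ner.manual-pretokenized")
def ner_manual_pretokenized(dataset, source):
    # no nlp object and no add_tokens call: the incoming JSONL already
    # provides "tokens" and aligned span offsets
    stream = JSONL(source)
    return {
        "dataset": dataset,
        "stream": stream,
        "view_id": "ner_manual",
        "config": {"labels": ["Lab"]},  # label set shown in the manual interface
    }

You can run it with prodigy ner.manual-pretokenized your_dataset your_data.jsonl -F recipe.py.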