Relation recipe missing span annotations on custom tokens because the tokenization didn't match

We are trying to manually review our relation annotations over two spans.

The recipe I am using:

prodigy rel.manual db_name en_core_web_lg /pathtojsonl.jsonl --label attributed_to --span-label ENT1,ENT2 --wrap

An example of the data:

{"text": "Roberts-Smith denies any wrongdoing.", "tokens": [{"text": "Roberts-Smith", "start": 0, "end": 13, "id": 0}, {"text": "denies", "start": 14, "end": 20, "id": 1}, {"text": "any wrongdoing", "start": 21, "end": 35, "id": 2}, {"text": ".", "start": 35, "end": 36, "id": 3}], "spans": [{"start": 21, "end": 35, "token_start": 2, "token_end": 3, "label": "ENT2"}, {"start": 0, "end": 13, "token_start": 0, "token_end": 1, "label": "ENT1"}], "relations": [{"head": 2, "child": 0, "label": "attributed_to", "head_span": {"start": 21, "end": 35, "token_start": 2, "token_end": 3, "label": "ENT2"}, "child_span": {"start": 0, "end": 13, "token_start": 0, "token_end": 1, "label": "ENT1"}}]}

While running the recipe I get this warning message:

⚠ Skipped 2 span(s) that were already present in the input data because
the tokenization didn't match.
⚠ Skipped 2 span(s) that were already present in the input data because
the tokenization didn't match.
⚠ Skipped 2 span(s) that were already present in the input data because
the tokenization didn't match.

The relations are highlighted correctly; the span labels just appear on some examples and are missing on others.

The problem is that when I run db-out, I'm missing all my span labels on the examples that triggered the warning.

How should I handle this warning so that I can use custom tokens and spans?

I tried reproducing the issue and didn't see the same warning on the example that you shared.

I saved this into examples.jsonl locally.

{"text": "Roberts-Smith denies any wrongdoing.", "tokens": [{"text": "Roberts-Smith", "start": 0, "end": 13, "id": 0}, {"text": "denies", "start": 14, "end": 20, "id": 1}, {"text": "any wrongdoing", "start": 21, "end": 35, "id": 2}, {"text": ".", "start": 35, "end": 36, "id": 3}], "spans": [{"start": 21, "end": 35, "token_start": 2, "token_end": 3, "label": "ENT2"}, {"start": 0, "end": 13, "token_start": 0, "token_end": 1, "label": "ENT1"}], "relations": [{"head": 2, "child": 0, "label": "attributed_to", "head_span": {"start": 21, "end": 35, "token_start": 2, "token_end": 3, "label": "ENT2"}, "child_span": {"start": 0, "end": 13, "token_start": 0, "token_end": 1, "label": "ENT1"}}]}

I annotated it.

python -m prodigy rel.manual db_name en_core_web_lg issue-5941/examples.jsonl --label attributed_to --span-label ENT1,ENT2 --wrap

And I'm able to run db-out without any issues popping up.

python -m prodigy db-out db_name

This makes me wonder whether there might be an issue on your machine related to different Python environments and different spaCy versions. When spans don't match the tokens, it may be because two different tokenisers were used, which could be caused by a virtualenv mishap. It's hard to know for sure if this is the issue, but it's worth checking.
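As a first sanity check on the data itself, you can verify that every span's character offsets land on boundaries of the custom tokens in the same example. This is not an official Prodigy tool, just a stdlib-only sketch (`misaligned_spans` is a hypothetical helper); if your own JSONL passes this check, the mismatch is more likely coming from a different tokeniser in the loop, which again points at the environment.

```python
def misaligned_spans(example):
    """Return spans whose character offsets don't fall on token boundaries."""
    starts = {t["start"] for t in example["tokens"]}
    ends = {t["end"] for t in example["tokens"]}
    return [s for s in example.get("spans", [])
            if s["start"] not in starts or s["end"] not in ends]

# The example from this thread, as a dict. Both spans line up with the
# custom tokens, so nothing is reported.
example = {
    "text": "Roberts-Smith denies any wrongdoing.",
    "tokens": [
        {"text": "Roberts-Smith", "start": 0, "end": 13, "id": 0},
        {"text": "denies", "start": 14, "end": 20, "id": 1},
        {"text": "any wrongdoing", "start": 21, "end": 35, "id": 2},
        {"text": ".", "start": 35, "end": 36, "id": 3},
    ],
    "spans": [
        {"start": 21, "end": 35, "token_start": 2, "token_end": 3, "label": "ENT2"},
        {"start": 0, "end": 13, "token_start": 0, "token_end": 1, "label": "ENT1"},
    ],
}

print(misaligned_spans(example))  # → []
```

If this reports nothing for the examples that still trigger the warning, the offsets themselves are fine relative to your custom tokens, and the tokens produced inside the recipe must differ from the ones in your file.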

Some things to check

Could you check that the which commands below point to a virtualenv?

> which python
> which prodigy

Here's what my machine returns.

> which python
/home/vincent/Development/prodigy-demos/venv/bin/python
> which prodigy
/home/vincent/Development/prodigy-demos/venv/bin/prodigy

Also, could you run your commands with python -m prodigy instead of prodigy to see what happens? This forces the copy of Prodigy installed in the venv to be used.

Could you also run the following python -m pip freeze commands?

> python -m pip freeze | grep prodigy       
> python -m pip freeze | grep spacy  

This is what my machine returns:

> python -m pip freeze | grep prodigy       
prodigy==1.11.7

> python -m pip freeze | grep spacy  
en-core-web-lg @ https://github.com/explosion/spacy-models/releases/download/en_core_web_lg-3.4.0/en_core_web_lg-3.4.0-py3-none-any.whl
spacy==3.4.1

You'll notice in the final command there that the en_core_web_lg version matches the 3.4.x version of spaCy.
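If grep isn't available on your system (for example on Windows), here's a stdlib-only alternative using importlib.metadata; this is just a sketch, and the package names are the ones from the pip freeze output above:

```python
from importlib.metadata import PackageNotFoundError, version

# Print the installed version of each relevant package, as seen by the
# interpreter that runs this script.
for pkg in ("prodigy", "spacy", "en-core-web-lg"):
    try:
        print(pkg, version(pkg))
    except PackageNotFoundError:
        print(pkg, "not installed")
```

Because it runs inside the interpreter, it reports what that specific environment can actually import, which is exactly what matters here.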

Finally, what Python version are you running?
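You can grab both the version and the interpreter path from Python directly; a minimal stdlib snippet (nothing Prodigy-specific):

```python
import platform
import sys

# If sys.executable doesn't point into your venv, prodigy and spaCy are
# likely being loaded from a different environment than you expect.
print(platform.python_version())
print(sys.executable)
```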