Disappearing spans when using data-to-spacy


I have a dataset of contracts annotated for NER (outside of Prodigy), which I first import into Prodigy (using db-in) and then export to a spaCy dataset using data-to-spacy.
Training gave me poor results, and upon inspection I realized that some examples in the spaCy dataset had no spans, even though the spans were present in my original data.
They seem to disappear during the conversion to spaCy.

To reproduce it, here is the JSONL export (via db-out) of a single document that has this issue.

I generate the spacy dataset like so:

prodigy db-in sample_dataset sample_dataset.jsonl

mkdir spacy
prodigy data-to-spacy spacy/ --ner sample_dataset

Then I use the following Python code to check the generated train.spacy:

from spacy.training import Corpus

import spacy

filepath = "spacy/train.spacy"
nlp = spacy.blank("en")
corpus = Corpus(filepath)
train_data = corpus(nlp)
examples = list(train_data)
print(examples[0].reference.ents)
# Prints an empty tuple: ()
# But other examples I have tested do have some ents!

Any idea what could be going on?

Ubuntu 22.04.3 LTS
Prodigy 1.14.9 (same issue after upgrading to 1.14.12)
spaCy 3.7.2

I'm not 100% sure, but I think there may be a token mismatch here. I've taken your example and fed it to our ner.manual recipe like so:

python -m prodigy ner.manual xxx blank:en debug.jsonl --label PERSONAL_ATTRIBUTE,ORGANIZATION,NAME

This will take the existing annotations and render them in the UI. However, I get this error:

Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "/Users/vincent/Development/prodigy/prodigy/__main__.py", line 50, in <module>
  File "/Users/vincent/Development/prodigy/prodigy/__main__.py", line 44, in main
    controller = run_recipe(run_args)
  File "/Users/vincent/Development/prodigy/prodigy/cli.py", line 110, in run_recipe
    return Controller.from_components(command, components)
  File "/Users/vincent/Development/prodigy/prodigy/core.py", line 155, in from_components
    return cls(
  File "/Users/vincent/Development/prodigy/prodigy/core.py", line 307, in __init__
    if stream.is_empty:
  File "/Users/vincent/Development/prodigy/prodigy/components/stream.py", line 189, in is_empty
    return self.peek() is None
  File "/Users/vincent/Development/prodigy/prodigy/components/stream.py", line 204, in peek
    item = self._get_from_iterator()
  File "/Users/vincent/Development/prodigy/prodigy/components/stream.py", line 317, in _get_from_iterator
    data = next(self._iterator)
  File "/Users/vincent/Development/prodigy/prodigy/components/decorators.py", line 165, in inner
    yield from final_stream  # type: ignore
  File "/Users/vincent/Development/prodigy/prodigy/components/preprocess.py", line 203, in add_tokens
    _add_tokens(eg, doc, skip, overwrite, use_chars=use_chars)
  File "/Users/vincent/Development/prodigy/prodigy/components/preprocess.py", line 303, in _add_tokens
    eg["spans"] = sync_spans_to_tokens(eg["spans"], eg["tokens"], skip)
  File "/Users/vincent/Development/prodigy/prodigy/components/preprocess.py", line 282, in sync_spans_to_tokens
    raise ValueError(err.format(end_idx, repr(span)))
ValueError: Mismatched tokenization. Can't resolve span to token index 230. This can happen if your data contains pre-set spans. Make sure that the spans match spaCy's tokenization or add a 'tokens' property to your task.

{'start': 228, 'end': 230, 'text': 'Dr', 'label': 'NAME', 'token_start': 48}

However, when I db-in this dataset and train on it, I don't seem to see any complaints.

python -m prodigy db-in xxx-debug debug.jsonl
python -m prodigy train --ner xxx-debug      
ℹ Using CPU

========================= Generating Prodigy config =========================
ℹ Auto-generating config with spaCy
✔ Generated training config

=========================== Initializing pipeline ===========================
[2023-12-14 13:15:08,316] [INFO] Set up nlp object from config
Components: ner
Merging training and evaluation data for 1 components
  - [ner] Training: 1 | Evaluation: 0 (20% split)
Training: 1 | Evaluation: 0
Labels: ner (4)
[2023-12-14 13:15:08,325] [INFO] Pipeline: ['tok2vec', 'ner']
[2023-12-14 13:15:08,327] [INFO] Created vocabulary
[2023-12-14 13:15:08,327] [INFO] Finished initializing nlp object
[2023-12-14 13:15:08,508] [INFO] Initialized pipeline components: ['tok2vec', 'ner']
✔ Initialized pipeline

============================= Training pipeline =============================
Components: ner
Merging training and evaluation data for 1 components
  - [ner] Training: 1 | Evaluation: 0 (20% split)
Training: 1 | Evaluation: 0
Labels: ner (4)
ℹ Pipeline: ['tok2vec', 'ner']
ℹ Initial learn rate: 0.001
E    #       LOSS TOK2VEC  LOSS NER  ENTS_F  ENTS_P  ENTS_R  SCORE
---  ------  ------------  --------  ------  ------  ------  ------
  0       0          0.00      0.00    0.00    0.00    0.00    0.00
200     200          0.00      0.00    0.00    0.00    0.00    0.00

So something is going slightly awry on our end in terms of data validation. The train recipe uses the data-to-spacy functionality internally, so I'll do a bit of a deep dive here. In the meantime, you may want to confirm whether you see the same error when you feed the data to ner.manual. You probably will, which implies something may have gone wrong translating the data from your other tool into the Prodigy format.

I'll report back when I've learned more. Please let me know if you've learned more but are still stuck. Even if Prodigy has a bug, I can still try and get you unstuck in the meantime! :slight_smile:


Thank you for your investigation!

I can confirm that I receive the same error when using ner.manual.

The only data validation I have performed is to compare each span's text with the text at the corresponding character range (i.e., text[span["start"]:span["end"]] == span["text"]), and they all match.
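For reference, the character-offset check described above can be sketched as follows. The example record here is made up for illustration; the real data would come from the JSONL file:

```python
# Sketch of the validation described above: each span's stored text
# should equal the slice of the document text it points at.
example = {
    "text": "Signed by Dr Smith.",
    "spans": [{"start": 10, "end": 12, "text": "Dr", "label": "NAME"}],
}

def spans_match_text(example):
    # True if every span's stored text matches the corresponding slice.
    return all(
        example["text"][span["start"]:span["end"]] == span["text"]
        for span in example.get("spans", [])
    )

print(spans_match_text(example))  # True
```

Note that this check only validates character offsets against the text; it says nothing about whether those offsets line up with token boundaries, which is what Prodigy and spaCy complain about.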

Now, the API I use only returns spans based on character indices, and I am unsure which tokenization method it employs. Could the issue be a mismatch between its tokenization and the tokenization used by spaCy's blank:en pipeline?
If I'm not mistaken, blank:en treats "Dr." as a single token, while my annotation covers only "Dr".
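That hunch can be checked directly. Here is a minimal sketch (with a made-up sentence) showing that spaCy's English tokenizer keeps "Dr." as one token, so a character span that stops before the period can't be resolved to a token boundary:

```python
import spacy

nlp = spacy.blank("en")
doc = nlp("Signed by Dr. Smith.")

# "Dr." is a tokenizer exception and stays a single token.
print([t.text for t in doc])  # ['Signed', 'by', 'Dr.', 'Smith', '.']

# A character span covering just "Dr" (offsets 10-12) doesn't line up
# with any token boundary, so char_span returns None by default.
print(doc.char_span(10, 12))  # None
```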

As an aside, for extra context: all base English models inside of spaCy use the same tokeniser under the hood. So the tokens from nlp = spacy.blank("en") should be the same as those from spacy.load("en_core_web_sm"). These tokens are all determined by the same rule-based system.

But yeah, it does sound like there's a mismatch. One avenue to explore is to retokenize everything using this spaCy method, which comes with an alignment_mode parameter that should let you wiggle around minor character-offset issues. Beware that this is an automated method, so it may also highlight spans that weren't originally intended. But it could help with your current issue, if only as a temporary measure.

import srsly 
import spacy 

nlp = spacy.blank("en")
ex = next(srsly.read_jsonl("debug.jsonl"))
doc = nlp(ex['text'])
doc.char_span(228, 230, label="NAME", alignment_mode="expand")
# Returns `Dr.` 

Have you tried something like that?


That worked! The spacy dataset now has all the spans.

Here is the script I used, for future reference:

"""Fix misaligned spans in a Prodigy JSONL dataset."""
import srsly
import spacy

filepath = "sample_dataset.jsonl"

nlp = spacy.blank("en")

new_examples = []
for example in srsly.read_jsonl(filepath):
    doc = nlp(example["text"])
    new_spans = []
    for span in example["spans"]:
        # Snap the span to the nearest enclosing token boundaries.
        new_span = doc.char_span(
            span["start"], span["end"], label=span["label"], alignment_mode="expand"
        )
        if span["text"] != new_span.text:
            print(f'"{span["text"]}" -> "{new_span}" ({new_span.start_char}:{new_span.end_char})')
        new_spans.append({
            "start": new_span.start_char,
            "end": new_span.end_char,
            "label": new_span.label_,
            "text": new_span.text,
        })
    example["spans"] = new_spans
    new_examples.append(example)

srsly.write_jsonl(filepath, new_examples)