Disappearing spans when using data-to-spacy

Hello,

I have a dataset of contracts annotated for NER (not using Prodigy), which I first import into Prodigy (using db-in) and then export to a spaCy dataset using data-to-spacy.
Training gave me poor results, and upon inspection I realized that some examples in the spaCy dataset had no spans, even though they were present in my original data.
They seem to disappear during the conversion to spaCy.

To reproduce it, here is the JSONL export (using db-out) of a single document that has this issue.

I generate the spacy dataset like so:

prodigy db-in sample_dataset sample_dataset.jsonl

mkdir spacy
prodigy data-to-spacy spacy/ --ner sample_dataset

Then I use the following Python code to check the generated train.spacy:

import spacy
from spacy.training import Corpus

filepath = "spacy/train.spacy"
nlp = spacy.blank("en")
corpus = Corpus(filepath)
train_data = corpus(nlp)
examples = list(train_data)
print(examples[0].reference.ents)
# Prints an empty tuple: ()
# But other examples I have tested do have some ents!

Any idea what could be going on?

Ubuntu 22.04.3 LTS
Prodigy 1.14.9 (same issue after upgrading to 1.14.12)
spaCy 3.7.2

I'm not 100% sure, but I think there may be a token mismatch here. I've taken your example and fed it to our ner.manual recipe like so:

python -m prodigy ner.manual xxx blank:en debug.jsonl --label PERSONAL_ATTRIBUTE,ORGANIZATION,NAME

This will take the existing annotations and render them in the UI. However, I get this error:

Using 3 label(s): PERSONAL_ATTRIBUTE, ORGANIZATION, NAME
Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "/Users/vincent/Development/prodigy/prodigy/__main__.py", line 50, in <module>
    main()
  File "/Users/vincent/Development/prodigy/prodigy/__main__.py", line 44, in main
    controller = run_recipe(run_args)
                 ^^^^^^^^^^^^^^^^^^^^
  File "/Users/vincent/Development/prodigy/prodigy/cli.py", line 110, in run_recipe
    return Controller.from_components(command, components)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/vincent/Development/prodigy/prodigy/core.py", line 155, in from_components
    return cls(
           ^^^^
  File "/Users/vincent/Development/prodigy/prodigy/core.py", line 307, in __init__
    if stream.is_empty:
       ^^^^^^^^^^^^^^^
  File "/Users/vincent/Development/prodigy/prodigy/components/stream.py", line 189, in is_empty
    return self.peek() is None
           ^^^^^^^^^^^
  File "/Users/vincent/Development/prodigy/prodigy/components/stream.py", line 204, in peek
    item = self._get_from_iterator()
           ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/vincent/Development/prodigy/prodigy/components/stream.py", line 317, in _get_from_iterator
    data = next(self._iterator)
           ^^^^^^^^^^^^^^^^^^^^
  File "/Users/vincent/Development/prodigy/prodigy/components/decorators.py", line 165, in inner
    yield from final_stream  # type: ignore
    ^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/vincent/Development/prodigy/prodigy/components/preprocess.py", line 203, in add_tokens
    _add_tokens(eg, doc, skip, overwrite, use_chars=use_chars)
  File "/Users/vincent/Development/prodigy/prodigy/components/preprocess.py", line 303, in _add_tokens
    eg["spans"] = sync_spans_to_tokens(eg["spans"], eg["tokens"], skip)
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/vincent/Development/prodigy/prodigy/components/preprocess.py", line 282, in sync_spans_to_tokens
    raise ValueError(err.format(end_idx, repr(span)))
ValueError: Mismatched tokenization. Can't resolve span to token index 230. This can happen if your data contains pre-set spans. Make sure that the spans match spaCy's tokenization or add a 'tokens' property to your task.

{'start': 228, 'end': 230, 'text': 'Dr', 'label': 'NAME', 'token_start': 48}

However, when I db-in this dataset and train on it, I don't seem to see any complaints.

python -m prodigy db-in xxx-debug debug.jsonl
python -m prodigy train --ner xxx-debug      
ℹ Using CPU

========================= Generating Prodigy config =========================
ℹ Auto-generating config with spaCy
✔ Generated training config

=========================== Initializing pipeline ===========================
[2023-12-14 13:15:08,316] [INFO] Set up nlp object from config
Components: ner
Merging training and evaluation data for 1 components
  - [ner] Training: 1 | Evaluation: 0 (20% split)
Training: 1 | Evaluation: 0
Labels: ner (4)
[2023-12-14 13:15:08,325] [INFO] Pipeline: ['tok2vec', 'ner']
[2023-12-14 13:15:08,327] [INFO] Created vocabulary
[2023-12-14 13:15:08,327] [INFO] Finished initializing nlp object
[2023-12-14 13:15:08,508] [INFO] Initialized pipeline components: ['tok2vec', 'ner']
✔ Initialized pipeline

============================= Training pipeline =============================
Components: ner
Merging training and evaluation data for 1 components
  - [ner] Training: 1 | Evaluation: 0 (20% split)
Training: 1 | Evaluation: 0
Labels: ner (4)
ℹ Pipeline: ['tok2vec', 'ner']
ℹ Initial learn rate: 0.001
E    #       LOSS TOK2VEC  LOSS NER  ENTS_F  ENTS_P  ENTS_R  SCORE 
---  ------  ------------  --------  ------  ------  ------  ------
  0       0          0.00      0.00    0.00    0.00    0.00    0.00
200     200          0.00      0.00    0.00    0.00    0.00    0.00

So it does seem like something is going slightly awry on our end in terms of data validation. The train recipe uses the data-to-spacy functionality internally, so I'll do a bit of a deep dive here. In the meantime, you may want to confirm whether you see the same error when you feed the data to ner.manual. You probably will, which implies something may have gone wrong when translating the data from your other tool into the Prodigy format.
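
If you'd like to spot the offending spans yourself without running a recipe, a rough sketch like this (assuming your export is called debug.jsonl and each span carries start/end character offsets) should flag them:

import srsly
import spacy

nlp = spacy.blank("en")

# Flag every span whose character offsets don't line up with token boundaries.
for i, eg in enumerate(srsly.read_jsonl("debug.jsonl")):
    doc = nlp(eg["text"])
    for span in eg.get("spans", []):
        # alignment_mode="strict" (the default) returns None when the
        # offsets don't map exactly onto spaCy's tokens.
        if doc.char_span(span["start"], span["end"], alignment_mode="strict") is None:
            print(f"Example {i}: misaligned span {span}")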

I'll report back when I've learned more. Please let me know if you dig up anything new but are still stuck. Even if Prodigy has a bug, I can still try and get you unstuck in the meantime! :slight_smile:


Thank you for your investigation!

I can confirm that I receive the same error when using ner.manual.

The only data validation I have performed is to compare each span's text with the corresponding text in that range (e.g., text[span["start"]:span["end"]] == span["text"]), and they all match.

Now, the API I use only returns spans based on character indices, and I am unsure of the tokenization method it employs. Could the issue be due to a mismatch between its tokenization and the tokenization used by spaCy's blank:en model?
If I'm not mistaken, blank:en considers "Dr." as a single token, and the annotation I have is for "Dr".
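
For example, a quick check with the blank pipeline (using a made-up sentence, not one from my actual data) seems to confirm this:

import spacy

nlp = spacy.blank("en")
doc = nlp("Dr. Smith signed the contract.")
print([token.text for token in doc])
# ['Dr.', 'Smith', 'signed', 'the', 'contract', '.']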

As an aside, for extra context: all base English models inside of spaCy use the same tokeniser under the hood. So the tokens from nlp = spacy.blank("en") should be the same as those from spacy.load("en_core_web_sm"). These tokens are all determined by the same rule-based system.
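
For instance, a quick sanity check along these lines (just a sketch, and it assumes you have en_core_web_sm downloaded) should show identical tokens from both:

import spacy

text = "Dr. Smith signed the contract."
blank_tokens = [t.text for t in spacy.blank("en")(text)]
model_tokens = [t.text for t in spacy.load("en_core_web_sm")(text)]
assert blank_tokens == model_tokens  # both use the same rule-based tokenizer
print(blank_tokens)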

But yeah, it does sound like there's a mismatch. One avenue to explore is to retokenize everything using spaCy's Doc.char_span method, which comes with an alignment_mode parameter that should allow you to wiggle around minor character-offset issues. Beware that this is an automated method which may also cause spans to be highlighted that weren't originally intended. But it could help with your current issue, if only as a temporary measure.

import srsly 
import spacy 

nlp = spacy.blank("en")
ex = next(srsly.read_jsonl("debug.jsonl"))
doc = nlp(ex['text'])
doc.char_span(228, 230, label="NAME", alignment_mode="expand")
# Returns `Dr.` 

Have you tried something like that?


That worked! The spaCy dataset now has all the spans.

Here is the script I used, for future reference:

"""
Fix misaligned spans in a Prodigy JSONL dataset.
"""
import srsly
import spacy

filepath = "sample_dataset.jsonl"

nlp = spacy.blank("en")

new_examples = []
for example in srsly.read_jsonl(filepath):
    spans = example['spans']
    doc = nlp(example['text'])
    new_spans = []
    for span in spans:
        new_span = doc.char_span(span['start'], span['end'], label=span['label'], alignment_mode="expand")
        if span['text'] != new_span.text:
            print(f'"{span["text"]}" -> "{new_span}" ({new_span.start_char}:{new_span.end_char})')
        new_spans.append({
            "start": new_span.start_char,
            "end": new_span.end_char,
            "label": new_span.label_,
            "text": new_span.text,
        })
    example["spans"] = new_spans
    new_examples.append(example)

srsly.write_jsonl(filepath, new_examples)