IndexError in train recipe

I am trying to use the train recipe after importing some preannotated jsonl data in this format:

{"text": "<i> ( d\u00e1le jen objednatel ) </i>", "spans": [{"token_end": 4, "start": 15, "label": "LABEL", "end": 26, "token_start": 4}]}

I load it as a dataset:

prodigy db-in cs_data ./annotated.jsonl 

Then when I use the train recipe, I run into an IndexError:

    span = doc.char_span(start, end, label)
  File "spacy\tokens\doc.pyx", line 573, in spacy.tokens.doc.Doc.char_span
  File "spacy\tokens\span.pyx", line 98, in spacy.tokens.span.Span.__cinit__
IndexError: [E035] Error creating span with start 17 and end 1 for Doc of length 18.

My train command is:

 prodigy train ./prodigy_models/ --ner cs_data --lang cs --verbose

Am I missing some fields in the annotated data file? Or is there something wrong with how I set up spaCy? I did check the data file and can't find any cases where `end` is smaller than `start`. Thanks!
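For reference, the kind of offset check I ran over the file looks roughly like this (standard library only; `check_spans` is just a name I made up for this sketch):

```python
import json

def check_spans(path):
    """Scan a JSONL file of annotation tasks for spans whose character
    offsets are reversed or fall outside the text."""
    problems = []
    with open(path, encoding="utf8") as f:
        for lineno, line in enumerate(f, start=1):
            task = json.loads(line)
            text = task["text"]
            for span in task.get("spans", []):
                # a valid span satisfies 0 <= start < end <= len(text)
                if not (0 <= span["start"] < span["end"] <= len(text)):
                    problems.append((lineno, span))
    return problems
```

It reports nothing for my data, which is why I'm confused about the error.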

I just ran spancat.manual on this single jsonl example.

{"text": "<i> ( d\u00e1le jen objednatel ) </i>"}

After labelling the name, I saved the label and quit Prodigy. Next, I ran prodigy db-out to have a look at the data format that Prodigy provides.

{
	"text":"<i> ( d\u00e1le jen objednatel ) </i>",
	"_input_hash":-1815072948,
	"_task_hash":1525844819,
	"tokens":[
		{"text":"<","start":0,"end":1,"id":0,"ws":false},
		{"text":"i","start":1,"end":2,"id":1,"ws":false},
		{"text":">","start":2,"end":3,"id":2,"ws":true},
		{"text":"(","start":4,"end":5,"id":3,"ws":true},
		{"text":"d\u00e1le","start":6,"end":10,"id":4,"ws":true},
		{"text":"jen","start":11,"end":14,"id":5,"ws":true},
		{"text":"objednatel","start":15,"end":25,"id":6,"ws":true},
		{"text":")","start":26,"end":27,"id":7,"ws":true},
		{"text":"<","start":28,"end":29,"id":8,"ws":false},
		{"text":"/i","start":29,"end":31,"id":9,"ws":false},
		{"text":">","start":31,"end":32,"id":10,"ws":false}
	],
	"_view_id":"spans_manual",
	"spans":[
		{"start":6,"end":25,"token_start":4,"token_end":6,"label":"NAME"}
	],
	"answer":"accept",
	"_timestamp":1650876701
}

Just to check, did I label the same thing as you? Could you share the tokens in your jsonl file as well? In general, I recommend parsing out the HTML before labelling it because the spaCy tokeniser might have a hard time dealing with HTML.
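If you do want to strip the tags before labelling, something along these lines works with just the standard library. Note that stripping shifts character offsets, so any spans annotated against the original text would need remapping (this is a minimal sketch, not a Prodigy utility):

```python
from html.parser import HTMLParser

class TagStripper(HTMLParser):
    """Collect only the text content, dropping all HTML tags."""
    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)

def strip_tags(text):
    stripper = TagStripper()
    stripper.feed(text)
    return "".join(stripper.parts)
```

For the example above, `strip_tags("<i> ( dále jen objednatel ) </i>")` keeps only the inner text, which the tokeniser handles much more predictably.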

I am able to train on the dataset with this example, and it seems to run just fine.

> prodigy train --spancat spandemo
ℹ Using CPU

========================= Generating Prodigy config =========================
ℹ Auto-generating config with spaCy
Using 'spacy.ngram_range_suggester.v1' for 'spancat' with sizes 3 to 3 (inferred from data)
✔ Generated training config

=========================== Initializing pipeline ===========================
[2022-04-25 10:59:00,030] [INFO] Set up nlp object from config
Components: spancat
Merging training and evaluation data for 1 components
  - [spancat] Training: 1 | Evaluation: 0 (20% split)
Training: 1 | Evaluation: 0
Labels: spancat (1)
[2022-04-25 10:59:00,039] [INFO] Pipeline: ['spancat']
[2022-04-25 10:59:00,042] [INFO] Created vocabulary
[2022-04-25 10:59:00,042] [INFO] Finished initializing nlp object
[2022-04-25 10:59:00,064] [INFO] Initialized pipeline components: ['spancat']
✔ Initialized pipeline

============================= Training pipeline =============================
Components: spancat
Merging training and evaluation data for 1 components
  - [spancat] Training: 1 | Evaluation: 0 (20% split)
Training: 1 | Evaluation: 0
Labels: spancat (1)
ℹ Pipeline: ['spancat']
ℹ Initial learn rate: 0.001
E    #       LOSS SPANCAT  SPANS_SC_F  SPANS_SC_P  SPANS_SC_R  SCORE 
---  ------  ------------  ----------  ----------  ----------  ------
  0       0          1.80        0.00        0.00        0.00    0.00
200     200          1.17        0.00        0.00        0.00    0.00
400     400          0.00        0.00        0.00        0.00    0.00

Thank you for getting back to me. The span I am labelling is 'objednatel' in this example. I can't really tell which sample is causing the error. In our use case, the HTML tags provide context, which is why I am leaving them in.

Am I supposed to provide tokens in the dataset? Perhaps the missing info is causing issues? I am only providing it in the format posted above. In this example, it should be: ['', '(', 'd\u00e1le', 'jen', 'objednatel', ')', '']

{"text":"<i> ( d\u00e1le jen objednatel ) </i>","spans":[{"token_end":4,"start":15,"label":"DEFINING_KEYWORD","end":26,"token_start":4}],"_input_hash":-1815072948,"_task_hash":489858686,"answer":"accept"}

This is my output from db-out. Am I supposed to specify how this is tokenized? Sorry, I am new to this.

I also noticed you are using --spancat; is that necessary?

Just to follow up @koaning, I tried your train command too and got the same error.

Ah! My bad! I mistook your NER task for a spancat task. Sorry about that.

I've gone through all the steps again. I just took your original example:

{"text": "<i> ( d\u00e1le jen objednatel ) </i>", "spans": [{"token_end": 4, "start": 15, "label": "LABEL", "end": 26, "token_start": 4}]}

And saved it into a file named ner-task.jsonl. Next, I added this one line into a dataset.

prodigy db-in cs_data ./ner-task.jsonl 

And I continued by running the train command.

> prodigy train ./prodigy_models/ --ner cs_data --lang cs --verbose
ℹ Using CPU

========================= Generating Prodigy config =========================
ℹ Auto-generating config with spaCy
✔ Generated training config

=========================== Initializing pipeline ===========================
[2022-05-02 09:49:14,984] [INFO] Set up nlp object from config
Components: ner
Merging training and evaluation data for 1 components
  - [ner] Training: 1 | Evaluation: 0 (20% split)
Training: 1 | Evaluation: 0
Labels: ner (1)
  - [ner] LABEL
[2022-05-02 09:49:14,994] [INFO] Pipeline: ['tok2vec', 'ner']
[2022-05-02 09:49:14,996] [INFO] Created vocabulary
[2022-05-02 09:49:14,997] [INFO] Finished initializing nlp object
Skipped 1 examples for 'ner'
Skipped 1 examples for 'ner'
Skipped 1 examples for 'ner'
[2022-05-02 09:49:15,018] [DEBUG] [W033] Training a new parser or NER using a model with no lexeme normalization table. This may degrade the performance of the model to some degree. If this is intentional or the language you're using doesn't have a normalization table, please ignore this warning. If this is surprising, make sure you have the spacy-lookups-data package installed and load the table in your config. The languages with lexeme normalization tables are currently: cs, da, de, el, en, id, lb, mk, pt, ru, sr, ta, th

Load the table in your config with:

[initialize.lookups]
@misc = "spacy.LookupsDataLoader.v1"
lang = ${nlp.lang}
tables = ["lexeme_norm"]

Skipped 1 examples for 'ner'
Skipped 1 examples for 'ner'
[2022-05-02 09:49:15,048] [INFO] Initialized pipeline components: ['tok2vec', 'ner']
✔ Initialized pipeline

============================= Training pipeline =============================
Components: ner
Merging training and evaluation data for 1 components
  - [ner] Training: 1 | Evaluation: 0 (20% split)
Training: 1 | Evaluation: 0
Labels: ner (1)
  - [ner] LABEL
ℹ Pipeline: ['tok2vec', 'ner']
ℹ Initial learn rate: 0.001
E    #       LOSS TOK2VEC  LOSS NER  ENTS_F  ENTS_P  ENTS_R  SCORE 
---  ------  ------------  --------  ------  ------  ------  ------
Skipped 1 examples for 'ner'
  0       0          0.00      0.00    0.00    0.00    0.00    0.00
200     200          0.00      0.00    0.00    0.00    0.00    0.00
400     400          0.00      0.00    0.00    0.00    0.00    0.00
600     600          0.00      0.00    0.00    0.00    0.00    0.00
800     800          0.00      0.00    0.00    0.00    0.00    0.00
1000    1000          0.00      0.00    0.00    0.00    0.00    0.00
1200    1200          0.00      0.00    0.00    0.00    0.00    0.00
1400    1400          0.00      0.00    0.00    0.00    0.00    0.00
1600    1600          0.00      0.00    0.00    0.00    0.00    0.00
✔ Saved pipeline to output directory

This ran without errors on my machine, so it can't hurt to ask: what versions of Prodigy, Python, and spaCy are you running? And what operating system?

Thank you for trying it again. I double-checked the data and still don't see any cases where end < start. It seems to have something to do with the tokenizer, but I'm not sure what the next logical debugging step should be.

I am running this on Windows 10 and Python 3.8.5. The relevant libraries' versions are:

spacy==3.2.4
spacy-legacy==3.0.9
spacy-loggers==1.0.2
prodigy==1.11.7

Your versions seem up to date, so it's probably not that.

Can you confirm that everything works for just the single example I'm using? That example doesn't have a token that starts at 17, so I'm assuming the error comes from another example in your dataset.

It feels like there's a single bad example somewhere; I suppose the easiest way to find the culprit might be to run something like:

from prodigy.components.db import connect

db = connect()
examples = db.get_dataset("DATASET_NAME")

for i, example in enumerate(examples):
    # guard with .get() in case an example carries no "tokens" list
    if any(t["start"] == 17 and t["end"] == 17 for t in example.get("tokens", [])):
        print(f"found strange example at row {i}")
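Another thing worth cross-checking is whether every span's character offsets actually land on token boundaries, since that mismatch is exactly what makes `char_span` fail. A minimal sketch, assuming each task carries a `tokens` list like the db-out example earlier in this thread (`span_matches_tokens` is a hypothetical helper name, not a Prodigy function):

```python
def span_matches_tokens(task):
    """Return True when every span's character offsets coincide with
    token boundaries recorded in the task (vacuously True if the task
    has no "tokens" list to check against)."""
    tokens = task.get("tokens")
    if not tokens:
        return True
    starts = {t["start"] for t in tokens}
    ends = {t["end"] for t in tokens}
    for span in task.get("spans", []):
        if span["start"] not in starts or span["end"] not in ends:
            return False
    return True
```

Run over your dataset, this would flag the example above: the span ends at character 26, but the token "objednatel" ends at 25 and ")" only starts at 26, so 26 is not a valid token end.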