IndexError [E035] training recipe

I am trying to train on a gold dataset (an adjudicator did the review recipe).
This is my train command:

python -m prodigy train task_3_concepts_related_ILN_GOLD --ner ct_images_75_25_500_REVIEW --verbose

I am getting the following error:

IndexError: [E035] Error creating span with start 238 and end 15 for Doc of length 308.

I tried looking for the annotation with a start of 238 and an end of 15, but I can't find any in the JSONL annotation file. Following the "IndexError in train recipe" thread on Prodigy Support, I also tried to find it by connecting to the database directly via Python:

from prodigy.components.db import connect

db = connect()
examples = db.get_dataset("ct_images_75_25_500_REVIEW")

for i, example in enumerate(examples):
    if any((t['start'] == 238) and (t['end'] == 15)  for t in example['tokens']):
        print("found strange example at row {}".format(i))
        print(example)

This returned nothing.

I then dropped the db and performed a db-in again to see whether anything changed, but it's the same error.

Why could this be happening?

prodigy version: '1.11.7'

It's hard to say for sure what's going on, but let's try to figure it out :)

The span error could be caused by an issue between two tokenisers. Do you remember how you labelled the data? Is it possible that two different tokenisers were used during labelling?

Could you share the full traceback?

Are there any docs that you can find with 308 tokens?
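One way to look for such documents is to count the stored tokens per example. This is only a minimal sketch: `find_docs_with_token_count` is a hypothetical helper, and it assumes each example carries a "tokens" list as in the snippet above. Training may re-tokenize the text, so the stored count is not guaranteed to match the Doc length in the error.

```python
def find_docs_with_token_count(examples, n_tokens):
    """Return (row, example) pairs whose stored token list has n_tokens entries."""
    return [
        (i, eg)
        for i, eg in enumerate(examples)
        # Examples without a "tokens" key are treated as having zero tokens.
        if len(eg.get("tokens") or []) == n_tokens
    ]


# Hypothetical usage with the dataset named in this thread:
# from prodigy.components.db import connect
# examples = connect().get_dataset("ct_images_75_25_500_REVIEW")
# for row, eg in find_docs_with_token_count(examples, 308):
#     print(row, eg.get("spans"))
```

Printing the "spans" of each match, rather than the tokens, shows the offsets that the training step would try to turn into a span.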

Hi koaning,

I found the document that is causing the issue, and it's true, it seems to have:

"spans":[{"start":1253,"end":65,"token_start":238,"token_end":14,"label":"ATTENUATION"}]

Why would there ever be a situation where a token start is larger than the end? This may sound dumb, but is it possible that highlighting the text from right to left when creating a span during labeling would do this?

This may sound dumb, but is it possible that highlighting the text from right to left when creating a span during labeling would do this?

This should not matter, but I am curious to see if I can reproduce this issue locally. Do you have the text that belongs to that document? Also, what tokeniser are you using? A custom one? What language are you working with? My gut feeling is that this is an issue related to the tokenizer, but it's hard to confirm without the actual example.

If you remove this one example, are there other examples that also show a similar issue?
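To check the whole dataset for similar cases, a rough sketch like the one below scans the "spans" of every example rather than the "tokens". `find_reversed_spans` is a hypothetical helper, and the key names are taken from the span shown earlier in this thread. The earlier search on tokens would not have matched, since the reversed offsets live on the span. The error's end of 15 would be consistent with an inclusive "token_end" of 14, though that is an assumption here.

```python
def find_reversed_spans(examples):
    """Return (row, span) pairs where a span starts after it ends."""
    hits = []
    for i, eg in enumerate(examples):
        for span in eg.get("spans") or []:
            # Character offsets and token offsets are checked separately,
            # because either pair could be the reversed one.
            chars_reversed = span.get("start", 0) > span.get("end", 0)
            tokens_reversed = span.get("token_start", 0) > span.get("token_end", 0)
            if chars_reversed or tokens_reversed:
                hits.append((i, span))
    return hits


# Hypothetical usage with the dataset named in this thread:
# from prodigy.components.db import connect
# examples = connect().get_dataset("ct_images_75_25_500_REVIEW")
# for row, span in find_reversed_spans(examples):
#     print(row, span)
```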

I won't be able to give you the actual text (it's confidential), but I can tell you that I did not specify a tokenizer, and I tried this with two different models: the default base model and en_core_web_trf (which has its own tokenizer; I had not changed the pipeline after downloading it via spaCy). How could I go about checking the tokenizer? My next step will be to remove the example.

Just to double-check, what version of spaCy are you running? Has the project always been using spaCy v3?

As the docs mention, spaCy v3 introduced some features that deal with token alignment. So if an older version of spaCy was used at some point, that might explain what we're seeing.
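On checking the tokenizer: one option is to compare the token offsets stored with an annotation against the offsets a tokenizer produces for the same text today. The sketch below is only an illustration. `token_mismatches` and `make_spacy_tokenizer` are hypothetical helpers, and it assumes each stored token has character "start" and "end" keys and that a blank English pipeline is the right comparison; swap in whichever pipeline was actually used for labelling.

```python
def token_mismatches(example, tokenize):
    """Return (start, end) offsets that appear in only one of the two tokenizations.

    `tokenize` is any callable mapping a text to a list of (start, end) pairs.
    An empty result means the stored and fresh tokens line up exactly.
    """
    stored = {(t["start"], t["end"]) for t in example.get("tokens") or []}
    fresh = set(tokenize(example["text"]))
    return sorted(stored ^ fresh)


def make_spacy_tokenizer(lang="en"):
    """Build a tokenize callable from a blank spaCy pipeline (assumed choice)."""
    import spacy  # imported lazily so token_mismatches works without spaCy

    nlp = spacy.blank(lang)

    def tokenize(text):
        # token.idx is the character offset of the token in the text.
        return [(tok.idx, tok.idx + len(tok)) for tok in nlp.make_doc(text)]

    return tokenize


# Hypothetical usage:
# tokenize = make_spacy_tokenizer("en")
# for i, eg in enumerate(examples):
#     diff = token_mismatches(eg, tokenize)
#     if diff:
#         print(i, diff[:5])
```

Running `import spacy; print(spacy.__version__)` in the same environment answers the version question as well.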

Hi koaning,

I was able to fix the issue by removing the offending text. I tracked it down by printing out the actual text in

conda_envs/opioid_phenotypes/lib/python3.9/site-packages/prodigy/recipes/data_utils.py

Thank you for your help.
