IndexError [E035] training recipe

klopez · May 20, 2022, 3:16pm

I am trying to train on a gold dataset (an adjudicator created it with the review recipe).
This is my train command:

python -m prodigy train task_3_concepts_related_ILN_GOLD --ner ct_images_75_25_500_REVIEW --verbose

I am getting the following error:

IndexError: [E035] Error creating span with start 238 and end 15 for Doc of length 308.

I tried looking for an annotation with a start of 238 and an end of 15, but I can't find one in the JSONL annotation file. Following the thread "IndexError in train recipe", I also tried to find it by connecting to the database directly from Python:

from prodigy.components.db import connect

db = connect()
examples = db.get_dataset("ct_images_75_25_500_REVIEW")

for i, example in enumerate(examples):
    if any((t['start'] == 238) and (t['end'] == 15)  for t in example['tokens']):
        print("found strange example at row {}".format(i))
        print(example)

This returned nothing.

I then dropped the db and ran db-in again to see whether that would change anything, but I get the same error.

Why could this be happening?

prodigy version: '1.11.7'

koaning · May 23, 2022, 1:26pm

It's hard to say for sure what's going on, but let's try to figure it out.

The span error could be caused by an issue between two tokenisers. Do you remember how you labelled the data? Is it possible that two different tokenisers were used during labelling?

Could you share the full traceback?

Are there any docs that you can find with 308 tokens?

klopez · May 27, 2022, 8:50pm

Hi koaning,

I found the document that is causing the issue, and it's true: it has this span:

"spans":[{"start":1253,"end":65,"token_start":238,"token_end":14,"label":"ATTENUATION"}]

Why would there ever be a situation where a token start is larger than the end? This may sound dumb but is it possible that when highlighting the text (creating a span) when labeling if I were to highlight from right to left would do this?

koaning · May 29, 2022, 8:43pm

"This may sound dumb but is it possible that when highlighting the text (creating a span) when labeling if I were to highlight from right to left would do this?"

This should not matter, but I am curious to see if I can reproduce this issue locally. Do you have the text that belongs to that document? Also, what tokeniser are you using? A custom one? What language are you working with? My gut feeling is that this is an issue related to the tokenizer, but it's hard to confirm without the actual example.

If you remove this one example, are there other examples that also show a similar issue?
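One way to check for that is sketched below. It assumes each example is a dict shaped like the JSONL shown earlier in this thread, with a "spans" list whose entries carry "start", "end", "token_start" and "token_end". The helper name find_reversed_spans and the sample rows are made up for illustration; only the second row mirrors the span reported above.

```python
def find_reversed_spans(examples):
    """Yield (row, span) for every span whose start lies after its end."""
    for row, example in enumerate(examples):
        # "spans" may be missing or None on examples without annotations.
        for span in example.get("spans") or []:
            chars_reversed = span["start"] > span["end"]
            tokens_reversed = (
                "token_start" in span
                and "token_end" in span
                and span["token_start"] > span["token_end"]
            )
            if chars_reversed or tokens_reversed:
                yield row, span


# Made-up rows for illustration; the second mirrors the span from this thread.
examples = [
    {"text": "...", "spans": [
        {"start": 0, "end": 4, "token_start": 0, "token_end": 0, "label": "ATTENUATION"},
    ]},
    {"text": "...", "spans": [
        {"start": 1253, "end": 65, "token_start": 238, "token_end": 14, "label": "ATTENUATION"},
    ]},
]

hits = list(find_reversed_spans(examples))
for row, span in hits:
    print(f"reversed span at row {row}: {span}")
```

With real data, the examples list would come from db.get_dataset("ct_images_75_25_500_REVIEW"), as in the snippet earlier in the thread. Note that this looks at the spans rather than the tokens, and that the reported token_end of 14 is consistent with the error's "end 15" if token_end is inclusive while the Doc slice end is exclusive (an assumption, not something confirmed in this thread).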

klopez · May 31, 2022, 3:25pm

I won't be able to share the actual text (it's confidential), but I can tell you that I did not specify a tokenizer. I tried this with two different models: the default base model and en_core_web_trf (which has its own tokenizer; I have not changed the pipeline since downloading it via spaCy). How could I go about checking the tokenizer? My next step will be to remove the example.

koaning · June 1, 2022, 8:53am

Just to double-check, what version of spaCy are you running? Has the project always been using spaCy v3?

As the docs mention here, spaCy v3 introduced some features that deal with token alignment. So if there's an older version of spaCy that was used, that might explain what we're seeing.

klopez · June 2, 2022, 2:19pm

Hi koaning,

I was able to fix the issue by eliminating the offending text. I did this by printing out the actual text in

conda_envs/opioid_phenotypes/lib/python3.9/site-packages/prodigy/recipes/data_utils.py

Thank you for your help.
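For anyone landing here later, removing such examples can also be done without editing them by hand. This is only a sketch under assumptions: it works on plain dicts shaped like the span shown above, not on the database itself, and the helper names is_reversed and split_examples are hypothetical.

```python
import json


def is_reversed(span):
    """True if the span starts after it ends, by characters or by tokens."""
    if span["start"] > span["end"]:
        return True
    if "token_start" in span and "token_end" in span:
        return span["token_start"] > span["token_end"]
    return False


def split_examples(examples):
    """Separate examples with at least one reversed span from the rest."""
    kept, dropped = [], []
    for example in examples:
        spans = example.get("spans") or []
        if any(is_reversed(span) for span in spans):
            dropped.append(example)
        else:
            kept.append(example)
    return kept, dropped


# Made-up rows for illustration; the second mirrors the span from this thread.
examples = [
    {"text": "...", "spans": [
        {"start": 0, "end": 4, "token_start": 0, "token_end": 0, "label": "ATTENUATION"},
    ]},
    {"text": "...", "spans": [
        {"start": 1253, "end": 65, "token_start": 238, "token_end": 14, "label": "ATTENUATION"},
    ]},
]

kept, dropped = split_examples(examples)

# One JSON object per line, i.e. the JSONL layout of the annotation file.
cleaned_jsonl = "\n".join(json.dumps(example) for example in kept)
print(f"kept {len(kept)}, dropped {len(dropped)}")
```

The cleaned lines could then be written to a new file and loaded into a fresh dataset with db-in, keeping the original dataset untouched in case the dropped examples need another look.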
