Bug Report: "TypeError: Cannot read properties of undefined (reading 'push')" when using pre-annotated spans for spans.manual

:bug: Bug description:

Using the spans.manual recipe with pre-annotated spans in the JSONL file, I get this error message:

TypeError: Cannot read properties of undefined (reading 'push')

:man_walking: Reproduction steps:
How can we recreate the bug?

Source file (jsonl):

{"text":"Wij bieden je een leuke baan, een prima werkplek met kost en inwoning, een aantrekkelijk salaris en uitzicht op een dienstverband voor langere tijd.\nHet salaris wordt na proefdagen definitief vastgesteld op basis van leeftijd, opleiding, kennis en kunde.\njaarcontract op basis van 32 uur per week of fulltime voor bepaalde tijd.","spans":[{"start":7,"end":14,"token_start":30,"token_end":69,"label":"COMPANY CULTURE"},{"start":15,"end":18,"token_start":71,"token_end":96,"label":"SALARY"},{"start":28,"end":46,"token_start":149,"token_end":254,"label":"SALARY"},{"start":47,"end":61,"token_start":255,"token_end":328,"label":"TYPE OF EMPLOYMENT CONTRACT"}]}

(shortened for readability, same happens if there are several lines in the jsonl file)

Recipe: spans.manual
Spacy Model: nl_core_news_lg

prodigy spans.correct pull_factors nl_core_news_lg prodigy_data.jsonl --label SALARY

:desktop_computer: Environment variables:
Please provide prodigy stats or Python version/OS/Prodigy version:

Version          1.14.0                        
Prodigy Home     /home/nlp/.prodigy            
Platform         Linux-5.4.0-148-generic-x86_64-with-glibc2.29
Python Version   3.8.10                        
Spacy Version    3.5.4                         
Database Name    SQLite                        
Database Id      sqlite     

My prodigy.json file only sets the port:

{
    "port": 8999
}

Please let me know if more information is needed. Thanks already!

Thanks @annatmp!

Sorry you're having issues (and welcome to the Prodigy community :wave:).

Thanks as well for the detailed question/info - this helps us a lot :pray: .

Just curious - how did you get those span annotations? Did you use another tool? If so, what tokenizer did you use? Was it different than spaCy's NL tokenizer?

I noticed your annotations didn't include a "tokens" list. Pre-annotated span data typically includes both the spans and the tokens. See the spans.manual interface example data.

There could be a mismatch: when you use nl_core_news_lg, Prodigy uses the NL tokenizer, but your spans may have been created with a different tokenizer.
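One quick way to look for such a mismatch is to print what the offsets in a span actually select from the text. This is only a minimal sketch: it assumes "start"/"end" are meant to be character offsets into "text", and `char_slices` is a hypothetical helper name, not part of Prodigy. The demo uses the beginning of the text and the first span from the report.

```python
import json


def char_slices(example, start_key="start", end_key="end"):
    """Return (label, selected text) per span, slicing "text" by the given keys."""
    text = example["text"]
    return [
        (span["label"], text[span[start_key]:span[end_key]])
        for span in example.get("spans", [])
    ]


# Beginning of the text and the first span from the report (shortened).
line = json.dumps({
    "text": "Wij bieden je een leuke baan, een prima werkplek met kost en "
            "inwoning, een aantrekkelijk salaris",
    "spans": [{"start": 7, "end": 14, "token_start": 30, "token_end": 69,
               "label": "COMPANY CULTURE"}],
})
example = json.loads(line)

# Slice with "start"/"end" as character offsets.
print(char_slices(example))
# [('COMPANY CULTURE', 'den je ')]

# Slice with "token_start"/"token_end" treated as character offsets instead.
print(char_slices(example, "token_start", "token_end"))
# [('COMPANY CULTURE', 'een prima werkplek met kost en inwoning')]
```

In this example the "token_start"/"token_end" values select a clean phrase when read as character offsets, while "start"/"end" cut through words. That may hint that the two pairs of fields ended up swapped in the older annotations, but that is only an observation from this sketch and not something confirmed in the thread.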

One other thing -- especially if you used a different tool for your annotations -- is you may be running into this small difference with spans in Prodigy (see the docs):

Note that the "token_end" value of the spans is inclusive and not exclusive (like spaCy’s token indices for Span objects or list indices in Python). So a span with start 5 and end 6 will include the tokens 5 and 6 and the token span in spaCy would be doc[token_start : token_end + 1] . We’re hoping to make this consistent in the future, but it’d be a breaking change and require a new version of Prodigy’s data format.
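To make the inclusive "token_end" concrete, here is a small sketch in plain Python. The token list is written by hand for illustration (it is not the output of the NL tokenizer), and `span_tokens` is a hypothetical helper name.

```python
def span_tokens(tokens, span):
    """Return the tokens covered by a Prodigy-style span.

    "token_end" is inclusive in Prodigy's format, so 1 is added
    to turn it into an exclusive Python slice bound.
    """
    return tokens[span["token_start"]:span["token_end"] + 1]


# Hand-written tokens for illustration only.
tokens = ["Wij", "bieden", "je", "een", "leuke", "baan", ","]
span = {"token_start": 3, "token_end": 5, "label": "EXAMPLE"}

print(span_tokens(tokens, span))
# ['een', 'leuke', 'baan']
```

If another tool wrote "token_end" as an exclusive index, every span would come out one token too long when read this way.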

To test the hypothesis that there's some discrepancy in your pre-annotated tokens with how Prodigy tokenizes/does spans, can you take your one example but without your pre-annotated spans (using spans.manual):

{"text":"Wij bieden je een leuke baan, een prima werkplek met kost en inwoning, een aantrekkelijk salaris en uitzicht op een dienstverband voor langere tijd.\nHet salaris wordt na proefdagen definitief vastgesteld op basis van leeftijd, opleiding, kennis en kunde.\njaarcontract op basis van 32 uur per week of fulltime voor bepaalde tijd."}

Label it identically in Prodigy using the same span labels you'd expect/want in your pre-annotated data example. Then run db-out to export that annotation and compare the two datasets. Are there any differences?

This lets you compare your pre-annotated data with the format Prodigy expects for the same labels (from the db-out export).
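The comparison itself could be sketched roughly like this. Both JSONL lines below are hand-written for illustration and are not real db-out output; `span_keys` and `diff_spans` are hypothetical helper names.

```python
import json

FIELDS = ("start", "end", "token_start", "token_end", "label")


def span_keys(jsonl_line):
    """Reduce each span in one JSONL line to a comparable tuple of its fields."""
    example = json.loads(jsonl_line)
    return {
        tuple(span.get(field) for field in FIELDS)
        for span in example.get("spans", [])
    }


def diff_spans(pre_line, exported_line):
    """Return the spans that appear in only one of the two lines."""
    pre, exported = span_keys(pre_line), span_keys(exported_line)
    return {"only_in_pre": pre - exported, "only_in_export": exported - pre}


# Hand-written illustration: the same phrase ("leuke baan") described twice.
pre_line = json.dumps({
    "text": "Wij bieden je een leuke baan",
    "spans": [{"start": 4, "end": 6, "token_start": 18, "token_end": 28,
               "label": "SALARY"}],
})
exported_line = json.dumps({
    "text": "Wij bieden je een leuke baan",
    "spans": [{"start": 18, "end": 28, "token_start": 4, "token_end": 5,
               "label": "SALARY"}],
})

print(diff_spans(pre_line, exported_line))
```

Any span listed under "only_in_pre" is one that the labelled-in-Prodigy version describes differently, which is where to start looking.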

One other small thing: you mentioned you were using the spans.manual recipe, but in your example command you used spans.correct.

Since it seems like you have pre-annotated data, then yes, you'd use the spans.manual recipe.

The spans.correct recipe assumes your pipeline has a spancat component, which nl_core_news_lg doesn't have by default. This wouldn't (I don't think) cause the issue you're running into, but it's definitely something I want to make sure I understand (you could've just accidentally typed spans.correct).

Hope this helps!

Hi!

Thanks for getting back to me. I found a workaround after posting.

I did try to get the spans using the LLM spans recipe, but in practice the OpenAI response times were so long that I used older annotations I still had instead. Maybe that caused a mismatch somewhere.

In the end I got it running by building a custom span categorizer that simply loads the annotations from a file, plus a custom recipe that calls the span categorizer.

(I did intend to use the spans.manual recipe but copied the wrong call into the bug report. The error you see was caused by spans.manual.)