Unable to use train and run data-to-spacy recipes for spancat on prodigy 1.11.10

hi @bmosher01!

Thanks for your question and welcome to the Prodigy community :wave:

First off -- thank you so much for the detailed write-up. It helps us a lot, and we can respond much faster when users provide good details about their issue.

Do you have the same problem if you remove --base-model, both when training and when converting the data with data-to-spacy?

We've recently found some potential issues with --base-model in prodigy train, and it may affect data-to-spacy too.
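
To make that comparison concrete, here is a minimal sketch that builds both variants of each command, with and without --base-model. The output directories (./corpus, ./output) are placeholders I made up, and I'm assuming the dataset is called my_project as in your spans.manual command. It only prints the commands, so you can paste them into your terminal.

```python
import shlex


def prodigy_command(recipe, output_dir, dataset, base_model=None):
    """Build the argument list for `prodigy train` or `prodigy data-to-spacy`
    with a spancat dataset, optionally adding --base-model."""
    args = ["prodigy", recipe, output_dir, "--spancat", dataset]
    if base_model is not None:
        args += ["--base-model", base_model]
    return args


# Placeholder output directories; adjust them to your own setup.
for recipe, out_dir in (("data-to-spacy", "./corpus"), ("train", "./output")):
    # Variant 1: no base model. Variant 2: with the SciSpaCy base model.
    for base in (None, "en_core_sci_sm"):
        print(shlex.join(prodigy_command(recipe, out_dir, "my_project", base)))
```

If the variants without --base-model run cleanly and the ones with it fail, that points at the base model handling rather than at your annotations.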

Just curious, can you explain your thinking behind using the en_core_sci_sm model (SciSpaCy)?

Typically, base models are used when you want to use their vectors in a future pipeline, so I could see using SciSpaCy in data-to-spacy if you wanted your pipeline to have SciSpaCy's vectors during training. (I guess, in theory, you could also use a model just for its vectors, like en_core_sci_lg, instead.)

I could also see SciSpaCy helping if you wanted to use a correct or teach recipe with one of its components (say, a custom ner) and correct or teach that component in Prodigy. However, for spans, you're likely just as well off with a blank tokenizer.

prodigy spans.manual my_project en_core_sci_sm C:\Prodigy\Data\my_project.csv --loader csv --label RESPIRATORY,NEGATIVE

Also, for manual annotation recipes, you could use essentially any English tokenizer (e.g., blank:en). But I don't think the annotations are the problem; it's training or running data-to-spacy.

I'll admit I haven't used SciSpaCy before so I'll need to look more into it.

One last thing - I see you're running spaCy 3.4.4. Do you know if SciSpaCy 0.5.0 works with spaCy 3.4.4? I just know it's sometimes hard for projects to keep up with newer versions of spaCy.
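
One quick way to check is to look at which versions are installed and which spaCy range the SciSpaCy package itself declares. Below is a minimal sketch using only the Python standard library. I'm assuming the pip package names are spacy, scispacy and en_core_sci_sm, and the helper names are just for illustration. Run it in the same environment as Prodigy.

```python
import re
from importlib import metadata


def installed_version(package):
    """Return the installed version string, or None if the package is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def declared_requirements(package, dependency):
    """Return the requirement strings of `package` whose name is `dependency`."""
    try:
        requirements = metadata.requires(package) or []
    except metadata.PackageNotFoundError:
        return []
    matches = []
    for requirement in requirements:
        # The requirement name is everything before the first specifier,
        # extra, marker or whitespace character.
        name = re.split(r"[\s<>=!~;\[(]", requirement, maxsplit=1)[0]
        if name.lower() == dependency.lower():
            matches.append(requirement)
    return matches


# Assumed package names; a missing package is reported as None.
for name in ("spacy", "scispacy", "en_core_sci_sm"):
    print(name, installed_version(name))

# Shows the spaCy version range that the installed scispacy declares, if any.
print(declared_requirements("scispacy", "spacy"))
```

spaCy also ships a `python -m spacy validate` command that reports whether installed pipeline packages are compatible with the installed spaCy version, which may be worth running too.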

Regardless, let us know if you can at least overcome this bottleneck.