Using a spaCy-stanza model for tokenization with ner.manual

Is there a straightforward way to use a spacy-stanza model for tokenization with the ner.manual recipe?

I tried to save the model to disk and then use it with the ner.manual recipe from the command line, but this doesn't seem to work. Specifically, I tried:

$ python
>>> import stanza
>>> from spacy_stanza import StanzaLanguage
>>> snlp = stanza.Pipeline(lang="ru")
>>> nlp = StanzaLanguage(snlp)
>>> nlp.to_disk('/home/adamliter/stanza-spacy-ru-model')
>>> quit()
$ prodigy ner.manual test_dataset /home/adamliter/stanza-spacy-ru-model test_data.jsonl --label PER,LOC,ORG,MISC

This gives the following error:

OSError: [E050] Can't find model '/home/adamliter/stanza-spacy-ru-model'. It doesn't seem to be a shortcut link, a Python package or a valid path to a data directory.

Is this something I need to write a custom recipe for? Thanks in advance for any insight you can offer! :slight_smile:

Hi! The spacy-stanza wrapper currently doesn't serialize the stanza model data, so loading it back in still requires the loaded Stanza model. (I'm kinda confused by the error, though – this is normally what spaCy raises if the directory isn't valid or doesn't have a meta, so there might be something else going on here on top of it?)

Anyway, the good news is, once you have your spacy-stanza model loaded, you can use the nlp object like any other nlp object to tokenize text. So you should be able to adapt this example recipe here and just replace the nlp object with your spacy-stanza nlp object:
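To make that concrete, here is a minimal sketch of what such a custom recipe could look like. It is not the official example recipe, just an illustration: the recipe name ner.manual.stanza is made up, and it assumes the spacy-stanza v0.2 StanzaLanguage API from the snippet above plus Prodigy's JSONL loader and add_tokens preprocessor:

```python
import prodigy
from prodigy.components.loaders import JSONL
from prodigy.components.preprocess import add_tokens


@prodigy.recipe(
    "ner.manual.stanza",
    dataset=("Dataset to save annotations to", "positional", None, str),
    source=("Path to the input JSONL file", "positional", None, str),
    label=("Comma-separated label(s)", "option", "l", str),
)
def ner_manual_stanza(dataset, source, label):
    import stanza
    from spacy_stanza import StanzaLanguage

    # Build the spacy-stanza pipeline in the recipe instead of
    # loading a serialized model from disk
    snlp = stanza.Pipeline(lang="ru")
    nlp = StanzaLanguage(snlp)

    stream = JSONL(source)
    # Tokenize the incoming examples with the spacy-stanza nlp object
    # so the manual NER interface can render token boundaries
    stream = add_tokens(nlp, stream)

    return {
        "view_id": "ner_manual",
        "dataset": dataset,
        "stream": stream,
        "config": {"labels": label.split(",")},
    }
```

You'd then point Prodigy at the file containing the recipe with the -F flag, e.g. prodigy ner.manual.stanza test_dataset test_data.jsonl --label PER,LOC,ORG,MISC -F recipe.py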

That's interesting. The directory on my computer does have a meta.json and a vocab directory, but, anyway, thanks for the pointer to the ner.manual recipe. This should be easy enough to modify. Much appreciated!

Yeah, I'm really not sure why you would be seeing that error :thinking: Here are the checks spaCy performs, in order, when you call spacy.load; error E050 is raised only if all of them fail. Maybe as a sanity check, try:

from pathlib import Path
print(Path("/home/adamliter/stanza-spacy-ru-model").exists())

If that prints False, the path spaCy receives isn't resolving to the directory you expect.