Using a spaCy-stanza model for tokenization with ner.manual

adamliter · June 29, 2020, 7:46pm

Is there a straightforward way to use a spaCy stanza model for tokenization with the ner.manual recipe?

I tried to save the model to disk and then use it with the ner.manual recipe from the command line, but this doesn't seem to work. Specifically, I tried:

$ python
>>> import stanza
>>> from spacy_stanza import StanzaLanguage
>>> snlp = stanza.Pipeline(lang="ru")
>>> nlp = StanzaLanguage(snlp)
>>> nlp.to_disk('/home/adamliter/stanza-spacy-ru-model')
>>> quit()
$ prodigy ner.manual test_dataset /home/adamliter/stanza-spacy-ru-model test_data.jsonl --label PER,LOC,ORG,MISC

This gives the following error:

OSError: [E050] Can't find model '/home/adamliter/stanza-spacy-ru-model'. It doesn't seem to be a shortcut link, a Python package or a valid path to a data directory.

Is this something I need to write a custom recipe for? Thanks in advance for any insight you can offer!

ines · June 30, 2020, 9:11am

Hi! The spacy-stanza wrapper currently doesn't serialize the stanza model data, so loading it back in still requires the loaded Stanza model. (I'm kinda confused by the error, though – this is normally what spaCy raises if the directory isn't valid or doesn't have a meta, so there might be something else going on here on top of it?)

Anyway, the good news is, once you have your spacy-stanza model loaded, you can use the nlp object like any other nlp object to tokenize text. So you should be able to adapt this example recipe here and just replace the nlp object with your spacy-stanza nlp object:

github.com/explosion/prodigy-recipes/blob/master/ner/ner_manual.py

import prodigy
from prodigy.components.loaders import JSONL
from prodigy.components.preprocess import add_tokens
from prodigy.util import split_string
import spacy
from typing import List, Optional


# Recipe decorator with argument annotations: (description, argument type,
# shortcut, type / converter function called on value before it's passed to
# the function). Descriptions are also shown when typing --help.
@prodigy.recipe(
    "ner.manual",
    dataset=("The dataset to use", "positional", None, str),
    spacy_model=("The base model", "positional", None, str),
    source=("The source data as a JSONL file", "positional", None, str),
    label=("One or more comma-separated labels", "option", "l", split_string),
    exclude=("Names of datasets to exclude", "option", "e", split_string),
)
def ner_manual(

This file has been truncated.
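As a rough illustration of the adaptation described above, here is a minimal sketch of a custom recipe that builds the spacy-stanza nlp object inside the recipe instead of calling spacy.load. The recipe name ner.manual.stanza, the function name, the file name stanza_recipe.py, and the choice to pass a Stanza language code as the second positional argument are all assumptions made for this example and are not part of the linked recipe. It also assumes the same Prodigy and spacy-stanza versions used elsewhere in this thread, and that the Stanza model for the language has already been downloaded.

```python
from typing import List, Optional

import prodigy
import stanza
from prodigy.components.loaders import JSONL
from prodigy.components.preprocess import add_tokens
from prodigy.util import split_string
from spacy_stanza import StanzaLanguage


@prodigy.recipe(
    "ner.manual.stanza",  # hypothetical name, chosen for this sketch
    dataset=("The dataset to use", "positional", None, str),
    lang=("Stanza language code, e.g. ru", "positional", None, str),
    source=("The source data as a JSONL file", "positional", None, str),
    label=("One or more comma-separated labels", "option", "l", split_string),
    exclude=("Names of datasets to exclude", "option", "e", split_string),
)
def ner_manual_stanza(
    dataset: str,
    lang: str,
    source: str,
    label: Optional[List[str]] = None,
    exclude: Optional[List[str]] = None,
):
    # Build the nlp object in memory rather than loading it from disk,
    # since the wrapper does not serialize the Stanza model data.
    snlp = stanza.Pipeline(lang=lang)
    nlp = StanzaLanguage(snlp)

    # Load the JSONL source and attach a "tokens" list to every task,
    # using the spacy-stanza pipeline for tokenization.
    stream = JSONL(source)
    stream = add_tokens(nlp, stream)

    return {
        "view_id": "ner_manual",  # manual span annotation interface
        "dataset": dataset,  # dataset to save annotations to
        "stream": stream,  # tokenized incoming examples
        "exclude": exclude,  # datasets whose examples should be skipped
        "config": {"labels": label},  # labels shown in the UI
    }
```

If this were saved as stanza_recipe.py, it could presumably be run with something like: prodigy ner.manual.stanza test_dataset ru test_data.jsonl --label PER,LOC,ORG,MISC -F stanza_recipe.py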

adamliter · June 30, 2020, 2:34pm

That's interesting. The directory on my computer does have a meta.json and a vocab directory, but, anyway, thanks for the pointer to the ner.manual recipe. This should be easy enough to modify. Much appreciated!

ines · June 30, 2020, 2:50pm

Yeah, I'm really not sure why you would be seeing that error. spaCy performs a series of checks in order when you call spacy.load, and error E050 is only raised if all of them fail. Maybe as a sanity check, try:

from pathlib import Path
print(Path("/home/adamliter/stanza-spacy-ru-model").exists())
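The sanity check can be extended slightly to also look for the meta.json mentioned earlier in the thread. This is only a sketch of things worth inspecting by hand, not a reproduction of the checks spaCy itself runs; the helper name describe_model_dir is made up for this example.

```python
from pathlib import Path


def describe_model_dir(path):
    """Report a few basic facts about a would-be model directory."""
    model_dir = Path(path)
    return {
        # Does anything exist at this path at all?
        "exists": model_dir.exists(),
        # Is it a directory (rather than a file)?
        "is_dir": model_dir.is_dir(),
        # Does it contain a meta.json file at the top level?
        "has_meta_json": (model_dir / "meta.json").is_file(),
    }
```

Calling print(describe_model_dir("/home/adamliter/stanza-spacy-ru-model")) shows all three values at once, which makes it easier to tell a mistyped path apart from a directory that exists but is missing its meta.json.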
