I just tried to get around this by saving the Ginza model locally to disk and using the command below to reference the local model.
prodigy train ner my_data ./models/ja_ginza --output ./models/new_ginza
I also updated prodigy to the current version:
$ prodigy stats
Version 1.10.4
Platform Linux-4.4.0-87-generic-x86_64-with-glibc2.17
Python Version 3.8.1
I now get this error instead:
KeyError: "[E002] Can't find factory for 'CompoundSplitter'. This usually happens when spaCy calls `nlp.create_pipe` with a component name that's not built in - for example, when constructing the pipeline from a model's meta.json. If you're using a custom component, you can write to `Language.factories['CompoundSplitter']` or remove it from the model meta and add it via `nlp.add_pipe` instead."
Not quite sure what I need to do here. Specifically, what are "write to `Language.factories['CompoundSplitter']`" and "add it via `nlp.add_pipe`" referring to?
Digging in a bit more, I find that the `Can't find factory for 'CompoundSplitter'` error only seems to occur when running this command on Linux.
On MacOS, training runs smoothly and a new model is created.
prodigy train ner test_ner ./models/ja_ginza/ --output ./models/new_ginza
Here are stats on each environment.
MacOS
Version 1.10.4
Location /Users/me/.pyenv/versions/3.6.1/lib/python3.6/site-packages/prodigy
Prodigy Home /Users/me/.prodigy
Platform Darwin-19.6.0-x86_64-i386-64bit
Python Version 3.6.1
Database Name SQLite
Database Id sqlite
Total Datasets 9
Total Sessions 60
Linux
Version 1.10.4
Location /home/me/.pyenv/versions/Prodigy2/lib/python3.8/site-packages/prodigy
Prodigy Home /home/me/.prodigy
Platform Linux-4.4.0-87-generic-x86_64-with-glibc2.17
Python Version 3.8.1
Database Name SQLite
Database Id sqlite
Total Datasets 5
Total Sessions 16
I would rather not change the Python version on Linux, but what could be going on here?
Thank you, as always.
Glad you got it working! I think ultimately this comes down to how that package was implemented – it's not an "official" spaCy model we distribute, which is why the `spacy download` command won't work. So I also don't know the details of how it's packaged, etc.
The Linux/MacOS difference likely happens because you're running different Python environments here – maybe one of them has the Ginza package or additional dependencies installed that are needed to find the component, and maybe the other one doesn't?
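One quick way to compare the two environments is to check whether the Ginza package is even importable in each. A minimal, stdlib-only sketch (assuming the package's module name is `ginza` – that's an assumption, so adjust if the import name differs):

```python
# Check whether the Ginza package (assumed module name "ginza") is present
# in the active environment; find_spec returns None when it isn't installed.
import importlib.util

spec = importlib.util.find_spec("ginza")
print("ginza installed:", spec is not None)
```

Running this in both environments would show whether the Linux one is simply missing the package that provides the component.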
This refers to the implementation details of the custom component `CompoundSplitter`, which I assume is provided by the Ginza library? If a third-party library exposes a custom component, it needs to make that component available to spaCy; otherwise spaCy won't know how to set it up. One way to do this is via entry points, or via code that runs as part of the model package.
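For reference, the entry-points mechanism works by packages advertising their components under spaCy's `spacy_factories` entry-point group, which spaCy scans when it needs to resolve a factory name. A rough, stdlib-only sketch of listing whatever factories are advertised in the current environment (the list will simply be empty if no such package is installed):

```python
# List entry points in spaCy's "spacy_factories" group; a package like Ginza
# can register a component such as CompoundSplitter here so spaCy can find it.
import sys
from importlib.metadata import entry_points  # Python 3.8+

if sys.version_info >= (3, 10):
    eps = list(entry_points(group="spacy_factories"))
else:
    # On 3.8/3.9, entry_points() returns a dict keyed by group name
    eps = list(entry_points().get("spacy_factories", []))

for ep in eps:
    print(ep.name, "->", ep.value)
print("factories found:", len(eps))
```

If the Linux environment prints no `CompoundSplitter` entry while the MacOS one does, that would point to the missing-package explanation above.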