Prodigy on Databricks raises ValueError: The HF model requires `transformers` to be installed

I am trying to run the ner.llm.fetch recipe with Prodigy's LLM support on Databricks, but it fails with a transformers error. I switched to poetry because I suspected a dependency conflict, but Prodigy still doesn't see transformers even though it is installed.

Command:

poetry run prodigy ner.llm.fetch config.cfg input_texts.csv output_texts.jsonl

I get the following error after running it:

2023-10-03 06:44:59.028731: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
/databricks/python/lib/python3.10/site-packages/scipy/__init__.py:155: UserWarning: A NumPy version >=1.18.5 and <1.25.0 is required for this version of SciPy (detected version 1.25.2
  warnings.warn(f"A NumPy version >={np_minversion} and <{np_maxversion}"
Traceback (most recent call last):
  File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-4f9c0c67-1213-4f0f-8220-29e6affd48ae/lib/python3.10/site-packages/prodigy/__main__.py", line 63, in <module>
    controller = recipe(*args, use_plac=True)
  File "cython_src/prodigy/core.pyx", line 883, in prodigy.core.recipe.recipe_decorator.recipe_proxy
  File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-4f9c0c67-1213-4f0f-8220-29e6affd48ae/lib/python3.10/site-packages/plac_core.py", line 367, in call
    cmd, result = parser.consume(arglist)
  File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-4f9c0c67-1213-4f0f-8220-29e6affd48ae/lib/python3.10/site-packages/plac_core.py", line 232, in consume
    return cmd, self.func(*(args + varargs + extraopts), **kwargs)
  File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-4f9c0c67-1213-4f0f-8220-29e6affd48ae/lib/python3.10/site-packages/prodigy/recipes/llm/ner.py", line 134, in llm_fetch_ner
    nlp = assemble(config_path, overrides=config_overrides)
  File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-4f9c0c67-1213-4f0f-8220-29e6affd48ae/lib/python3.10/site-packages/spacy_llm/util.py", line 49, in assemble
    return assemble_from_config(config)
  File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-4f9c0c67-1213-4f0f-8220-29e6affd48ae/lib/python3.10/site-packages/spacy_llm/util.py", line 29, in assemble_from_config
    nlp = load_model_from_config(config, auto_fill=True)
  File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-4f9c0c67-1213-4f0f-8220-29e6affd48ae/lib/python3.10/site-packages/spacy/util.py", line 587, in load_model_from_config
    nlp = lang_cls.from_config(
  File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-4f9c0c67-1213-4f0f-8220-29e6affd48ae/lib/python3.10/site-packages/spacy/language.py", line 1848, in from_config
    nlp.add_pipe(
  File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-4f9c0c67-1213-4f0f-8220-29e6affd48ae/lib/python3.10/site-packages/spacy/language.py", line 814, in add_pipe
    pipe_component = self.create_pipe(
  File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-4f9c0c67-1213-4f0f-8220-29e6affd48ae/lib/python3.10/site-packages/spacy/language.py", line 702, in create_pipe
    resolved = registry.resolve(cfg, validate=validate)
  File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-4f9c0c67-1213-4f0f-8220-29e6affd48ae/lib/python3.10/site-packages/confection/__init__.py", line 756, in resolve
    resolved, _ = cls._make(
  File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-4f9c0c67-1213-4f0f-8220-29e6affd48ae/lib/python3.10/site-packages/confection/__init__.py", line 805, in _make
    filled, _, resolved = cls._fill(
  File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-4f9c0c67-1213-4f0f-8220-29e6affd48ae/lib/python3.10/site-packages/confection/__init__.py", line 860, in _fill
    filled[key], validation[v_key], final[key] = cls._fill(
  File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-4f9c0c67-1213-4f0f-8220-29e6affd48ae/lib/python3.10/site-packages/confection/__init__.py", line 877, in _fill
    getter_result = getter(*args, **kwargs)
  File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-4f9c0c67-1213-4f0f-8220-29e6affd48ae/lib/python3.10/site-packages/spacy_llm/models/hf/falcon.py", line 79, in falcon_hf
    return Falcon(name=name, config_init=config_init, config_run=config_run)
  File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-4f9c0c67-1213-4f0f-8220-29e6affd48ae/lib/python3.10/site-packages/spacy_llm/models/hf/falcon.py", line 23, in __init__
    super().__init__(name=name, config_init=config_init, config_run=config_run)
  File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-4f9c0c67-1213-4f0f-8220-29e6affd48ae/lib/python3.10/site-packages/spacy_llm/models/hf/base.py", line 37, in __init__
    HuggingFace.check_installation()
  File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-4f9c0c67-1213-4f0f-8220-29e6affd48ae/lib/python3.10/site-packages/spacy_llm/models/hf/base.py", line 78, in check_installation
    raise ValueError(
ValueError: The HF model requires `transformers` to be installed, which it is not. See https://huggingface.co/docs/transformers/installation for installation instructions.

In my config.cfg file I am using the falcon-40b-instruct model:

[components.llm.model]
@llm_models = spacy.Falcon.v1
name = falcon-40b-instruct

Interesting, the ValueError is raised by this code, which seems sound.

So just to check, could you run this command and share the results?

poetry run python -m pip freeze
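
A complementary check, sketched here under the assumption that the installation check boils down to an import attempt (not confirmed in this thread): try importing transformers from the same interpreter and look at the real exception. A package can show up in pip freeze and still fail to import, for example when one of its own dependencies is broken. The helper name probe_import is hypothetical.

```python
import importlib
import sys
import traceback


def probe_import(module_name):
    """Try to import a module and report what actually happened."""
    try:
        module = importlib.import_module(module_name)
    except Exception as exc:  # broad on purpose: a broken dependency may not raise ImportError
        return {
            "ok": False,
            "error": f"{type(exc).__name__}: {exc}",
            "traceback": traceback.format_exc(),
            "location": None,
        }
    return {
        "ok": True,
        "error": None,
        "traceback": None,
        "location": getattr(module, "__file__", None),
    }


if __name__ == "__main__":
    # Shows which Python is running, so it can be compared with the paths in the traceback.
    print("interpreter:", sys.executable)
    result = probe_import("transformers")
    print("transformers importable:", result["ok"])
    print(result["location"] if result["ok"] else result["traceback"])
```

Saved under a name such as check_transformers.py (a hypothetical file name), it could be run with poetry run python check_transformers.py so that it uses the same environment as the failing command.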

Out of curiosity: is there a reason why you're using poetry instead of pip?

I was using pip earlier and got the same error. I thought some versions were conflicting, so I opted to use poetry.

The command:

poetry run python -m pip freeze

Output:

absl-py==1.0.0
accelerate==0.19.0
aiofiles==23.2.1
aiohttp==3.8.4
aiosignal==1.3.1
anyio==4.0.0
appdirs==1.4.4
argon2-cffi==21.3.0
argon2-cffi-bindings==21.2.0
astor==0.8.1
asttokens==2.2.1
astunparse==1.6.3
async-timeout==4.0.2
attrs==21.4.0
audioread==3.0.0
azure-core==1.27.1
azure-cosmos==4.3.1b1
azure-storage-blob==12.17.0b1
azure-storage-file-datalake==12.11.0
backcall==0.2.0
bcrypt==3.2.0
beautifulsoup4==4.11.1
black==22.6.0
bleach==4.1.0
blinker==1.4
blis==0.7.11
boto3==1.24.28
botocore==1.27.28
build==0.10.0
CacheControl==0.13.1
cachetools==5.3.1
catalogue==2.0.10
category-encoders==2.6.0
certifi==2023.7.22
cffi==1.15.1
chardet==4.0.0
charset-normalizer==3.3.0
cleo==2.0.1
click==8.1.7
cloudpickle==2.0.0
cmdstanpy==1.1.0
confection==0.1.3
configparser==5.2.0
convertdate==2.4.0
crashtest==0.4.1
cryptography==37.0.1
cycler==0.11.0
cymem==2.0.8
Cython==0.29.32
dacite==1.8.1
databricks-automl-runtime==0.2.16
databricks-cli==0.17.7
databricks-feature-store==0.13.5
databricks-sdk==0.1.6
dataclasses-json==0.5.8
datasets==2.12.0
dbl-tempo==0.1.23
dbus-python==1.2.18
debugpy==1.5.1
decorator==5.1.1
defusedxml==0.7.1
dill==0.3.4
diskcache==5.6.1
distlib==0.3.7
distro==1.7.0
distro-info==1.1+ubuntu0.1
docstring-to-markdown==0.12
dulwich==0.21.6
einops==0.6.1
entrypoints==0.4
ephem==4.1.4
evaluate==0.4.0
exceptiongroup==1.1.3
executing==1.2.0
facets-overview==1.0.3
fastapi==0.95.0
fastjsonschema==2.17.1
fasttext==0.9.2
filelock==3.12.4
flash-attn==1.0.5
Flask @ https://databricks-build-artifacts-manual-staging.s3.amazonaws.com/flask/Flask-1.1.2%2Bdb1-py2.py3-none-any.whl?AWSAccessKeyId=AKIAX7HWM34HCSVHYQ7M&Expires=2001354391&Signature=bztIumr2jXFbisF0QicZvqbvT9s%3D
flatbuffers==23.5.26
fonttools==4.25.0
frozenlist==1.3.3
fsspec==2023.9.2
future==0.18.2
gast==0.4.0
gitdb==4.0.10
GitPython==3.1.27
google-api-core==2.8.2
google-auth==1.33.0
google-auth-oauthlib==0.4.6
google-cloud-core==2.3.2
google-cloud-storage==2.9.0
google-crc32c==1.5.0
google-pasta==0.2.0
google-resumable-media==2.5.0
googleapis-common-protos==1.56.4
greenlet==1.1.1
grpcio==1.48.1
grpcio-status==1.48.1
gunicorn==20.1.0
gviz-api==1.10.0
h11==0.14.0
h5py==3.7.0
holidays==0.25
horovod==0.28.0
htmlmin==0.1.12
httplib2==0.20.2
huggingface-hub==0.16.4
idna==3.4
ImageHash==4.3.1
imbalanced-learn==0.8.1
importlib-metadata==6.8.0
importlib-resources==5.12.0
installer==0.7.0
ipykernel==6.17.1
ipython==8.10.0
ipython-genutils==0.2.0
ipywidgets==7.7.2
isodate==0.6.1
itsdangerous==2.0.1
jaraco.classes==3.3.0
jedi==0.18.1
jeepney==0.7.1
Jinja2==3.1.2
jmespath==0.10.0
joblib==1.2.0
joblibspark==0.5.1
jsonschema==4.16.0
jupyter-client==7.3.4
jupyter_core==4.11.2
jupyterlab-pygments==0.1.2
jupyterlab-widgets==1.0.0
keras==2.11.0
keyring==24.2.0
kiwisolver==1.4.2
korean-lunar-calendar==0.3.1
langchain==0.0.181
langcodes==3.3.0
launchpadlib==1.10.16
lazr.restfulclient==0.14.4
lazr.uri==1.0.6
lazy_loader==0.2
libclang==15.0.6.1
librosa==0.10.0
lightgbm==3.3.5
llvmlite==0.38.0
LunarCalendar==0.0.9
Mako==1.2.0
Markdown==3.3.4
MarkupSafe==2.1.3
marshmallow==3.19.0
marshmallow-enum==1.5.1
matplotlib==3.5.2
matplotlib-inline==0.1.6
mccabe==0.7.0
mistune==0.8.4
mleap==0.20.0
mlflow-skinny==2.4.2
more-itertools==8.10.0
mpmath==1.3.0
msgpack==1.0.5
multidict==6.0.4
multimethod==1.9.1
multiprocess==0.70.12.2
murmurhash==1.0.10
mypy-extensions==0.4.3
nbclient==0.5.13
nbconvert==6.4.4
nbformat==5.5.0
nest-asyncio==1.5.5
networkx==3.1
ninja==1.11.1
nltk==3.7
nodeenv==1.8.0
notebook==6.4.12
numba==0.55.1
numexpr==2.8.4
numpy==1.25.2
oauthlib==3.2.0
openai==0.27.7
openapi-schema-pydantic==1.2.4
opt-einsum==3.3.0
packaging==23.2
pandas==1.4.4
pandocfilters==1.5.0
paramiko==2.9.2
parso==0.8.3
pathspec==0.9.0
pathy==0.10.2
patsy==0.5.2
peewee==3.16.3
petastorm==0.12.1
pexpect==4.8.0
phik==0.12.3
pickleshare==0.7.5
Pillow==9.2.0
pkginfo==1.9.6
plac==1.1.3
platformdirs==3.11.0
plotly==5.9.0
pluggy==1.0.0
pmdarima==2.0.3
poetry==1.6.1
poetry-core==1.7.0
poetry-plugin-export==1.5.0
pooch==1.7.0
preshed==3.0.9
prodigy @ file:///Workspace/Repos/auto_anotate/prodicy-spacy-LLMs/prodigy-1.13.3-cp310-cp310-linux_x86_64.whl
prompt-toolkit==3.0.36
prophet==1.1.3
protobuf==3.19.4
psutil==5.9.0
psycopg2==2.9.3
ptyprocess==0.7.0
pure-eval==0.2.2
pyarrow==8.0.0
pyasn1==0.4.8
pyasn1-modules==0.2.8
pybind11==2.10.4
pycparser==2.21
pydantic==1.10.13
pyflakes==3.0.1
Pygments==2.11.2
PyGObject==3.42.1
PyJWT==2.8.0
PyMeeus==0.5.12
PyNaCl==1.5.0
pyodbc==4.0.32
pyparsing==3.0.9
pyproject_hooks==1.0.0
pyright==1.1.294
pyrsistent==0.18.0
pytesseract==0.3.10
python-apt==2.4.0+ubuntu2
python-dateutil==2.8.2
python-dotenv==1.0.0
python-editor==1.0.4
python-lsp-jsonrpc==1.0.0
python-lsp-server==1.7.1
pytoolconfig==1.2.2
pytz==2022.1
PyWavelets==1.3.0
PyYAML==6.0.1
pyzmq==23.2.0
rapidfuzz==2.15.1
regex==2023.10.3
requests==2.31.0
requests-oauthlib==1.3.1
requests-toolbelt==1.0.0
responses==0.18.0
rope==1.7.0
rsa==4.9
s3transfer==0.6.0
safetensors==0.3.3
scikit-learn==1.1.1
scipy==1.9.1
seaborn==0.11.2
SecretStorage==3.3.1
Send2Trash==1.8.0
sentence-transformers==2.2.2
sentencepiece==0.1.99
shap==0.41.0
shellingham==1.5.3
simplejson==3.17.6
six==1.16.0
slicer==0.0.7
smart-open==6.4.0
smmap==5.0.0
sniffio==1.3.0
soundfile==0.12.1
soupsieve==2.3.1
soxr==0.3.5
spacy==3.6.0
spacy-legacy==3.0.12
spacy-llm==0.4.3
spacy-loggers==1.0.5
spark-tensorflow-distributor==1.0.0
SQLAlchemy==1.4.39
sqlparse==0.4.2
srsly==2.4.8
ssh-import-id==5.11
stack-data==0.6.2
starlette==0.26.1
statsmodels==0.13.2
sympy==1.12
tabulate==0.8.10
tangled-up-in-unicode==0.2.0
tenacity==8.1.0
tensorboard==2.11.0
tensorboard-data-server==0.6.1
tensorboard-plugin-profile==2.11.2
tensorboard-plugin-wit==1.8.1
tensorflow==2.11.1
tensorflow-estimator==2.11.0
tensorflow-io-gcs-filesystem==0.32.0
termcolor==2.3.0
terminado==0.13.1
testpath==0.6.0
thinc==8.1.12
threadpoolctl==2.2.0
tiktoken==0.4.0
tokenize-rt==4.2.1
tokenizers==0.14.0
tomli==2.0.1
tomlkit==0.12.1
toolz==0.12.0
torch==2.0.1
torchvision==0.14.1+cu117
tornado==6.1
tqdm==4.66.1
traitlets==5.1.1
transformers==4.34.0
trove-classifiers==2023.9.19
typeguard==3.0.2
typer==0.9.0
typing-inspect==0.9.0
typing_extensions==4.5.0
ujson==5.4.0
unattended-upgrades==0.1
urllib3==2.0.6
uvicorn==0.18.3
virtualenv==20.24.5
visions==0.7.5
wadllib==1.3.6
wasabi==1.1.2
wcwidth==0.2.5
webencodings==0.5.1
websocket-client==0.58.0
Werkzeug==2.0.3
whatthepatch==1.0.2
widgetsnbextension==3.6.1
wordcloud==1.9.2
wrapt==1.14.1
xgboost==1.7.5
xxhash==3.2.0
yapf==0.31.0
yarl==1.9.2
ydata-profiling==4.2.0
zipp==3.8.0

Just to rule it out, could you try a fresh install with pip in a fresh virtualenv? Maybe something like:

python3.10 -m venv venv 
source venv/bin/activate
python -m pip install --upgrade pip
python -m pip install prodigy -f https://<LICENSE>@download.prodi.gy 
python -m pip install transformers

From here you should be able to call prodigy from the virtualenv that has the transformers library via:

python -m prodigy ...

The reason for doing this is that virtualenvs can be tricky beasts over time as installations keep getting patched on and sometimes it's just more pragmatic to start fresh. In particular I'd recommend strictly using python -m pip to ensure that you're always running in the correct virtualenv.
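
To make the python -m pip point concrete, here is a minimal sketch of checking from inside Python which interpreter is running and whether it belongs to a virtualenv, and of building a pip command bound to that exact interpreter. The function names in_virtualenv and pip_command are hypothetical; comparing sys.prefix with sys.base_prefix is the usual way to detect an environment created with python -m venv.

```python
import sys


def in_virtualenv():
    """Return True when running inside a venv (sys.prefix differs from the base install)."""
    return sys.prefix != getattr(sys, "base_prefix", sys.prefix)


def pip_command(*args):
    """Build a pip invocation tied to the running interpreter, like `python -m pip ...`."""
    return [sys.executable, "-m", "pip", *args]


if __name__ == "__main__":
    print("interpreter:", sys.executable)
    print("inside a virtualenv:", in_virtualenv())
    # Only prints the command; pass the list to subprocess.run(...) to actually install.
    print("install command:", " ".join(pip_command("install", "transformers")))
```

If the interpreter printed here differs from the one in the Prodigy traceback, the two commands are not running in the same environment.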

If this doesn't work I'll gladly check in again. I just want to see whether a fresh install fixes it.

I am having problems activating the environment.

bash: line 2: venv/bin/activate: No such file or directory

If I proceed to use pip without an environment, this error keeps coming up:

ImportError: cannot import name 'DEFAULT_CIPHERS' from 'urllib3.util.ssl_' (/local_disk0/.ephemeral_nfs/envs/pythonEnv-9694275d-8338-4aa9-9198-4242c0d787e9/lib/python3.10/site-packages/urllib3/util/ssl_.py)

Also, as it stands, I think Databricks does not allow environments to be created in the normal way.
The venv folder doesn't have a bin folder after the environment is created, hence the "bash: line 2: venv/bin/activate: No such file or directory" error.

I see. Does Databricks not allow virtual environments to be created at all? That seems like a strange design choice. Is it absolutely required that you run Prodigy on Databricks, then? Is a "normal" VM out of the picture?

I am running on Databricks because that's where I collaborate with my team.

In addition, I opted to run the Prodigy recipe from Python code:

import prodigy

prodigy.serve("ner.llm.fetch config.cfg input_texts.csv output_texts.jsonl", port=9000)

This works, though it produces the following error:

ImportError: cannot import name 'TypeVar' from 'typing_extensions'

This is a typing-extensions error, which can be solved.

Happy to hear there's progress! Usually the typing extensions issue can be solved by upgrading it.
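
Both ImportErrors in this thread have the same shape: a package is installed, but a name that another library expects is missing from it. As a rough sketch (the helper names has_name and installed_version are hypothetical), one could check for the name and print the installed version before and after upgrading:

```python
import importlib
from importlib import metadata


def has_name(module_name, attr):
    """Return True if `attr` can be found on the importable module `module_name`."""
    try:
        module = importlib.import_module(module_name)
    except ImportError:
        return False
    return hasattr(module, attr)


def installed_version(dist_name):
    """Return the installed version of a distribution, or None if it is not installed."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    # (distribution name, module to import, name the failing import asked for)
    checks = [
        ("typing_extensions", "typing_extensions", "TypeVar"),
        ("urllib3", "urllib3.util.ssl_", "DEFAULT_CIPHERS"),
    ]
    for dist, module, name in checks:
        print(dist, installed_version(dist), f"{module}.{name} present:", has_name(module, name))
```

If typing_extensions.TypeVar is reported missing, upgrading with python -m pip install --upgrade typing_extensions is the fix suggested above. For DEFAULT_CIPHERS, the pip freeze output lists urllib3==2.0.6. I believe the 2.x series removed that name, so pinning urllib3 below 2 is a commonly reported workaround, but that is an assumption and was not confirmed in this thread.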