Prodigy on Google Colab

I believe I've successfully installed Prodigy on Google Colab, and I want to use spaCy-llm from Colab with Prodigy to help with annotation for NER. My initial test following the tutorial gives me the error below, which is a bit weird since the port in question isn't in use at all. Any prompt help is appreciated.

!dotenv run -- python -m prodigy ner.llm.correct annotated-food config.cfg examples.jsonl
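My config.cfg follows the tutorial; roughly this (a sketch based on the Prodigy LLM docs, and the model section is my assumption about which OpenAI backend the tutorial registers):

[nlp]
lang = "en"
pipeline = ["llm"]

[components]

[components.llm]
factory = "llm"

[components.llm.task]
@llm_tasks = "spacy.NER.v2"
labels = ["DISH", "EQUIPMENT", "INGREDIENT"]

[components.llm.model]
@llm_models = "spacy.GPT-3-5.v1"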

2023-12-06 05:14:33.681822: E tensorflow/compiler/xla/stream_executor/cuda/cuda_dnn.cc:9342] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2023-12-06 05:14:33.681872: E tensorflow/compiler/xla/stream_executor/cuda/cuda_fft.cc:609] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2023-12-06 05:14:33.681908: E tensorflow/compiler/xla/stream_executor/cuda/cuda_blas.cc:1518] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2023-12-06 05:14:33.690005: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations. To enable the following instructions: AVX2 AVX512F FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2023-12-06 05:14:34.876275: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
2023-12-06 05:14:36.221091: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:894] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
2023-12-06 05:14:36.221572: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:894] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
2023-12-06 05:14:36.221776: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:894] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
Getting labels from the 'llm' component
Using 3 labels: ['DISH', 'EQUIPMENT', 'INGREDIENT']
✨ Starting the web server at http://localhost:8080 ...
Open the app in your browser and start annotating!
ERROR: [Errno 98] error while attempting to bind on address ('127.0.0.1', 8080): address already in use
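
In case a stale Colab process is holding the port, one workaround I'm considering (based on Prodigy's configuration docs, which allow overriding settings via PRODIGY_-prefixed environment variables or prodigy.json) is to point Prodigy at a different port:

!PRODIGY_PORT=8081 dotenv run -- python -m prodigy ner.llm.correct annotated-food config.cfg examples.jsonl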

I would focus on the TensorFlow issue first. I have only used PyTorch. Are you able to use TensorFlow on a local machine?

Hi @Fantahun,

As I posted on the earlier thread,

I think the concern is whether you're allowed to host web apps like Prodigy on Colab under their terms.

However, I'm curious why you're considering Colab when there are many better app-hosting environments out there. As the FAQ shows, Colab is really for interactive data analysis, not app hosting.

We've written deployment docs that show (as one example) how to use Digital Ocean. You could also use Heroku. Granted, these may cost some money (a few dollars), but that cost will save you a ton of headaches.

Alternatively, if you're trying to run Prodigy for other annotators not connected to your network/machine, we also mention how you can use ngrok to run Prodigy locally but serve it over a public URL so other colleagues can annotate.
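
That setup is minimal; roughly something like this (a sketch assuming ngrok is installed and authenticated, and Prodigy is on its default port 8080):

# terminal 1: run Prodigy as usual
python -m prodigy ner.llm.correct annotated-food config.cfg examples.jsonl
# terminal 2: tunnel the local port to a public URL
ngrok http 8080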

If you're solely trying to avoid any cost, I'd consider Hugging Face Spaces or Modal. We've been playing around with Modal recently and it's a great service. Have you tried either of these?

Thank you @ryanwesslen. I will explore those options and post my experience.

Dear @ryanwesslen, and everyone,

I'm updating my question to: Using spaCy-llm and Prodigy

Let me explain exactly why I'm using Prodigy. I have a ton of news article data on which I want to run Named Entity Recognition (NER), and later entity relationships. Fully manual annotation is impractical, if not impossible, as my entity types require SME knowledge and are not straightforward. For this reason, I planned to use spaCy-llm to assist with the annotation. I came across this tutorial on using spaCy-llm with Prodigy (https://youtu.be/SuFAXOgw35U?si=D70PXzX3r_SqhJBR), and it looks promising: the LLM supports you behind the scenes while Prodigy gives you the UI to correct the annotations. I'm looking for a cost-effective but powerful platform to run spaCy, spaCy-llm, and Prodigy all in one environment. I've purchased the Prodigy pro license as well.
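
For reference, my plan is to feed the articles in as JSONL, one object with a text field per line, e.g. (a made-up placeholder, not my real data):

{"text": "Acme Corp announced a merger with Globex on Tuesday, sources said."}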

FYI I’ve a laptop with 64GB RAM, 1.5 TB SSD, Core I7 8 logical processors and 2 GPUs with the following specs:

AdapterRAM          DeviceID           Name                      VideoArchitecture
2147483648 (2 GB)   VideoController1   NVIDIA Quadro M1000M      5
1073741824 (1 GB)   VideoController2   Intel(R) HD Graphics 530  5

Can I just use everything on my laptop rather than looking for cloud resources to train my model?

At one point I experimented with VS Code on my laptop and had a hard time running TensorFlow and related programs. I guess I'll have to dig deeper there, but please point me to the right information if there are known issues running this chain of tools on a Windows 10 laptop.

Your response is much appreciated and it means a lot to me.

Thank you!

hi @Fantahun!

Thanks for the background. Sounds like you have a great local setup.

I would definitely try to set up locally first, especially if you don't need other human annotators. By moving to the cloud, there are a ton of non-NLP infra details you'll need to handle (e.g., handling secrets so your OpenAI or other API keys aren't leaked, network security, networking/ports, perhaps domains, managing cloud resources, etc.).
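
Locally, that part stays simple: the dotenv wrapper in your command just loads a .env file, and for spacy-llm's OpenAI backend the only required secret is the API key (OPENAI_API_KEY is the variable spacy-llm reads; never commit this file):

# .env
OPENAI_API_KEY=sk-...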

It does sound like your local problems may be due more to Windows, which can be a bit trickier to work with. Personally, I moved away from Windows a long time ago and only use Linux/Mac, because I find almost any setup easier on UNIX than on Windows. So yes, perhaps the cloud can help you get around that, but is it worth taking on all the potential cloud infra details?

If you do need to move to the cloud, I'd definitely recommend Digital Ocean (some cost, but worth it, and you can follow our Deployment instructions) or perhaps Modal if you need a free tier (though you may hit its limits quickly if you have large compute needs).

If you're using OpenAI annotations for spacy-llm and are okay with paying for and submitting your data to OpenAI, you may not need much compute locally. I use that framework all the time on my laptop and rarely ever use my GPU. In general, we recommend first getting initial annotations (e.g., by correcting OpenAI's suggestions) and training with a small CPU model. This makes sure you get good-quality data before dealing with GPUs (e.g., installing drivers, which can be more of a pain than you'd think). Once you find that additional annotations are no longer improving performance, start moving to transformers with a GPU.
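
As a concrete sketch, a first CPU-only training run would look something like this (annotated-food is the dataset name from your command above):

python -m prodigy train ./model-cpu --ner annotated-food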

One thing to point out: you'll likely want to use the Prodigy LLM fetch recipes. These download the OpenAI annotations up front so you can easily reuse them and save on OpenAI calls.
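
Per the Prodigy LLM docs, that workflow looks roughly like this (file names are placeholders):

# fetch LLM suggestions once and cache them to disk
dotenv run -- python -m prodigy ner.llm.fetch config.cfg examples.jsonl fetched-examples.jsonl
# annotate from the cached file, without further OpenAI calls
python -m prodigy ner.manual annotated-food blank:en fetched-examples.jsonl --label DISH,EQUIPMENT,INGREDIENT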

If you begin having GPU-specific questions (e.g., perhaps you use OpenAI to pre-annotate, but then want to train a spaCy transformer), be sure to use the spaCy GitHub discussions forum. From a compute perspective, your issues will likely be with spaCy, not Prodigy, and that forum covers a lot more spaCy-specific problems, especially around GPU handling.
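
For reference, once you reach that stage, a transformer run might look something like this (a sketch; --base-model and --gpu-id are standard prodigy train flags, and en_core_web_trf assumes English data with the model installed):

python -m prodigy train ./model-trf --ner annotated-food --base-model en_core_web_trf --gpu-id 0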

Hope this helps!

Thank you very much for your time and the detailed explanation, @ryanwesslen. I will proceed with setting up a dual boot and work locally on Linux. I will share my results. Thank you again!