Deploy prodigy using Kubernetes in Google Cloud

Hi,
I am trying to deploy a Prodigy instance on GCP using Kubernetes. It uses a PostgreSQL database on GCP. For this I am using prodigy.serve and FastAPI. The Python script runs successfully: the Prodigy annotation tool starts up and can be viewed in a browser. But when we close the browser and try to open it again, it shows an internal server error. The logs also show that the system keeps restarting. Could someone help me resolve this issue?

Could you share some of the logs, tracebacks, as well as the code that you're using to start up Prodigy? Feel free to anonymise anything that's sensitive, but it'll be much easier to think along with that information.

I've found an answer related to Prodigy and Kubernetes with some general tips that you may also find useful here.

Here are the logs:

Error
2022-05-17T06:56:44.691587884ZINFO: Waiting for application shutdown.
Error
2022-05-17T06:56:44.691769742ZINFO: Application shutdown complete.
Error
2022-05-17T06:56:44.692255206ZINFO: Finished server process [1]
Error
2022-05-17T06:56:44.694544863ZINFO: Started server process [1]
Error
2022-05-17T06:56:44.694589542ZINFO: Waiting for application startup.
Error
2022-05-17T06:56:44.694839476ZINFO: Application startup complete.
Error
2022-05-17T06:56:44.695149013ZINFO: Uvicorn running on http://0.0.0.0:8080 (Press CTRL+C to quit)
Info
2022-05-17T06:56:56.290240662ZINFO: ****************** - "GET /health HTTP/1.1" 200 OK
Info
2022-05-17T06:56:56.534631690ZINFO: ****************** - "GET /health HTTP/1.1" 200 OK
Info
2022-05-17T06:56:59.491370261ZINFO: ****************** - "GET /health HTTP/1.1" 200 OK
Info
2022-05-17T06:57:11.290891312ZINFO: ****************** - "GET /health HTTP/1.1" 200 OK
Info
2022-05-17T06:57:11.536171959ZINFO: ****************** - "GET /health HTTP/1.1" 200 OK
Info
2022-05-17T06:57:12.179595780ZINFO: ****************** - "GET /health HTTP/1.1" 200 OK
Info
2022-05-17T06:57:14.492161815ZINFO: ****************** - "GET /health HTTP/1.1" 200 OK
Error
2022-05-17T06:57:14.523124382Z2022/05/17 06:57:14 Client closed local connection on 127.0.0.1:5432
Error
2022-05-17T06:57:19.039014329Ze[1;38;5;135m06:57:19e[0m: INIT: Setting all logging levels to 10
Error
2022-05-17T06:57:20.029215903Ze[1;38;5;135m06:57:20e[0m: RECIPE: Calling recipe 'topic-annotation'
Error
2022-05-17T06:57:20.029346902Ze[1;38;5;135m06:57:20e[0m: CONFIG: Using config from global prodigy.json
Error
2022-05-17T06:57:20.029697187Z/app/prodigy.json
Error
2022-05-17T06:57:20.029703284Z{}
Error
2022-05-17T06:57:20.029819141Ze[1;38;5;135m06:57:20e[0m: VALIDATE: Validating components returned by recipe
Error
2022-05-17T06:57:20.030269052Ze[1;38;5;135m06:57:20e[0m: CONTROLLER: Initialising from recipe
Error
2022-05-17T06:57:20.030295676Z{'before_db': None, 'config': {'blocks': [{'view_id': 'choice'}], 'choice_style': 'multiple', 'exclude_by': 'input', 'dataset': 'text_topics_selected_1', 'recipe_name': 'topic-annotation', 'db': 'postgresql', 'db_settings': {'postgresql': {'user': 'prodigycloudtest', 'password': '', 'host': '127.0.0.1', 'port': 5432, 'dbname': 'prodigydb'}}, 'feed_overlap': False, 'show_stats': True, 'port': 8080}, 'dataset': 'text_topics_selected_1', 'db': True, 'exclude': None, 'get_session_id': None, 'metrics': None, 'on_exit': None, 'on_load': None, 'progress': <prodigy.components.progress.ProgressEstimator object at 0x7fc331ee3fa0>, 'self': <prodigy.core.Controller object at 0x7fc331ef4070>, 'stream': <generator object topic_annotation.<locals>.get_stream at 0x7fc331eec5f0>, 'update': None, 'validate_answer': None, 'view_id': 'blocks'}
Error
2022-05-17T06:57:20.030303639Z{}
Error
2022-05-17T06:57:20.030316269Ze[1;38;5;135m06:57:20e[0m: VALIDATE: Creating validator for view ID 'blocks'
Error
2022-05-17T06:57:20.030407045Ze[1;38;5;135m06:57:20e[0m: VALIDATE: Validating Prodigy and recipe config
Error
2022-05-17T06:57:20.139177375Ze[1;38;5;135m06:57:20e[0m: DB: Connecting to database PostgreSQL
Error
2022-05-17T06:57:20.147630241Ze[1;38;5;135m06:57:20e[0m: DB: Creating dataset '2022-05-17_06-57-20'
Error
2022-05-17T06:57:20.147670022Z{'created': datetime.datetime(2022, 4, 7, 7, 25, 1)}
Error
2022-05-17T06:57:20.147676708Z{}
Error
2022-05-17T06:57:20.427986990Ze[1;38;5;135m06:57:20e[0m: FEED: Initializing from controller
Error
2022-05-17T06:57:20.428704996Z2022/05/17 06:57:20 New connection for "*******************"
Error
2022-05-17T06:57:20.515482262Ze[1;38;5;135m06:57:20e[0m: CORS: initialized with wildcard "*" CORS origins
Info
2022-05-17T06:57:20.515709667Z{}
Info
2022-05-17T06:57:20.515738302Z✨ Starting the web server at http://0.0.0.0:8080 ...
Info
2022-05-17T06:57:20.515745374ZOpen the app in your browser and start annotating!
Info
2022-05-17T06:57:20.515749617Z{}
Error
2022-05-17T06:57:20.525426819ZINFO: Started server process [1]
Error
2022-05-17T06:57:20.525500775ZINFO: Waiting for application startup.
Error
2022-05-17T06:57:20.525709962ZINFO: Application startup complete.
Error
2022-05-17T06:57:20.526174754ZINFO: Uvicorn running on http://0.0.0.0:8080 (Press CTRL+C to quit)
Info
2022-05-17T06:57:26.295673617ZINFO: ****************** - "GET /health HTTP/1.1" 404 Not Found
Info
2022-05-17T06:57:26.537888100ZINFO: ****************** - "GET /health HTTP/1.1" 404 Not Found
Error
2022-05-17T06:57:59.648300027ZINFO: Waiting for application shutdown.
Error
2022-05-17T06:57:59.648423267ZINFO: Application shutdown complete.
Error
2022-05-17T06:57:59.648523392ZINFO: Finished server process [1]
Error
2022-05-17T06:57:59.650428626ZINFO: Started server process [1]
Error
2022-05-17T06:57:59.650457319ZINFO: Waiting for application startup.
Error
2022-05-17T06:57:59.650676371ZINFO: Application startup complete.
Error
2022-05-17T06:57:59.650979938ZINFO: Uvicorn running on http://0.0.0.0:8080 (Press CTRL+C to quit)
Info
2022-05-17T06:58:11.293110099ZINFO: ****************** - "GET /health HTTP/1.1" 200 OK
Info
2022-05-17T06:58:11.539315377ZINFO: ****************** - "GET /health HTTP/1.1" 200 OK

Info
2022-05-17T06:58:12.184281822ZINFO: ****************** - "GET /health HTTP/1.1" 200 OK
Info
2022-05-17T06:58:14.491757079ZINFO: ****************** - "GET /health HTTP/1.1" 200 OK
Info
2022-05-17T06:58:26.294572128ZINFO: ****************** - "GET /health HTTP/1.1" 200 OK
Info
2022-05-17T06:58:26.540475132ZINFO: ****************** - "GET /health HTTP/1.1" 200 OK
Info
2022-05-17T06:58:27.185292101ZINFO: ****************** - "GET /health HTTP/1.1" 200 OK
Info
2022-05-17T06:58:29.492209611ZINFO: ****************** - "GET /health HTTP/1.1" 200 OK

In your first message you refer to an internal server error, but I don't see it in the logs. Are these the logs from your custom FastAPI service that starts Prodigy? Could you share any traceback? Could you also share what Prodigy recipe you're running?

Also, can you confirm that you're running modern versions of Python/FastAPI/Prodigy? Does this error persist when you use a different browser? The Brave browser seems to have caused issues in the past.

Now I am getting not found error.

The logs that I shared were from Kubernetes while running the application.
Here is the code that I am running:

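import prodigy
from fastapi import FastAPI, status
from prodigy import set_hashes
from prodigy.components.loaders import JSONL

app = FastAPI()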


@app.get("/health")
async def health():
    """
    Shows application health information.
     For testing purpose
    """
    status_code = status.HTTP_200_OK
    return status_code


def add_options(task, options):
    """Helper function to add options to every task in a stream."""
    options = [{"id": option, "text": option} for option in options]
    task['label'] = task['source']
    task["options"] = options
    return task


@prodigy.recipe("topic-annotation")
def topic_annotation(dataset, file_path):
    def get_stream():
        stream = JSONL(file_path)
        for eg in stream:
            # If there are no options for the item, we need to skip it
            # (Otherwise there will be a very strange error message)
            if not eg["main_topics"]:
                continue

            eg = add_options(eg, eg['main_topics'])

            # Somehow we need to explicitly set the hashes for
            # the filtering to work (seems to be a bug),
            # otherwise already annotated items
            # will show when restarting the task
            eg = set_hashes(eg)
            yield eg

    blocks = [
        {"view_id": "choice"},
    ]
    return {
        "dataset": dataset,
        "stream": get_stream(),
        "view_id": "blocks",
        "config": {"blocks": blocks, "choice_style": "multiple",
                   "exclude_by": "input"}
    }
model = "topic-annotation"
dataset_name = "text_topics_selected_1"
json_file = "topic_dataset_v2_selected_1.jsonl"
host = "0.0.0.0"
port = 8080
prodigy.serve(model, dataset_name, json_file, port=port)

The versions that I am using are the following:
Python: 3.8
FastAPI: 0.74.1
Prodigy: How can I check which version I am using here? I am installing Prodigy inside my Docker image.

I was testing this in Chrome. I have also tried Mozilla Firefox, but I still get the same error.

Could you share the full error with traceback? That'll make it much easier for me to try and reproduce and find the bug.

You can also find the Prodigy version from Python code:

import prodigy
print(prodigy.__version__)
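
If importing Prodigy inside the container is inconvenient, the installed version can probably also be read from the package metadata. This is a small sketch using only the standard library (Python 3.8+); the helper name `installed_version` is made up for illustration.

```python
from importlib import metadata


def installed_version(package):
    """Return the installed version of a package, or None if it is not installed."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


# Inside the Docker image you could then run, for example:
# print(installed_version("prodigy"))
```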

One thing I'm noticing about the way that you're running Prodigy is that you seem to be using the deprecated *args. To quote the relevant docs:

As of v1.9, the prodigy.serve function also takes a string in the same format and style as the command-line recipe commands. In v1.8 and below, you have to pass in the recipe name as the first argument, followed by all recipe arguments in positional order. This was inconvenient and could easily lead to unexpected results.

We now recommend using serve with a single string argument, such that you can mimic the command line.
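
As a rough sketch of what that could look like for the script in this thread: the helper names `build_serve_command` and `start_prodigy` are hypothetical, and the sketch assumes the custom `topic-annotation` recipe is registered (via `@prodigy.recipe`) in the same script before `prodigy.serve` is called.

```python
import shlex


def build_serve_command(recipe, dataset, source):
    """Build a command-line style string, quoting arguments where needed."""
    return shlex.join([recipe, dataset, source])


def start_prodigy(port=8080):
    # Imported here so the helper above can be used without Prodigy installed.
    import prodigy

    command = build_serve_command(
        "topic-annotation",
        "text_topics_selected_1",
        "topic_dataset_v2_selected_1.jsonl",
    )
    # Single string argument, mirroring what you would type on the command line.
    prodigy.serve(command, port=port)
```

Calling `start_prodigy()` at the bottom of the script would then replace the positional `prodigy.serve(model, dataset_name, json_file, port=port)` call.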

I also noticed from the comments that you're assuming it's a bug that you need to set the hashes manually. It's explained in more detail in our docs, but the hashes may need to be set manually because your custom task may have custom input_keys and task_keys.
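
To illustrate why the keys matter, here is a plain-Python sketch of the general idea. It is not Prodigy's actual hashing code: the function names and the "text" and "options" keys are assumptions chosen for illustration. The input hash depends only on the input keys, while the task hash also depends on the task keys.

```python
import hashlib
import json


def hash_fields(example, keys, prefix=""):
    """Hash the selected fields of an example (illustration only)."""
    payload = prefix + json.dumps({k: example.get(k) for k in keys}, sort_keys=True)
    return hashlib.sha1(payload.encode("utf8")).hexdigest()


def illustrate_hashes(example, input_keys, task_keys):
    """Return an (input_hash, task_hash) pair for one example."""
    input_hash = hash_fields(example, input_keys)
    # The task hash builds on the input hash plus the task-specific fields.
    task_hash = hash_fields(example, task_keys, prefix=input_hash)
    return input_hash, task_hash
```

Two tasks with the same text but different options would share an input hash and differ in their task hash, which is what settings like "exclude_by": "input" rely on. If the default keys don't match the fields in your data, the defaults may not tell your examples apart.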

Also, could you show me the Dockerfile? Or at least how you're running your script inside the container? It seems like you've added an app that only contains the health endpoint, which won't run when you're calling the script via:

python yourscript.py

This lack of a health endpoint might explain the behaviour that you're seeing.

Then again, when I try to run a variant of your application with uvicorn, I get errors.

> uvicorn fastprodigy:app

Traceback (most recent call last):
  File "/home/vincent/Development/prodigy-bad-img-demo/venv/bin/uvicorn", line 8, in <module>
    sys.exit(main())
  File "/home/vincent/Development/prodigy-bad-img-demo/venv/lib/python3.8/site-packages/click/core.py", line 829, in __call__
    return self.main(*args, **kwargs)
  File "/home/vincent/Development/prodigy-bad-img-demo/venv/lib/python3.8/site-packages/click/core.py", line 782, in main
    rv = self.invoke(ctx)
  File "/home/vincent/Development/prodigy-bad-img-demo/venv/lib/python3.8/site-packages/click/core.py", line 1066, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/vincent/Development/prodigy-bad-img-demo/venv/lib/python3.8/site-packages/click/core.py", line 610, in invoke
    return callback(*args, **kwargs)
  File "/home/vincent/Development/prodigy-bad-img-demo/venv/lib/python3.8/site-packages/uvicorn/main.py", line 362, in main
    run(**kwargs)
  File "/home/vincent/Development/prodigy-bad-img-demo/venv/lib/python3.8/site-packages/uvicorn/main.py", line 386, in run
    server.run()
  File "/home/vincent/Development/prodigy-bad-img-demo/venv/lib/python3.8/site-packages/uvicorn/server.py", line 49, in run
    loop.run_until_complete(self.serve(sockets=sockets))
  File "/usr/lib/python3.8/asyncio/base_events.py", line 616, in run_until_complete
    return future.result()
  File "/home/vincent/Development/prodigy-bad-img-demo/venv/lib/python3.8/site-packages/uvicorn/server.py", line 56, in serve
    config.load()
  File "/home/vincent/Development/prodigy-bad-img-demo/venv/lib/python3.8/site-packages/uvicorn/config.py", line 308, in load
    self.loaded_app = import_from_string(self.app)
  File "/home/vincent/Development/prodigy-bad-img-demo/venv/lib/python3.8/site-packages/uvicorn/importer.py", line 20, in import_from_string
    module = importlib.import_module(module_str)
  File "/usr/lib/python3.8/importlib/__init__.py", line 127, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
  File "<frozen importlib._bootstrap>", line 1014, in _gcd_import
  File "<frozen importlib._bootstrap>", line 991, in _find_and_load
  File "<frozen importlib._bootstrap>", line 975, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 671, in _load_unlocked
  File "<frozen importlib._bootstrap_external>", line 783, in exec_module
  File "<frozen importlib._bootstrap>", line 219, in _call_with_frames_removed
  File "./fastprodigy.py", line 47, in <module>
    prodigy.serve(model, dataset_name, json_file, port=port)
  File "/home/vincent/Development/prodigy-bad-img-demo/venv/lib/python3.8/site-packages/prodigy/__init__.py", line 50, in serve
    server(controller, controller.config)
  File "/home/vincent/Development/prodigy-bad-img-demo/venv/lib/python3.8/site-packages/prodigy/app.py", line 549, in server
    uvicorn.run(
  File "/home/vincent/Development/prodigy-bad-img-demo/venv/lib/python3.8/site-packages/uvicorn/main.py", line 386, in run
    server.run()
  File "/home/vincent/Development/prodigy-bad-img-demo/venv/lib/python3.8/site-packages/uvicorn/server.py", line 49, in run
    loop.run_until_complete(self.serve(sockets=sockets))
  File "/usr/lib/python3.8/asyncio/base_events.py", line 592, in run_until_complete
    self._check_running()
  File "/usr/lib/python3.8/asyncio/base_events.py", line 552, in _check_running
    raise RuntimeError('This event loop is already running')
RuntimeError: This event loop is already running
sys:1: RuntimeWarning: coroutine 'Server.serve' was never awaited
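
The RuntimeError above suggests that prodigy.serve tried to start its own uvicorn server while uvicorn's event loop was already running for the FastAPI app. If a separate /health endpoint is really needed, one option might be to serve it from a background thread with the standard library and keep Prodigy in the main thread. This is only a sketch, not something Prodigy prescribes; the helper names and the port 8081 are assumptions.

```python
import threading
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer


class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == "/health":
            body = b"ok"
            self.send_response(200)
            self.send_header("Content-Type", "text/plain")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)
        else:
            self.send_error(404)

    def log_message(self, format, *args):
        pass  # keep probe requests out of the logs


def start_health_server(port, host="0.0.0.0"):
    """Serve /health from a daemon thread and return the server object."""
    server = ThreadingHTTPServer((host, port), HealthHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server


# Hypothetical usage: health check on 8081, Prodigy on 8080.
# start_health_server(8081)
# prodigy.serve("topic-annotation text_topics_selected_1 topic_dataset_v2_selected_1.jsonl", port=8080)
```

The two servers need different ports, and the Kubernetes probe would have to point at the health port.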

Here is the Dockerfile:

FROM python:3.8-slim

WORKDIR /app

RUN pip install --upgrade pip \
    && pip install psycopg2-binary \
    && pip install sqlalchemy

RUN pip install prodigy -f https://*********@download.prodi.gy
COPY prodigy.json /app/
COPY requirements.txt /app/
RUN pip install --upgrade pip \
    && pip install -r /app/requirements.txt
COPY topic_dataset_v2_selected_1.jsonl /app/
COPY setup.py /app/
ENV PRODIGY_HOME /app
ENV PRODIGY_LOGGING "verbose"
ENV PRODIGY_ALLOWED_SESSIONS "philsy1,philsy2,philsy3"
ENV PRODIGY_BASIC_AUTH_USER "admin"
ENV PRODIGY_BASIC_AUTH_PASS "password"
ENV PRODIGY_HOST="0.0.0.0"
ENV PRODIGY_PORT=8080
COPY prodigycloudtest/app.py /app/
CMD ["python", "-u", "app.py"]
#CMD python -m ./prodigycloudtest/app.py
#CMD python -m prodigy topic-annotation text_topics_selected_1 ./topic_dataset_v2_selected_1.jsonl -F ./app.py

Right. I think what's happening is that your script is running without a health endpoint, which is why Kubernetes may think the container isn't functioning and it is trying to restart it.

When you run the script like this:

CMD ["python", "-u", "app.py"]

Then the FastAPI app isn't running.

Is it possible to point the health endpoint to the root URL on Kubernetes' side? Can you check if the Kubernetes logs confirm that it's constantly restarting the service?

I have added the healthcheck in the code. So what could be the reason that FastAPI is not running?

I have pointed the healthpoint to the root URL. The status (200) is then shown on the annotation page. But how can I invoke Prodigy then?

As far as I'm aware you'd define an app with FastAPI like so:

from fastapi import FastAPI

app = FastAPI()

@app.get("/")
def root():
    return {"message": "hello world again"}

But you'd run it using a server like uvicorn:

uvicorn app:app --reload

Merely running it like a Python script won't cause the server to start.

When you say "added the healthpoint", are you referring to your Kubernetes config? I'm also curious; why did you attach a FastAPI server to your script? I am assuming that you did this only to add a health endpoint but if there's another reason I'll gladly hear it.

Yes, I added it to give Kubernetes a health endpoint. I have now managed to run Prodigy without any issues. When we run

uvicorn app:app --reload

do we have to specify the port? When I add the port, I get an error from Prodigy saying that the port is already in use.

I'm happy to hear Prodigy runs without issues now.

With regards to the port, I'm assuming that you're running Prodigy on port 8000? If so, that port can't be used by FastAPI as well.
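
To see the conflict in isolation, a small standard-library sketch like the following can check whether something is already listening on a port before you start a second server on it. The helper name `port_is_free` is made up for illustration.

```python
import socket


def port_is_free(port, host="127.0.0.1"):
    """Return True if nothing is bound to host:port, False otherwise."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        try:
            sock.bind((host, port))
        except OSError:
            # Typically "address already in use": another server owns the port.
            return False
        return True
```

If this returns False for the port you pass to uvicorn while Prodigy is running, the two servers are competing for the same port and one of them needs a different one.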

However, given that your Kubernetes cluster currently points to the root URL of Prodigy as a health service, I think you no longer need the extra FastAPI service.

I am running Prodigy on port 8080. I checked this with other ports as well. It is working now. Thank you for the support.

I have a question about writing annotations to the database. Is it possible to write the annotated data directly to BigQuery from Prodigy?

Prodigy doesn't natively support BigQuery as a backend. I think BigQuery could theoretically support the data that Prodigy stores, but if you want direct support you will need to write your own database implementation.

There may be some edge cases to double-check though. The main thing I can think of is that you can store images in base64 encoding in Prodigy. In practice, this is just a very long string, but I don't know how large images might impact upper limits. Their docs suggest that there's a 10MB limit per column value so for many use-cases this feels fine, but it's an edge case to be aware of if you're working with images.

Another alternative is to work with a scheduler, like cron. This could trigger a Python script that runs maybe daily, to upload new annotations to BigQuery. If you're running on the Google Cloud stack, you might want to use a cloud function and cloud scheduler for this task.
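
A scheduled script along those lines might look roughly like this sketch. The row layout, the table ID and the helper names are assumptions, and the method for fetching examples from the Prodigy database may differ between Prodigy versions, so please treat it as a starting point rather than a finished implementation.

```python
import json


def to_bigquery_rows(examples):
    """Flatten Prodigy examples into JSON-serialisable rows (hypothetical schema)."""
    rows = []
    for eg in examples:
        rows.append({
            "input_hash": eg.get("_input_hash"),
            "task_hash": eg.get("_task_hash"),
            "answer": eg.get("answer"),
            "accepted": eg.get("accept", []),
            "payload": json.dumps(eg),  # full example, kept as a JSON string
        })
    return rows


def upload_dataset(dataset, table_id):
    # Imported here so the conversion above can be tested without these packages.
    from google.cloud import bigquery
    from prodigy.components.db import connect

    db = connect()  # uses the database settings from prodigy.json
    # Assumption: get_dataset is available in your Prodigy version.
    examples = db.get_dataset(dataset) or []
    errors = bigquery.Client().insert_rows_json(table_id, to_bigquery_rows(examples))
    if errors:
        raise RuntimeError(f"BigQuery rejected some rows: {errors}")
```

Note that this uploads the whole dataset on every run. To upload only new annotations, you would need to track which task hashes are already in the table.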

Hi,
I have a question about reading data from a GCS bucket. I can read a single file from a GCS bucket as the input file. But is it possible to read the files in a particular folder of the bucket one by one?

Could you clarify what command you're running when you refer to a gcs bucket as an input file? The simplest way to pass data to Prodigy is by using .jsonl files on disk. These can be fetched beforehand from a storage bucket, but they need to be downloaded upfront.

Sorry for the late reply. Thank you for the info. As you suggested, I tried downloading the blob and reading the JSON, and it worked:
input_data_string = blob.download_as_string()
json_data = ndjson.loads(input_data_string)
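
For the original question about a whole folder, the same approach could probably be extended to every file under a prefix. This is a sketch assuming the google-cloud-storage client library; the bucket name, prefix and helper names are placeholders, and it uses the standard json module instead of ndjson.

```python
import json


def parse_jsonl(text):
    """Parse newline-delimited JSON, skipping blank lines."""
    return [json.loads(line) for line in text.split("\n") if line.strip()]


def stream_from_bucket(bucket_name, prefix):
    """Yield examples from every .jsonl blob under a prefix, one file at a time."""
    # Imported here so parse_jsonl can be used without the GCS client installed.
    from google.cloud import storage

    client = storage.Client()
    for blob in client.list_blobs(bucket_name, prefix=prefix):
        if not blob.name.endswith(".jsonl"):
            continue
        yield from parse_jsonl(blob.download_as_text())
```

Inside a custom recipe, the generator returned by `stream_from_bucket("my-bucket", "my-folder/")` could take the place of `JSONL(file_path)`.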

Happy to hear it! And thanks for letting us know :)