Deploying Prodigy on Cloud Platform (Heroku)

Hello, I have been trying to deploy Prodigy on Heroku but haven't managed it yet. I'm fairly inexperienced with web applications and cloud deployment. Is there a specific guide on how to deploy Prodigy on Heroku, or is it not possible? I found some posts about using a Dockerfile, but I'd prefer not to go that way. Could someone who has deployed Prodigy to the cloud, especially to Heroku, give me a hand with this?

Hi! In general, you can deploy Prodigy like any other Python app that starts a web server. I haven't really used Heroku myself, but I found this guide, which looks pretty straightforward:

In your case, the command and Python script you run would be the prodigy command to start the Prodigy server. You just need to make sure that you also upload the Prodigy wheel and specify it in your requirements.txt, so it can be installed on the server: Python Dependencies via Pip | Heroku Dev Center

If you need to configure the host and port to run Prodigy on, you can set the PRODIGY_HOST and PRODIGY_PORT environment variables.

I do think that if you're just starting out and haven't done much cloud deployment, Docker could be a good option? It'll take care of setting up the environment for you, so you won't have to worry about any of that. Here's a Dockerfile that might help: Cloud deploy dockerfile

I tried the tutorial "How to Deploy a Python Script or Bot to Heroku in 5 Minutes", but I couldn't get the app to start.

requirements.txt (license changed)

--extra-index-url https://1111-22AB-3344-55CD@download.prodi.gy/index
prodigy>=1.11.0,<2.0.0

Procfile

web: prodigy mark.py
worker: prodigy mark.py




file structure in GitHub
├── Procfile
├── requirements.txt
├── mark.py
└── images.jsonl

I get the error:
Can't find recipe or command 'run'

My question would be, how do I run the recipe with my specifications, like I would in the command window. For example, how would I start this recipe on my cloud application?
$ prodigy mark fing_lens_images ./images.jsonl --loader jsonl --label GOOD_IMAGE --view-id classification

Do I have to run from within a Python file, something like os.system("prodigy mark...")?

Additionally, do I have to specify the port / location somewhere? Something like this, with a Flask example:

if __name__ == "__main__":
    port = int(os.environ.get("PORT", 5000))
    app.run(host='0.0.0.0', port=port)

Best, Matt

Is the Procfile supposed to include the command to run to start the server? In that case, I think you want to put the recipe call in there, e.g. prodigy mark fing_lens_images ... and so on.

Alternatively, you can also call into prodigy.serve from Python: Components and Functions · Prodigy · An annotation tool for AI, Machine Learning & NLP

You can specify the host and port via keyword arguments here, or put it in your prodigy.json, or define it via the environment variables PRODIGY_HOST and PRODIGY_PORT.
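To make the two options above concrete, here is a minimal sketch of a launcher script that answers both earlier questions at once: it starts the recipe from Python and passes the host and port through PRODIGY_HOST and PRODIGY_PORT. The helper names (prodigy_env, recipe_argv, launch) are hypothetical, the 8080 fallback is only an assumption for local runs, and the recipe arguments are copied from the question.

```python
import os
import subprocess
import sys


def prodigy_env(environ):
    """Return a copy of environ with Prodigy's host/port variables filled in."""
    env = dict(environ)
    # Heroku assigns the port through PORT; 8080 is only a local fallback.
    env["PRODIGY_PORT"] = str(env.get("PORT", 8080))
    # Bind to all interfaces so the platform's router can reach the server.
    env["PRODIGY_HOST"] = "0.0.0.0"
    return env


def recipe_argv():
    """The recipe call from the question, as an argument list."""
    return [
        sys.executable, "-m", "prodigy",
        "mark", "fing_lens_images", "./images.jsonl",
        "--loader", "jsonl",
        "--label", "GOOD_IMAGE",
        "--view-id", "classification",
    ]


def launch():
    """Start Prodigy as a child process (requires Prodigy to be installed)."""
    return subprocess.run(recipe_argv(), env=prodigy_env(os.environ), check=True)
```

Calling launch() from the script named in the Procfile would then start the server. Putting the plain prodigy command into the Procfile, as suggested above, avoids the extra process.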

Yes, my Procfile was wrong. In the meantime I was able to get Prodigy started.
I also added a prodigy.json:

{
...
  "port": 80,
  "host": "0.0.0.0",
...
}

Although I can start Prodigy now, the Prodigy web interface doesn't show up at the Heroku app's URL (e.g. https://myappname.herokuapp.com ). I only get the message that the web server has been launched:
Starting the webserver at http://0.0.0.0:80

I can change the PORT / HOST variables in the environment. Using the port Heroku provides via PRODIGY_PORT=$PORT, as some have suggested, doesn't change the outcome either.

Heroku logs:
heroku[router]: at=error code=H10 desc="App crashed" method=GET path="/"
host=myappname.herokuapp.com request_id=e0dbbb...
fwd="xx.xx.xxx.xxx" dyno= connect= service= status=503 bytes= protocol=https

Since I'm not very familiar with cloud deployment: could it be that Heroku is less suitable for Prodigy?
Something I found:

Heroku vs Docker
Environment: One of the most important differences between Heroku and Docker is that Heroku must run in its own cloud environment, while Docker can run in an environment of your choice, whether that's your laptop, a remote server, or a public cloud service like Amazon Web Services (AWS).

(Heroku worked with Flask when I used gunicorn.)

In the meantime I have tried a lot of things, but none of them worked.
Could you recommend a tutorial or documentation with some hints on how to solve this?

Since Prodigy doesn't have a specific "file.py" that I start, it's different from deploying Flask / Django (which works).

Thanks, Matt

Hi @Matt2021 !

Just to make sure that we've covered all bases, can you try the following (in order of importance):

  1. Go to the Heroku settings of your app, then Config Vars, and set WEB_CONCURRENCY to 1. Heroku seems to default to 2 (on the free tier), and Prodigy only needs a single worker.

  2. Create a file in your project root, main.py, and call prodigy.serve there. Here's a sample of what it looks like:

import os

import prodigy

# Use the port Heroku assigns via $PORT, not Prodigy's default
port = int(os.environ.get("PORT", 8080))
# Bind to all interfaces, not "localhost"
host = "0.0.0.0"

if __name__ == "__main__":
    prodigy.serve(
        "<TODO>",  # e.g. ner.manual test ...
        host=host,
        port=port,
    )

Then in your Procfile you should add:

web: python main.py

I am pretty sure that your original approach (using prodigy.json and supplying the prodigy command in the Procfile directly) will still work. But just sharing what has worked for me.

  3. If you're still debugging, I also recommend turning on logging. Be careful, though, because this might expose sensitive data, especially if your ports are exposed. See: https://prodi.gy/docs/install#debugging-logging

Again, go to the Heroku settings of your app, then Config Vars, and set PRODIGY_LOGGING to verbose.

Hello @ljvmiranda921, thank you very much for your help! It worked and displayed the prodigy interface.

Could you give me some further guidance on the database? My goal is to start different sessions for different users, who can annotate images independently. For that I have to rely on a PostgreSQL database.

So far I have activated the Postgres database in Heroku and changed the relevant part of the prodigy.json:

{
  "db": "postgresql",
  "db_settings": {
    "postgresql": {
      "dbname": "prodigy",
      "user": "username-given-by-heroku",
      "password": "password-given-by-heroku"
    }
  }
}

But with this change in place, I can't access the app anymore. My question would be, do I have to add import psycopg2 and establish a connection with psycopg2 inside the main.py?

Or do I have to use something similar to "environ.get", because Heroku says:
"Database Credentials: Please note that these credentials are not permanent. Heroku rotates credentials periodically and updates applications where this database is attached."

Best, Matt

Hi @Matt2021 !

Glad it worked!

My question would be, do I have to add import psycopg2 and establish a connection with psycopg2 inside the main.py?

For the database, you just need to ensure that the driver is installed with the app. Prodigy doesn't need anything beyond the driver. Can you check in the logs whether it's connecting properly? You can test the connection by running the script here: Database · Prodigy · An annotation tool for AI, Machine Learning & NLP

"Database Credentials: Please note that these credentials are not permanent. Heroku rotates credentials periodically and updates applications where this database is attached."

Perhaps it's similar to how $PORT works. If that's the case, you can try setting the environment variables for Postgres, similar to here: PostgreSQL: Documentation: 16: 34.15. Environment Variables

My goal is to start different sessions for different users, who can annotate images independently.

Another option is to still use SQLite with a file on disk. You just need to ensure that the database isn't wiped whenever the app restarts.

Thank you, the hint to the Environment Variables solved it. I entered these variables DATABASE_URL, PGHOST, PGPASSWORD, PGPORT, PGUSER into the Config Vars of Heroku, which worked.
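Since Heroku rotates the credentials, one possible alternative to typing the PG* values into the Config Vars by hand is to derive them from DATABASE_URL at startup. The following is only a sketch: pg_env_from_url is a hypothetical helper, PGDATABASE is the standard libpq variable for the database name and was not mentioned above, and the 5432 fallback is PostgreSQL's default port.

```python
from urllib.parse import unquote, urlsplit


def pg_env_from_url(database_url):
    """Split a postgres:// URL into the libpq environment variables."""
    parts = urlsplit(database_url)
    return {
        "PGHOST": parts.hostname or "",
        # Fall back to PostgreSQL's default port if the URL has none.
        "PGPORT": str(parts.port or 5432),
        # Credentials may be percent-encoded inside a URL.
        "PGUSER": unquote(parts.username or ""),
        "PGPASSWORD": unquote(parts.password or ""),
        "PGDATABASE": parts.path.lstrip("/"),
    }
```

In main.py this could be applied before prodigy.serve is called, for example with os.environ.update(pg_env_from_url(os.environ["DATABASE_URL"])). Whether the driver picks the values up this way should be checked in the logs.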

One last question concerning the use of a custom recipe.
I changed the prodigy.serve command in the main.py :

"image-caption-loop data_testset ./load_images.jsonl ./mark_loop.py"

As suggested in other posts, I checked the dash in -F and tested the command with and without -F, as well as with and without the .py ending (see: prodigy.serve does not work with custom recipe).

I get the error: "✘ Can't find recipe 'image-caption-loop".

The recipe in mark_loop.py looks like this:

import prodigy
from prodigy.components.loaders import JSONL


@prodigy.recipe(
    "image-caption-loop",
    dataset=("The dataset to save to", "positional", None, str),
    file_path=("Path to images", "positional", None, str),
)
def image_caption_loop(dataset, file_path):
    # Blocks of the interface
    blocks = [
        {"view_id": "classification"}
    ]

    def get_stream():
        # Yield every example once per label
        for label in ["FIRST_LABEL", "SECOND_LABEL"]:
            # Reload the JSONL file for each label; the path is passed
            # when the recipe is run, e.g. ./img
            examples = JSONL(file_path)
            for eg in examples:
                eg["label"] = label
                yield eg

    return {
        "dataset": dataset,
        "stream": get_stream(),
        "view_id": "blocks",
        "config": {"blocks": blocks}
    }

Do I have to specify the database somehow in the recipe?
Thanks again for your help!

Hi @Matt2021 ,

Glad it worked!

Just a sanity-check, are we sure the mark_loop.py is being uploaded in the Heroku instance?
Also, does this work locally? For the former, you can check your files by running:

heroku run bash
ls .

I don't think you need to specify the database.

The mark_loop.py file is uploaded in the Heroku instance, as I checked again.
Also, I can run the custom recipe directly on my local computer with: python -m prodigy image_caption_loop data_testset ./load_images.jsonl -F mark_loop.py

But if I start the custom recipe via the main.py locally, it produces the same error: ✘ Can't find recipe 'image-caption-loop. I can, however, start one of the prebuilt Prodigy recipes locally via the main.py, e.g. image.manual images_dataset ..., which is found and runs.

Solution
I saw this post suggesting to add the custom recipe to the main.py together with the serve command: prodigy.serve does not work with custom recipe - #2 by ines. That works.

So I will test it to see whether everything works with the database.

Once again, thank you @ljvmiranda921 for your help!
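For readers who land here with the same error, the fix described under "Solution" might look roughly like the sketch below. The thread only confirms that defining the recipe in main.py works. Importing mark_loop instead is an assumption (the import runs the @prodigy.recipe decorator), and recipe_command and main are hypothetical names.

```python
import os


def recipe_command(dataset, source):
    """Build the recipe string for prodigy.serve (no -F flag, no file path)."""
    return f"image-caption-loop {dataset} {source}"


def main():
    # Imported inside the function so the helper above can be checked
    # without Prodigy installed.
    import prodigy

    # Assumption: importing the module runs the @prodigy.recipe decorator in
    # mark_loop.py, which has the same effect as defining the recipe here.
    import mark_loop  # noqa: F401

    prodigy.serve(
        recipe_command("data_testset", "./load_images.jsonl"),
        host="0.0.0.0",
        port=int(os.environ.get("PORT", 8080)),
    )
```

With web: python main.py in the Procfile, main.py would end with a call to main().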

Edit - solved
Is there a workaround for using the "--remove-base64" flag? It also isn't recognized by prodigy.serve when I use the custom recipe.

This post with a function def before_db(examples) solved it: Labelling a set of images (classification) - #3 by strickvl
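A before_db callback of the kind referred to here could look like the following sketch. It assumes that each task stores the image as a base64 data URI under "image" and the original file location under "path". Check your own tasks before relying on those keys.

```python
def before_db(examples):
    """Replace inline base64 image data with the file path before saving."""
    for eg in examples:
        image = eg.get("image", "")
        # Only swap when the image is a data URI and the path is available.
        if isinstance(image, str) and image.startswith("data:") and "path" in eg:
            eg["image"] = eg["path"]
    return examples
```

The function is then returned from the custom recipe next to the other components, i.e. as "before_db": before_db in the dictionary that also holds "dataset", "stream" and "view_id".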

Hi @ljvmiranda921 ,

After annotating on Heroku, how do I pull the annotations to my local computer?

On my Heroku app, I have annotations saved to the "main-db" dataset.

However, when I run heroku run prodigy stats -ls, I get the following results:

============================== ✨  Prodigy Stats ==============================

Version          1.11.6                        
Location         /app/.heroku/python/lib/python3.10/site-packages/prodigy
Prodigy Home     /app/.prodigy                 
Platform         Linux-4.4.0-1101-aws-x86_64-with-glibc2.31
Python Version   3.10.4                        
Database Name    SQLite                        
Database Id      sqlite                        
Total Datasets   0                             
Total Sessions   0         

Hmm, it's quite unusual that the total number of datasets isn't registering in the app. To be sure, you could download the database file itself from /app/.prodigy/prodigy.db. Is main-db stored in a SQLite database, or did you configure something on Heroku to use a different backend?

I'm not well-versed with Heroku, but I remember that you can use something like ps:copy to pull files from a Dyno server.


@vinitrinh Hello! Were you able to resolve this issue of seeing 0 datasets/sessions when running heroku run prodigy stats?

Hello all. I have the same problem as @vinitrinh. After deploying Prodigy to Heroku (thanks to @Matt2021 and @ljvmiranda921 ) I still cannot make the database work. Everything seems to work fine and annotations seem to be saved to the custom dataset, but when I run heroku run prodigy stats I get 0 datasets:

My main file:

My prodigy.json file, which I added to the /app folder (I tried to override the default prodigy.json; I'm not sure whether it was a good idea to customize it this way with another copy in the app folder):

@miladrogha in the future, please refrain from posting screenshots. These are impossible to copy/paste, often harder to read and they won't be indexed by search engines.

Tagging team members on the forum also won't guarantee that they'll be able to respond. We're a team that's handling the question and the person who responds depends on availability.

I'm not that familiar with the Heroku platform, but I wonder if you've deployed Prodigy as a serverless service. If so, the state may be lost after a while because the containers can spin up or down. Since SQLite typically stores its data in a file on disk, you may need to configure a PostgreSQL database hosted by Heroku instead.

Thanks @koaning . Sorry for the confusion.

So I switched to Heroku's PostgreSQL. The only problem is that the data stored in the content column of the Examples table looks weird:

\x7b2274657874223a22447572696e6720323032312077652067656e65726174656420726576656e7565206f662024362e322062696c6c696f6e2c2075702034252066726f6d20323032302e222c225f696e7

I was able to resolve the issue. It seems this is just the way PostgreSQL stores the data. To access the data from your local terminal, you can use db-out and write:

heroku run prodigy db-out name-of-your-dataset > <output-path> --dry

For example:

prodigy db-out Db1 > ~/res.jsonl --dry

This stores the annotations in the dataset (for example "Db1") in JSONL format, which you can easily use later.

More on the raw format (it is actually the "bytea" data type): bytea type
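To see that nothing is wrong with the stored data, the hex form shown above can be decoded by hand. This is a small sketch, with decode_bytea_hex as a hypothetical helper. It assumes the value is in PostgreSQL's hex output format (a \x prefix followed by two hex digits per byte) and that the bytes are UTF-8 encoded JSON, which matches the sample.

```python
def decode_bytea_hex(value):
    """Decode a bytea value in hex output format ('\\x' + hex digits) to text."""
    if not value.startswith("\\x"):
        raise ValueError("expected a bytea value in hex format, starting with \\x")
    return bytes.fromhex(value[2:]).decode("utf-8")


# The first bytes of the value quoted above (the full value is truncated there).
sample = "\\x7b2274657874223a22447572696e672032303231"
print(decode_bytea_hex(sample))  # prints {"text":"During 2021
```

For regular exports db-out remains the more convenient route. The decoder is mainly useful when inspecting rows directly in the database.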


Hi @miladrogha! Thanks for posting your solution! Let us know if you have any further questions.