Getting jupyterlab-prodigy to work on a google cloud platform VM

I want to run a Prodigy project on a virtual machine (a Google Cloud Platform Vertex AI Workbench, to be specific) so I can label manually from any machine.

I installed jupyterlab-prodigy, but when I start trying to annotate, the Prodigy tab says "localhost refused to connect". Here are some things I've tried:

  • Tried changing host = "*" in prodigy.json by creating a custom_prodigy.json and running export PRODIGY_CONFIG="custom_prodigy.json". The JSON looks like this:
{
  "theme": "basic",
  "custom_theme": {},
  "buttons": ["accept", "reject", "ignore", "undo"],
  "batch_size": 10,
  "history_size": 10,
  "port": 8000,
  "host": "*",
  "cors": true,
  "db": "sqlite",
  "db_settings": {},
  "validate": true,
  "auto_exclude_current": true,
  "instant_submit": false,
  "feed_overlap": false,
  "auto_count_stream": false,
  "total_examples_target": 0,
  "ui_lang": "en",
  "project_info": ["dataset", "session", "lang", "recipe_name", "view_id", "label"],
  "show_stats": false,
  "hide_meta": false,
  "show_flag": false,
  "instructions": false,
  "swipe": false,
  "swipe_gestures": { "left": "accept", "right": "reject" },
  "split_sents_threshold": false,
  "html_template": false,
  "global_css": null,
  "javascript": null,
  "writing_dir": "ltr",
  "show_whitespace": false,
  "exclude_by": "task"
}

That is to say, going to http://localhost:8000 didn't work, nor did my own IP address with port 8000, nor the VM's IP address.

  • Tried ensuring the port I was using was open and not in use
  • Tried changing the prodigyConfig URL in the Advanced Settings Editor to make sure it was set correctly there. It now looks like this under the "User Preferences" tab of the "Prodigy Jupyter Extension" section:
{    "prodigyConfig": {
        "url": "http://localhost:8000"
    }
}
  • Tried opening it up in a separate browser tab
  • Tried curling the URL, which worked:
 > curl localhost:8000
<!DOCTYPE html>
<html lang="en">
    <head>
        <meta charset="UTF-8">
        <meta name="viewport" content="width=device-width, initial-scale=1.0">
        <meta name="apple-mobile-web-app-capable" content="yes">
        <link rel="shortcut icon" href="favicon.ico">
        <title>Prodigy</title>
    </head>

    <body>
        <div id="root"></div>
        <script src="bundle.js"></script>
    </body>
</html>
(base) jupyter@test-bert:~$
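One way to double-check the second bullet (that the port is genuinely free) is to try binding it; a minimal Python sketch, run on the VM itself:

```python
import socket

def port_is_free(port: int, host: str = "127.0.0.1") -> bool:
    """Return True if nothing is currently bound to host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        try:
            s.bind((host, port))   # fails with EADDRINUSE if taken
            return True
        except OSError:
            return False

# 8000 is the port from the custom_prodigy.json above; adjust as needed.
print(port_is_free(8000))
```

This complements the curl test: curl tells you something answers on the port, while a failed bind tells you something is already holding it.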

None of which has gotten this working. How do I get this to work?

hi @sangersteel!

Thanks for your question and the details!

Was there a specific reason why you changed the port from the default 8080 to 8000? You did a good job of modifying the prodigyConfig url, so I don't think that's the problem. I'm more curious whether it would work if you kept the default 8080 and nothing else were running on it.

I found a related post:

One thing to check is enabling HTTP/HTTPS traffic:

It might also be better if you enable HTTP (or HTTPS) traffic in your Google Cloud VM (or the VM where the Vertex AI notebook runs).

But taking a step back, would you really want to set host = "*" for Prodigy?

I suspect you did this so you could open up Prodigy's port to the outside -- however, since you're using JupyterLab, it's JupyterLab's port that needs to be open, not Prodigy's (you're really looking at JupyterLab, which just happens to proxy Prodigy locally). To test this, does it work if you revert to host = "localhost"? Or alternatively, as that post recommends, host = "0.0.0.0"?
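For that test, note that your custom_prodigy.json only needs to contain the settings you want to override -- Prodigy fills in defaults for everything else -- so a minimal config would be just:

```json
{
  "host": "localhost",
  "port": 8000
}
```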

I'm still not confident this will solve it but at least it'll rule out some issues. Let us know if any of the above works or doesn't work.

Thanks for your reply, Ryan. I checked the firewall rules for HTTP and HTTPS traffic:

So I don't see this being the problem.

I changed it to host = "0.0.0.0" and it still didn't work.

I'm not using port 8080 because when I try to, I get this error message:

⚠ Config setting 'exclude_by' defined in recipe is overwritten by a
different value set in the global or local prodigy.json. This may lead to
unexpected results and potentially changes to the core behavior of the recipe.
If that's surprising, you should probably remove the setting 'exclude_by' from
your prodigy.json.

✨  Starting the web server at http://localhost:8080 ...
Open the app in your browser and start annotating!

ERROR:    [Errno 98] error while attempting to bind on address ('127.0.0.1', 8080): address already in use

If I do ps aux | grep 8080 I get:

root       345  0.0  0.0   8080  4720 ?        Ss   15:17   0:00 /usr/sbin/haveged --Foreground --verbose=1 -w 1024
jupyter  23942  0.0  0.0   4964   880 pts/0    S+   16:58   0:00 grep 8080

hi @sangersteel!

I'm curious: did you get this error serving Prodigy in Jupyter or in a terminal?

Related, if you shut down jupyter, are you able to serve Prodigy on port 8080?

Ideally you should be able to run on port 8080. In theory the port shouldn't matter if you update everything else, but if we can figure out why the port is in use, we can stick with the defaults.

I'm not sure what that root process is. Would it be possible to kill that process?
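One caveat with that ps aux | grep 8080 output: grep matches "8080" anywhere in the line, and in the haveged line it's actually the VSZ (memory) column, not a port. A quick illustration:

```python
# ps aux columns: USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND.
# The 5th field (VSZ, virtual memory in KiB) is what "8080" matched
# for haveged -- not a listening port.
ps_line = ("root 345 0.0 0.0 8080 4720 ? Ss 15:17 0:00 "
           "/usr/sbin/haveged --Foreground --verbose=1 -w 1024")
print("8080" in ps_line)   # True: grep-style substring match
print(ps_line.split()[4])  # 8080 -- the VSZ column, not a port
```

Something like ss -ltnp or lsof -i :8080 lists actual listeners instead.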

However, others have found the problem is shutting down Prodigy in jupyter. Here's one approach:

Last three ideas... but I'm not confident they'll do anything:

  • Any chance you're running NGINX? If so, it could be the reason:
  • You should open Prodigy through the JupyterLab launcher tile. Did you try this?

  • What version of JupyterLab are you running? It should be 2.0.0+.

Hopefully something above will help you but let us know if you still can't get things to work.

I'm using version 3.2.9 of JupyterLab. I noticed HTTP and HTTPS were disabled, but even after allowing them I still get the same error. I'm using Vertex AI's Workbenches, which have JupyterLab pre-installed. I've been serving Prodigy from a terminal within JupyterLab.

From checking what's using port 8080:

netstat -plten | grep python3.7

(Not all processes could be identified, non-owned process info
 will not be shown, you would have to be root to see it all.)
tcp        0      0 127.0.0.1:8080          0.0.0.0:*               LISTEN      1006       110009     13876/python3.7     
tcp6       0      0 ::1:8080                :::*                    LISTEN      1006       110010     13876/python3.7

Trying to kill these processes crashed my terminal. Whenever I create a new notebook using Vertex AI Workbench, it seems 8080 is already in use as listed above.

I tried a clean Debian VM and it still didn't work, though maybe I'm not doing every step perfectly. I wish there were a guide for this; my best approach is piecing together discussion threads and praying it works. I tried host: "*" and connecting to localhost, and once again it didn't work in that VM.

Still stuck. Still frustrated.

Hi @sangersteel , my hunch is that in Vertex AI, port 8080 is already taken: your netstat output shows a python3.7 process listening on 127.0.0.1:8080, most likely the Jupyter server that Workbench itself runs. Treat port 8080 as "reserved" for now.
That's why Prodigy cannot start: the address is already in use. So you might want to:

  • Change the port Prodigy runs on from 8080 to something not reserved (try 8888 or 8889), for example via PRODIGY_PORT (or the prodigy.json file):
PRODIGY_PORT=8888 prodigy ...
  • Then configure the JupyterLab extension so that it sees the port where Prodigy is running, as mentioned here:
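If you'd rather use the config file, the prodigy.json equivalent of PRODIGY_PORT=8888 is simply:

```json
{
  "port": 8888
}
```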

To be sure, you might want to try this in a new VM, in case you've shut down some reserved processes. Just for a cleaner slate. Hope that helps!

This has been doing my head in, so I'm going to explicitly outline everything I'm doing:

  1. Initialized a new Python3 VM with HTTP and HTTPS communications enabled
  2. Installed prodigy on the VM using my license key.
  3. pip install jupyterlab-prodigy
  4. Went into Advanced Settings Editor and set the following under "User Preferences":
{    "prodigyConfig": {
        "url": "http://localhost:8888"
    }
}
  5. Initialized my Prodigy server:
PRODIGY_PORT=8888 python3 -m prodigy ner.manual ner_test blank:en train.jsonl --label label1,label2,label3
Using 3 label(s): label1,label2,label3
Added dataset ner_test to database SQLite.

✨  Starting the web server at http://localhost:8888 ...
Open the app in your browser and start annotating!
  6. Went to "Open Prodigy" in my Launcher. I get "localhost refused to connect".

Okay. Now I'll try something else.

  1. Put http://localhost:8888 in my web browser. A Jupyter page asking for a login token pops up.

  2. Odd. Can I get past this? I ran jupyter notebook list to try to find a token to feed it, and the output is literally blank:
(base) jupyter@prodigy-test:~$ jupyter notebook list
Currently running servers:
  3. Maybe I need to make my host="*" or host="0.0.0.0"?

  4. Changed "User Preferences" to:

{    "prodigyConfig": {
        "url": "http://0.0.0.0:8888"
    }
}
  5. Now ran:
PRODIGY_CONFIG_OVERRIDES='{"host": "0.0.0.0"}' PRODIGY_PORT=8888 python3 -m prodigy ner.manual ner_test blank:en train.jsonl --label label1,label2,label3
✨  Starting the web server at http://0.0.0.0:8888 ...
Open the app in your browser and start annotating!
  1. "Open Prodigy" once again brings me back to "localhost refused to connect".

PRODIGY_CONFIG_OVERRIDES='{"host": "*"}' PRODIGY_PORT=8888 python3 -m prodigy ner.manual ner_test blank:en train.jsonl --label label1,label2,label3
✨  Starting the web server at http://*:8888 ...
Open the app in your browser and start annotating!
  7. Once again "localhost refused to connect" on "Open Prodigy" in Launcher.

I'm completely dumbfounded, frustrated, and feeling kind of hopeless.

Hi @sangersteel , thanks for providing these detailed steps. Let's try debugging a bit further.

Sometimes GCE redirects to HTTPS by default. So here are some things you can try:

  • On #4, you might want to use https instead of http.
  • I'm not sure if there are firewall rules that you set when creating the GCE virtual machine. That might be affecting how you connect to localhost.
  • Again on #4, it might help to specify the exact address, like 0.0.0.0 or 127.0.0.1.

For now, I don't think there's any need to override the host where Prodigy is running (aside from the port). It's just that the JupyterLab extension can't "see" this process.
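One more debugging idea: you can check what the launcher tab actually "sees" by probing the URL from inside the VM. A small sketch (curl works just as well; 8888 assumes the port from your steps above):

```python
import urllib.request
import urllib.error

def probe(url: str, timeout: float = 3.0) -> str:
    """Fetch url and report the outcome, roughly what the
    'Open Prodigy' tab experiences when loading the app."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return f"OK {resp.status}"
    except urllib.error.HTTPError as e:
        return f"HTTP error {e.code}"        # e.g. 403 Forbidden
    except (urllib.error.URLError, OSError) as e:
        return f"connection failed: {e}"     # refused, timeout, ...

print(probe("http://localhost:8888"))
```

If this prints OK 200 while the launcher tab still fails, the problem is in how the extension (or a proxy in front of it) reaches that port, not in Prodigy itself.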

Hey there @ljvmiranda921 , I'm having trouble following your suggestion to use https instead of http, because even if I change the prodigyConfig URL, the web server still starts on http:

✨  Starting the web server at http://0.0.0.0:8888 ...
Open the app in your browser and start annotating!

Besides the issue I just brought up, with the current instructions I'm getting a "0.0.0.0 took too long to respond" error. If I go to http://0.0.0.0:8888 I get a 403 Forbidden, whether I use http or https in #4. If I go to https://0.0.0.0:8888, it says it took too long to respond, whether I use http or https in #4.

Still stuck.

Sorry for only just getting back to this. We have a few hunches about this behaviour:

  • It's possible that Google's managed notebook (Vertex AI) does some port switching for security. We can replicate a similar behaviour in VS Code: even if you configure Prodigy to run on 8080, VS Code will silently tunnel it through 8081, even though the terminal still displays 8080.
  • This other suggestion might require a bit more setup, but we recommend trying it out in a "normal VM" (e.g., Google Compute Engine instead of Vertex AI Workbench) to run Prodigy, if possible. You might need to install jupyter and jupyterlab yourself, but you'd have total control of your environment.

Unfortunately, when managed services are involved, it's hard to get environment parity with something we can reproduce.