Prodigy hangs when streaming into rel.manual

I have a StreamManager that loads data from a Prodigy dataset backed by Postgres. It works fine for spans, but when I switch over to relationships I get odd behavior...

In both cases I have a wrapper recipe. The spans example builds the stream and then returns:

    prodigy.recipes.spans.manual(dataset, spacy_model, source=wrapped_stream, label=labels)

The relations example works the same way but returns:

    prodigy.recipes.rel.manual(dataset, spacy_model, source=wrapped_stream, label=labels, span_label=span_labels, wrap=wrap)

In the spans case everything works as expected: I see the startup message on localhost:8080 and I can interact with the streaming data. In the relationships case, however, it is completely silent: there is no startup message and I cannot access the interface on localhost:8080. As a workaround, I can instead directly return:

    return {
        "view_id": "relations",
        "dataset": dataset,
        "stream": wrapped_stream,
        "exclude": [],
        "config": {
            "lang": "en",
            "labels": labels,
            "relations_span_labels": span_labels,
            "exclude_by": "input",
            "wrap_relations": wrap,
            "custom_theme": {"cardMaxWidth": "90%"},
            "hide_relations_arrow": False,
            "auto_count_stream": True,
        }
    }

In this case I get the startup message and can open localhost:8080 but it just hangs saying Loading Data...

Looking in the StreamManager logs, I see that examples are being pulled and yielded as expected. A sample is:

{
    'text': 'This is a test dataset. This is the first page. Here is a thing you can tag.',
    'spans': [],
    'meta': {'source': 'test'},
    '_input_hash': 8634462510954551304,
    '_task_hash': -2938045457543123250,
    'relations': [],
    '_view_id': 'relations'
}

(for comparison, the example in the spans case is the same except without the relations field).

There does seem to be a lot of peewee chatter when I start up the relations example, but it doesn't look appreciably different from the spans startup logging:

Defaulted container "prodigy-rel" out of: prodigy-rel, check-db-ready (init)
MANUAL REL LABEL
Database ps_db already exists
ANNOTATED DATASET NAME bio_events_annotated_rel
INPUT DATASET NAME bio_events_input_rel
INFO:__main__:Updated meta for bio_events_input_rel: {'input': True, 'annotated': False, 'spans': False, 'rel': True, 'annotated_dataset': 'bio_events_annotated_rel', 'created': datetime.datetime(2024, 12, 7, 1, 20, 7)}
INFO:__main__:Available datasets: ['bio_events_annotated_rel', 'bio_events_annotated_spans', 'bio_events_input_spans', 'bio_events_input_rel']
INFO:__main__:Updated meta for bio_events_annotated_rel: {'input': False, 'annotated': True, 'spans': False, 'rel': True, 'input_dataset': 'bio_events_input_rel', 'created': datetime.datetime(2024, 12, 6, 19, 27, 16)}
INFO:__main__:Available datasets: ['bio_events_annotated_rel', 'bio_events_annotated_spans', 'bio_events_input_spans', 'bio_events_input_rel']
DEBUG:custom_code:Starting recipe execution
DEBUG:custom_code:RECPIE CONFIG
DEBUG:custom_code:  DATASET: bio_events_annotated_rel
DEBUG:custom_code:  MODEL: blank:en
DEBUG:custom_code:  SOURCE: stream_manager.get_stream
DEBUG:custom_code:  SOURCE ARGS: {"dataset_name": "bio_events_input_rel","control_port": 8092}
DEBUG:custom_code:  LABELS: ['THEME', 'CAUSE']
DEBUG:custom_code:  SPAN LABELS: ['GGP', 'GENE_EXPR', 'TRANSCR', 'PROT_CAT', 'PHOSPH', 'LOC', 'BIND', 'REG', 'REG+', 'REG-']
DEBUG:custom_code:  WRAP: True
DEBUG:custom_code:Processing source: {source}
INFO:stream_manager.stream_manager:Using control port: 8092
DEBUG:peewee:('SELECT tablename FROM pg_catalog.pg_tables WHERE schemaname = %s ORDER BY tablename', ('public',))
DEBUG:peewee:('CREATE TABLE IF NOT EXISTS "example" ("id" SERIAL NOT NULL PRIMARY KEY, "input_hash" BIGINT NOT NULL, "content" BYTEA NOT NULL, "task_hash" BIGINT NOT NULL)', [])
DEBUG:peewee:('CREATE TABLE IF NOT EXISTS "link" ("id" SERIAL NOT NULL PRIMARY KEY, "example_id" INTEGER NOT NULL, "dataset_id" INTEGER NOT NULL, FOREIGN KEY ("example_id") REFERENCES "example" ("id"), FOREIGN KEY ("dataset_id") REFERENCES "dataset" ("id"))', [])
DEBUG:peewee:('CREATE INDEX IF NOT EXISTS "link_example_id" ON "link" ("example_id")', [])
DEBUG:peewee:('CREATE INDEX IF NOT EXISTS "link_dataset_id" ON "link" ("dataset_id")', [])
DEBUG:peewee:('CREATE TABLE IF NOT EXISTS "structured_input" ("id" SERIAL NOT NULL PRIMARY KEY, "hash" BIGINT NOT NULL, "content" BYTEA NOT NULL, "created" TIMESTAMP NOT NULL)', [])
DEBUG:peewee:('CREATE UNIQUE INDEX IF NOT EXISTS "structuredinputmodel_hash" ON "structured_input" ("hash")', [])
DEBUG:peewee:('CREATE TABLE IF NOT EXISTS "structured_example" ("id" SERIAL NOT NULL PRIMARY KEY, "task_hash" BIGINT NOT NULL, "answer" VARCHAR(6) NOT NULL, "content" BYTEA NOT NULL, "input_id" INTEGER NOT NULL, "created" TIMESTAMP NOT NULL, FOREIGN KEY ("input_id") REFERENCES "structured_input" ("id"))', [])
DEBUG:peewee:('CREATE INDEX IF NOT EXISTS "structuredexamplemodel_task_hash" ON "structured_example" ("task_hash")', [])
DEBUG:peewee:('CREATE INDEX IF NOT EXISTS "structuredexamplemodel_answer" ON "structured_example" ("answer")', [])
DEBUG:peewee:('CREATE INDEX IF NOT EXISTS "structuredexamplemodel_input_id" ON "structured_example" ("input_id")', [])
DEBUG:peewee:('CREATE TABLE IF NOT EXISTS "structured_link" ("id" SERIAL NOT NULL PRIMARY KEY, "example_id" INTEGER NOT NULL, "dataset_id" INTEGER NOT NULL, "session_id" VARCHAR(255) NOT NULL, "created" TIMESTAMP NOT NULL, FOREIGN KEY ("example_id") REFERENCES "structured_example" ("id"), FOREIGN KEY ("dataset_id") REFERENCES "dataset" ("id"))', [])
DEBUG:peewee:('CREATE INDEX IF NOT EXISTS "structuredlinkmodel_example_id" ON "structured_link" ("example_id")', [])
DEBUG:peewee:('CREATE INDEX IF NOT EXISTS "structuredlinkmodel_dataset_id" ON "structured_link" ("dataset_id")', [])
DEBUG:peewee:('SELECT "t1"."id", "t1"."name", "t1"."created", "t1"."meta", "t1"."session" FROM "dataset" AS "t1" WHERE ("t1"."name" = %s) LIMIT %s OFFSET %s', ['bio_events_input_rel', 1, 0])
DEBUG:peewee:('SELECT "t1"."id", "t1"."name", "t1"."created", "t1"."meta", "t1"."session" FROM "dataset" AS "t1" WHERE ("t1"."name" = %s) LIMIT %s OFFSET %s', ['bio_events_input_rel', 1, 0])
INFO:stream_manager.stream_manager:STARTING CONTROL SERVER WITH MANAGER DB
INFO:stream_manager.stream_manager:  DETECTED DBNAME: ps_db
INFO:stream_manager.stream_manager:  DETECTED CONN PARAMS: {'host': 'postgres.default', 'user': 'ps_user', 'password': 'ps_pass', 'port': 5432}
INFO:stream_manager.stream_manager:Control server started on port 8092
DEBUG:asyncio:Using selector: EpollSelector

In the console logs I observe a difference. The spans example doesn't show any errors, but the relationships example gives:

POST http://localhost:8080/get_session_questions
Uncaught (in promise) Error: SyntaxError: JSON.parse: unexpected character at line 1 column 1 of the JSON data

Since I don't see any errors from Prodigy in the logs, I'm at a loss for where to even begin searching for the source of the problem. The JavaScript console gives some indication that there is malformed JSON somewhere, but the data coming out of the stream seems to be fine (I've confirmed that it's JSON-serializable).
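One way to double-check serializability end-to-end is to wrap the stream with a shim that tries json.dumps on every example before yielding it. This is a hypothetical debugging helper, not part of Prodigy; note that e.g. a datetime object hiding in meta (like the ones in my dataset meta logs above) would break serialization:

```python
import datetime
import json

def check_serializable(stream):
    """Yield examples from a stream, failing loudly if any example
    is not JSON-serializable. Debugging shim only."""
    for i, eg in enumerate(stream):
        try:
            json.dumps(eg)
        except (TypeError, ValueError) as err:
            raise ValueError(f"Example {i} is not JSON-serializable: {err}") from err
        yield eg

# A datetime sneaking into meta is a typical way serialization breaks.
good = [{"text": "ok", "meta": {"source": "test"}}]
bad = [{"text": "ok", "meta": {"created": datetime.datetime(2024, 12, 7)}}]

assert list(check_serializable(good)) == good
try:
    list(check_serializable(bad))
except ValueError as e:
    print("caught:", e)
```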

Update: If I reduce batch_size from 10 to 1 in prodigy.json, then the console error occurs much more rapidly. It seems like perhaps the system was waiting for 10 items to come out of the stream before proceeding (for testing it only has 2). But in the end the page still crashes with the above console message.
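That buffering behaviour can be illustrated with a toy batcher (a sketch, not Prodigy's actual implementation): a consumer that fills fixed-size batches only emits a short batch once the underlying stream is exhausted, so a 2-item test stream with batch_size=10 sits waiting.

```python
from itertools import islice

def batched(stream, batch_size):
    """Toy batcher: collect up to batch_size items before yielding a batch.
    Illustrates why a 2-item stream with batch_size=10 only produces a
    (short) batch once the stream ends, while batch_size=1 yields immediately."""
    it = iter(stream)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return
        yield batch

examples = [{"text": "a"}, {"text": "b"}]
print(list(batched(examples, 10)))  # [[{'text': 'a'}, {'text': 'b'}]]
print(list(batched(examples, 1)))   # [[{'text': 'a'}], [{'text': 'b'}]]
```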

So I can resolve the JSON error and get the page up and running. It turns out I needed to tokenize the data coming out of the stream and include it in the 'tokens' field of the example. I was expecting this to be done by the model passed to prodigy.recipes.rel.manual.
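For reference, this is the kind of shape the tokens field needs: a list of dicts with text, character offsets, and a token id. The whitespace tokenizer below is only a minimal sketch to show the field layout; in practice you'd want real spaCy tokenization (e.g. via Prodigy's add_tokens preprocessor) so that token boundaries match the model:

```python
def add_token_field(eg):
    """Add a Prodigy-style "tokens" field using naive whitespace
    tokenization. Sketch only -- use spaCy tokenization in practice so
    token boundaries line up with spans produced by the model."""
    tokens = []
    offset = 0
    text = eg["text"]
    for i, word in enumerate(text.split()):
        start = text.index(word, offset)
        end = start + len(word)
        tokens.append({"text": word, "start": start, "end": end, "id": i})
        offset = end
    eg["tokens"] = tokens
    return eg

eg = add_token_field({"text": "Here is a thing", "spans": [], "relations": []})
print(eg["tokens"][0])  # {'text': 'Here', 'start': 0, 'end': 4, 'id': 0}
```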

Curiously, while I can get the page to run, I have to respond directly with the dictionary above; I can't first make a call to prodigy.recipes.rel.manual and return that dictionary. If I do that, then the server is never launched. I can see the prodigy command with ps aux | grep prodigy, but the port isn't bound.

Thanks for the detailed report! This is definitely curious and I probably need to take some time to try and reproduce it. My first thought was that the core of the problem might be something related to the nested generators. Is there any chance that your StreamManager starts multiple threads?

Where exactly does it error in the code? I wonder if something happens before the recipe preprocessing runs because preprocess_stream in rel.manual should definitely take care of adding the required "tokens" and constructing the example, so it's very mysterious that this fails :thinking:

Is there any chance that your StreamManager starts multiple threads?

The StreamManager does spawn an additional thread for a FastAPI control server. I don't think it's the root of the issue, since this works fine with the spans.manual recipe. The threads are not involved in the generators.
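For completeness: if the control-server thread ever did touch the generator, a lock around next() would rule out interleaved calls. A generic sketch, nothing Prodigy-specific:

```python
import threading

class LockedStream:
    """Wrap an iterator so concurrent next() calls can't interleave.
    Hypothetical debugging wrapper to rule out cross-thread access."""

    def __init__(self, gen):
        self._gen = gen
        self._lock = threading.Lock()

    def __iter__(self):
        return self

    def __next__(self):
        with self._lock:
            return next(self._gen)

stream = LockedStream(iter([{"text": "a"}, {"text": "b"}]))
print(list(stream))  # [{'text': 'a'}, {'text': 'b'}]
```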

Where exactly does it error in the code? I wonder if something happens before the recipe preprocessing runs because preprocess_stream in rel.manual should definitely take care of adding the required "tokens" and constructing the example, so it's very mysterious that this fails :thinking:

The error appears in the JavaScript console in my web browser, at the get_session_questions call in bundle.js; there are no errors in the Python logs. Furthermore, this only occurs if I launch Prodigy using a manually constructed dictionary. If I call prodigy.recipes.rel.manual, then the server never launches; it just hangs, and debug logs show only the usual peewee chatter. The stack trace in the console is opaque, to say the least:

Uncaught (in promise) Error: SyntaxError: JSON.parse: unexpected character at line 1 column 1 of the JSON data
    throwError http://localhost:8082/bundle.js:308
    defaultError http://localhost:8082/bundle.js:308
    updateQueue http://localhost:8082/bundle.js:308
    promise callback*bundle.js/updateQueue/< http://localhost:8082/bundle.js:308
    getProject http://localhost:8082/bundle.js:308
    promise callback*bundle.js/getProject/< http://localhost:8082/bundle.js:308
    _Main http://localhost:8082/bundle.js:308
    Ii http://localhost:8082/bundle.js:49
    hj http://localhost:8082/bundle.js:49
    Vk http://localhost:8082/bundle.js:49
    Uk http://localhost:8082/bundle.js:49
    Tk http://localhost:8082/bundle.js:49
    Ik http://localhost:8082/bundle.js:49
    Gk http://localhost:8082/bundle.js:49
    J2 http://localhost:8082/bundle.js:41
    R2 http://localhost:8082/bundle.js:41
    js http://localhost:8082/bundle.js:41
    js http://localhost:8082/bundle.js:41
    __require http://localhost:8082/bundle.js:1
    <anonymous> http://localhost:8082/bundle.js:310
bundle.js:308:135045
    throwError http://localhost:8082/bundle.js:308
    defaultError http://localhost:8082/bundle.js:308
    updateQueue http://localhost:8082/bundle.js:308
    (Async: promise callback)
    updateQueue http://localhost:8082/bundle.js:308
    getProject http://localhost:8082/bundle.js:308
    (Async: promise callback)
    getProject http://localhost:8082/bundle.js:308
    _Main http://localhost:8082/bundle.js:308
    Ii http://localhost:8082/bundle.js:49
    hj http://localhost:8082/bundle.js:49
    Vk http://localhost:8082/bundle.js:49
    Uk http://localhost:8082/bundle.js:49
    Tk http://localhost:8082/bundle.js:49
    Ik http://localhost:8082/bundle.js:49
    Gk http://localhost:8082/bundle.js:49
    J2 http://localhost:8082/bundle.js:41
    R2 http://localhost:8082/bundle.js:41
    (Async: EventHandlerNonNull)
    js http://localhost:8082/bundle.js:41
    js http://localhost:8082/bundle.js:41
    __require http://localhost:8082/bundle.js:1
    <anonymous> http://localhost:8082/bundle.js:310

Okay, so it definitely seems like the web app is receiving malformed JSON, or no JSON at all while updating the queue and fetching a batch of examples. Which is confusing, because there seems to be no corresponding back-end error.

I wonder if this could still be relevant and there's maybe a race condition, which is why you don't see it in spans.manual. Ultimately, the recipes are quite similar, only that rel.manual performs slightly more computation (constructing matchers, matching etc.) so this could make a difference here. And there may be a back-end error, which is swallowed because it occurs in the other thread.
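The "swallowed because it occurs in the other thread" scenario is easy to reproduce in plain Python: an exception raised inside a threading.Thread never propagates to join(), so from the caller's perspective the work just silently stops. (Generic sketch, not Prodigy internals.)

```python
import threading

results = []

def worker():
    results.append("started")
    raise RuntimeError("dies mid-stream")

t = threading.Thread(target=worker)
t.start()
t.join()  # returns normally: the exception does NOT propagate here
# By default threading.excepthook prints the traceback to stderr, which
# can easily be lost if stderr isn't captured by your logging setup.
print(results)  # ['started']
```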

Because what you describe does remind me of a case where PyTorch was spawning threads during prediction in the stream, and it's the only time I remember seeing this sort of very curious behaviour.

Is there a way you could try it without multi-threading, just to test if that changes anything?