I have a StreamManager that loads data from a Prodigy dataset backed by Postgres. It works fine for spans, but when I switch over to relationships I get odd behavior.
In both cases I have a wrapper recipe. The spans example builds the stream and then returns:
prodigy.recipes.spans.manual(dataset, spacy_model, source=wrapped_stream, label=labels)
The relations example works the same way but returns:
prodigy.recipes.rel.manual(dataset, spacy_model, source=wrapped_stream, label=labels, span_label=span_labels, wrap=wrap)
In the spans case everything works as expected: I see the startup message on localhost:8080 and I can interact with the streaming data. In the relationships case, however, it is completely silent: there is no startup message and I cannot access the interface on localhost:8080. In the relationships case, I can instead directly return:
{
    "view_id": "relations",
    "dataset": dataset,
    "stream": wrapped_stream,
    "exclude": [],
    "config": {
        "lang": "en",
        "labels": labels,
        "relations_span_labels": span_labels,
        "exclude_by": "input",
        "wrap_relations": wrap,
        "custom_theme": {"cardMaxWidth": "90%"},
        "hide_relations_arrow": False,
        "auto_count_stream": True,
    }
}
In this case I get the startup message and can open localhost:8080, but it just hangs saying Loading Data...
Looking in the StreamManager logs, I see that examples are being pulled and yielded as expected. A sample is:
{
    'text': 'This is a test dataset. This is the first page. Here is a thing you can tag.',
    'spans': [],
    'meta': {'source': 'test'},
    '_input_hash': 8634462510954551304,
    '_task_hash': -2938045457543123250,
    'relations': [],
    '_view_id': 'relations'
}
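For reference, a check along these lines is one way to confirm that a sample like the one above survives a JSON round trip. This is only a minimal sketch using the standard library: check_json_round_trip is a hypothetical helper name, not part of Prodigy or the StreamManager, and it says nothing about how Prodigy itself serializes tasks.

```python
import json

# Sample task copied from the StreamManager logs above.
sample = {
    "text": "This is a test dataset. This is the first page. Here is a thing you can tag.",
    "spans": [],
    "meta": {"source": "test"},
    "_input_hash": 8634462510954551304,
    "_task_hash": -2938045457543123250,
    "relations": [],
    "_view_id": "relations",
}


def check_json_round_trip(task):
    """Hypothetical helper: True if the task serializes to JSON and parses back unchanged."""
    try:
        encoded = json.dumps(task)
    except (TypeError, ValueError):
        # json.dumps raises for unsupported types (sets, datetimes, bytes, ...)
        return False
    return json.loads(encoded) == task


round_trip_ok = check_json_round_trip(sample)
print("sample round-trips through JSON:", round_trip_ok)
```

A check like this only covers the task as yielded by the stream. It does not cover whatever the server sends back to the browser.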
(For comparison, the example in the spans case is the same except without the relations field.)
There does seem to be a lot of peewee chatter when I start up the relations example, but it doesn't seem appreciably different from the spans startup logging:
Defaulted container "prodigy-rel" out of: prodigy-rel, check-db-ready (init)
MANUAL REL LABEL
Database ps_db already exists
ANNOTATED DATASET NAME bio_events_annotated_rel
INPUT DATASET NAME bio_events_input_rel
INFO:__main__:Updated meta for bio_events_input_rel: {'input': True, 'annotated': False, 'spans': False, 'rel': True, 'annotated_dataset': 'bio_events_annotated_rel', 'created': datetime.datetime(2024, 12, 7, 1, 20, 7)}
INFO:__main__:Available datasets: ['bio_events_annotated_rel', 'bio_events_annotated_spans', 'bio_events_input_spans', 'bio_events_input_rel']
INFO:__main__:Updated meta for bio_events_annotated_rel: {'input': False, 'annotated': True, 'spans': False, 'rel': True, 'input_dataset': 'bio_events_input_rel', 'created': datetime.datetime(2024, 12, 6, 19, 27, 16)}
INFO:__main__:Available datasets: ['bio_events_annotated_rel', 'bio_events_annotated_spans', 'bio_events_input_spans', 'bio_events_input_rel']
DEBUG:custom_code:Starting recipe execution
DEBUG:custom_code:RECPIE CONFIG
DEBUG:custom_code: DATASET: bio_events_annotated_rel
DEBUG:custom_code: MODEL: blank:en
DEBUG:custom_code: SOURCE: stream_manager.get_stream
DEBUG:custom_code: SOURCE ARGS: {"dataset_name": "bio_events_input_rel","control_port": 8092}
DEBUG:custom_code: LABELS: ['THEME', 'CAUSE']
DEBUG:custom_code: SPAN LABELS: ['GGP', 'GENE_EXPR', 'TRANSCR', 'PROT_CAT', 'PHOSPH', 'LOC', 'BIND', 'REG', 'REG+', 'REG-']
DEBUG:custom_code: WRAP: True
DEBUG:custom_code:Processing source: {source}
INFO:stream_manager.stream_manager:Using control port: 8092
DEBUG:peewee:('SELECT tablename FROM pg_catalog.pg_tables WHERE schemaname = %s ORDER BY tablename', ('public',))
DEBUG:peewee:('CREATE TABLE IF NOT EXISTS "example" ("id" SERIAL NOT NULL PRIMARY KEY, "input_hash" BIGINT NOT NULL, "content" BYTEA NOT NULL, "task_hash" BIGINT NOT NULL)', [])
DEBUG:peewee:('CREATE TABLE IF NOT EXISTS "link" ("id" SERIAL NOT NULL PRIMARY KEY, "example_id" INTEGER NOT NULL, "dataset_id" INTEGER NOT NULL, FOREIGN KEY ("example_id") REFERENCES "example" ("id"), FOREIGN KEY ("dataset_id") REFERENCES "dataset" ("id"))', [])
DEBUG:peewee:('CREATE INDEX IF NOT EXISTS "link_example_id" ON "link" ("example_id")', [])
DEBUG:peewee:('CREATE INDEX IF NOT EXISTS "link_dataset_id" ON "link" ("dataset_id")', [])
DEBUG:peewee:('CREATE TABLE IF NOT EXISTS "structured_input" ("id" SERIAL NOT NULL PRIMARY KEY, "hash" BIGINT NOT NULL, "content" BYTEA NOT NULL, "created" TIMESTAMP NOT NULL)', [])
DEBUG:peewee:('CREATE UNIQUE INDEX IF NOT EXISTS "structuredinputmodel_hash" ON "structured_input" ("hash")', [])
DEBUG:peewee:('CREATE TABLE IF NOT EXISTS "structured_example" ("id" SERIAL NOT NULL PRIMARY KEY, "task_hash" BIGINT NOT NULL, "answer" VARCHAR(6) NOT NULL, "content" BYTEA NOT NULL, "input_id" INTEGER NOT NULL, "created" TIMESTAMP NOT NULL, FOREIGN KEY ("input_id") REFERENCES "structured_input" ("id"))', [])
DEBUG:peewee:('CREATE INDEX IF NOT EXISTS "structuredexamplemodel_task_hash" ON "structured_example" ("task_hash")', [])
DEBUG:peewee:('CREATE INDEX IF NOT EXISTS "structuredexamplemodel_answer" ON "structured_example" ("answer")', [])
DEBUG:peewee:('CREATE INDEX IF NOT EXISTS "structuredexamplemodel_input_id" ON "structured_example" ("input_id")', [])
DEBUG:peewee:('CREATE TABLE IF NOT EXISTS "structured_link" ("id" SERIAL NOT NULL PRIMARY KEY, "example_id" INTEGER NOT NULL, "dataset_id" INTEGER NOT NULL, "session_id" VARCHAR(255) NOT NULL, "created" TIMESTAMP NOT NULL, FOREIGN KEY ("example_id") REFERENCES "structured_example" ("id"), FOREIGN KEY ("dataset_id") REFERENCES "dataset" ("id"))', [])
DEBUG:peewee:('CREATE INDEX IF NOT EXISTS "structuredlinkmodel_example_id" ON "structured_link" ("example_id")', [])
DEBUG:peewee:('CREATE INDEX IF NOT EXISTS "structuredlinkmodel_dataset_id" ON "structured_link" ("dataset_id")', [])
DEBUG:peewee:('SELECT "t1"."id", "t1"."name", "t1"."created", "t1"."meta", "t1"."session" FROM "dataset" AS "t1" WHERE ("t1"."name" = %s) LIMIT %s OFFSET %s', ['bio_events_input_rel', 1, 0])
DEBUG:peewee:('SELECT "t1"."id", "t1"."name", "t1"."created", "t1"."meta", "t1"."session" FROM "dataset" AS "t1" WHERE ("t1"."name" = %s) LIMIT %s OFFSET %s', ['bio_events_input_rel', 1, 0])
INFO:stream_manager.stream_manager:STARTING CONTROL SERVER WITH MANAGER DB
INFO:stream_manager.stream_manager: DETECTED DBNAME: ps_db
INFO:stream_manager.stream_manager: DETECTED CONN PARAMS: {'host': 'postgres.default', 'user': 'ps_user', 'password': 'ps_pass', 'port': 5432}
INFO:stream_manager.stream_manager:Control server started on port 8092
DEBUG:asyncio:Using selector: EpollSelector
In the console logs I observe a difference. The spans example doesn't show any errors, but the relationships example gives:
POST http://localhost:8080/get_session_questions
Uncaught (in promise) Error: SyntaxError: JSON.parse: unexpected character at line 1 column 1 of the JSON data
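One way to see what the frontend is actually receiving would be to replay that request outside the browser and look at the raw body before anything tries to parse it. The following is a minimal sketch using only the standard library. fetch_raw and is_json are hypothetical helper names, and the {"session_id": ...} payload shape is an assumption on my part rather than something confirmed from the Prodigy docs.

```python
import json
import urllib.error
import urllib.request


def fetch_raw(url, payload):
    """POST payload as JSON and return (status, body_text) without parsing the body."""
    data = json.dumps(payload).encode("utf-8")
    request = urllib.request.Request(
        url,
        data=data,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    try:
        with urllib.request.urlopen(request, timeout=10) as response:
            return response.status, response.read().decode("utf-8", errors="replace")
    except urllib.error.HTTPError as err:
        # 4xx/5xx responses still carry a body, which is the interesting part here
        return err.code, err.read().decode("utf-8", errors="replace")


def is_json(text):
    """Return True if text parses as JSON, False otherwise."""
    try:
        json.loads(text)
    except ValueError:
        return False
    return True


# Example usage against the running app (payload shape is assumed):
#   status, body = fetch_raw(
#       "http://localhost:8080/get_session_questions",
#       {"session_id": "debug-session"},
#   )
#   print(status, is_json(body), body[:500])
```

If the body turns out to be plain text or HTML rather than JSON, that would be consistent with the parse failure at line 1 column 1, and the text itself may point to the server-side error.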
Since I don't see any errors from Prodigy in the logs, I'm at a loss for where to even begin searching for the source of the problem. The JavaScript console logs give some indication that there is malformed JSON somewhere, but the data coming out of the stream seems to be OK (I've confirmed that it's JSON serializable).