Thanks for the report! Prodigy’s database handling is powered by the peewee
module, which should hopefully make this easier to debug.
The model size having an impact is pretty interesting… one possible explanation could be that the database connection times out while the model is loaded, so the subsequent calls fail (which is weird and possibly fixable). To test this, you could try editing `prodigy/recipes/ner.py` and moving the calls to the DB further up in the recipe, so that they're made before the model is loaded:
```python
from prodigy.components.db import connect

DB = connect()  # connect using the settings in your prodigy.json
examples = DB.get_dataset(dataset)
task_hashes = DB.get_task_hashes(dataset)
# load the model and do everything else afterwards
```
Btw, speaking of Prodigy's PostgreSQL integration: it might not be relevant to this problem, but maybe you'll find it useful later on. This thread shows an example of connecting to a remote PostgreSQL database by creating the peewee database manually and passing it into Prodigy's `Database`. It also uses peewee's playhouse extension, which comes with additional helpers for PostgreSQL.
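A minimal sketch of that approach could look like the following – the database name, user, password and host are just placeholders for your own connection details:

```python
from playhouse.postgres_ext import PostgresqlExtDatabase
from prodigy.components.db import Database

# Placeholder connection details – replace with your own
psql_db = PostgresqlExtDatabase(
    "my_database", user="postgres", password="xxx", host="localhost"
)
# Wrap the peewee database in Prodigy's Database class
db = Database(psql_db, "postgresql", "Custom PostgreSQL Database")
```

You can then make a custom recipe return that `db` as its `"db"` component, so Prodigy will use it instead of the default connection.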
The upcoming version of Prodigy will include some improvements to the database connection handling, which might also help with this problem. And, finally, we've never really been happy with the way `ner.make-gold` works (e.g. requiring a raw dataset and making several passes over the data). So in the upcoming version, the current `ner.make-gold` recipe will be replaced with a more convenient version that uses the `ner_manual` interface to create gold-standard data faster by correcting the model's predictions.
Edit: Forgot to add another debugging tip. In case you haven't seen it already, you can also run all Prodigy recipes and commands with the `PRODIGY_LOGGING=basic` or `PRODIGY_LOGGING=verbose` environment variable. This will log everything that's going on, including the database stuff.
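For example (the dataset, model and source file names here are just placeholders):

```bash
PRODIGY_LOGGING=basic prodigy ner.make-gold your_dataset en_core_web_sm your_data.jsonl
```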