When you set PRODIGY_LOGGING=basic, is there anything in the logs that looks relevant? If you end up with no examples in the stream, this typically means that all examples were skipped, either because they're already annotated in the dataset, or because they're invalid for some other reason (invalid JSON, no "text").
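If you want to rule out the "invalid for some other reason" case before starting the server, a quick check along these lines might help. It's only a minimal sketch: find_skippable is a hypothetical helper (not part of Prodigy), and it assumes your input is newline-delimited JSON where every example needs a non-empty "text" key, which won't hold for image or audio tasks, for example.

```python
import json


def find_skippable(lines, required_key="text"):
    """Return (line_number, reason) pairs for lines that would likely be skipped.

    Hypothetical helper for illustration; not part of Prodigy.
    """
    problems = []
    for i, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        try:
            record = json.loads(line)
        except json.JSONDecodeError as err:
            problems.append((i, f"invalid JSON: {err.msg}"))
            continue
        # Assumption: each example is an object with a non-empty value here
        if not isinstance(record, dict) or not record.get(required_key):
            problems.append((i, f'missing or empty "{required_key}"'))
    return problems


sample = [
    '{"text": "A valid example"}',
    '{"text": "Broken JSON"',
    '{"label": "NO_TEXT"}',
]
for line_number, reason in find_skippable(sample):
    print(line_number, reason)
```

To run it on a real file, you could pass an open file handle instead of the sample list, since it only needs an iterable of lines.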
Also double-check that your stream generator doesn't get stuck in an infinite loop or similar by accident (bugs here can sometimes be pretty subtle), and if you're using PyTorch, check that PyTorch doesn't spawn multiple threads under the hood. (If it does, try moving the stream processing logic into a separate Python script and pipe the JSON output forward, so you can ensure it runs in the main thread.)
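Here's roughly what the separate script could look like. This is a sketch under assumptions: load_examples is a hypothetical placeholder for wherever your model-based preprocessing currently happens, and the script simply writes one JSON object per line to standard output.

```python
import json
import sys


def load_examples():
    # Hypothetical placeholder: put your actual stream logic
    # (including any PyTorch preprocessing) here.
    yield {"text": "First example"}
    yield {"text": "Second example"}


def write_jsonl(examples, out=sys.stdout):
    """Write one JSON object per line and return how many were written."""
    count = 0
    for eg in examples:
        out.write(json.dumps(eg) + "\n")
        count += 1
    out.flush()
    return count


if __name__ == "__main__":
    write_jsonl(load_examples())
```

You'd then pipe the script's output into the recipe and set the source argument to -, which as far as I know tells Prodigy to read the stream from standard input. It's worth confirming that against the docs for the recipe you're using.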
Just a quick note: the update callback may end up being tricky to implement with large transformer models. The updating itself can be a bit slow (especially on CPU), and the models usually expect larger batch sizes and don't always respond well to small individual batch updates.
So it might turn out that a better workflow for transformers in the loop is to annotate ~100 examples, train, load the new model in, annotate another ~100 examples, and so on.
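The shape of that workflow could be sketched like this. Note that annotate and train are hypothetical callables you'd supply yourself (in practice, an annotation session and a separate training run), so this only illustrates the batching logic, not any Prodigy or spaCy API.

```python
from itertools import islice


def batches(stream, size=100):
    """Yield successive lists of up to `size` items from any iterable."""
    iterator = iter(stream)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch


def annotate_train_loop(stream, annotate, train, model, size=100):
    """Alternate between annotating a batch and retraining on it.

    `annotate(batch, model)` and `train(model, annotated)` are hypothetical
    placeholders for the manual annotation step and the training step.
    """
    for batch in batches(stream, size):
        annotated = annotate(batch, model)  # e.g. one annotation session
        model = train(model, annotated)     # e.g. one full training run
    return model
```

The point of the design is that the model only changes between sessions, so each training run sees a reasonably sized batch instead of a handful of examples at a time.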
We already have a nightly pre-release out that you can try: ✨ Prodigy nightly: spaCy v3 support, UI for overlapping spans & more. We're hoping to have the stable release ready within the next few weeks. The main feature holding it up was improved support in spaCy v3 for binary annotations and learning from "negative examples" (see this PR).