Prodigy v1.10 is out now! It includes a bunch of cool new features: dependency and relation annotation, audio and video annotation, an enhanced manual image UI, more settings for NER annotation including character-based highlighting, a cool example workflow for creating training data for fine-tuning transformer models, new recipe callbacks for validating answers at runtime and for modifying examples before they're placed in the database, new settings for UI customisation, and lots more! The new version also updates Prodigy to the latest spaCy v2.3, so you can use the new models for Chinese, Japanese, Danish, Polish and Romanian. You can see the full Prodigy v1.10 changelog here: Changelog · Prodigy · An annotation tool for AI, Machine Learning & NLP
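To give a feel for the new callbacks: here's a minimal sketch of an answer-validation function. In a real recipe you'd return it under the `validate_answer` key of the components dict; the validation logic and error message below are just my own illustration, not from the docs.

```python
# Sketch of a v1.10 "validate_answer" recipe callback. Prodigy calls it with
# each submitted answer; raising an error blocks the submit and shows the
# message to the annotator.

def validate_answer(eg):
    # Illustrative rule (an assumption, not Prodigy's built-in behaviour):
    # reject accepted answers that don't contain any highlighted spans.
    if eg.get("answer") == "accept" and not eg.get("spans"):
        raise ValueError("Accepted example should have at least one span")
```

Because the callback is plain Python, you can unit-test it outside Prodigy before wiring it into a recipe.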
I've also recorded a little video that walks you through the most exciting new features:
Twitter thread:
Thanks to everyone who helped beta test the new features! Your feedback was super valuable.
Thanks to the whole team for this great update and the new features!
I tried to install the new version on Windows, but it seems there's a problem with the spaCy dependency: it requires the previous spaCy version instead of 2.3. From the wheel metadata: Requires-Dist: spacy (<2.3.0,>=2.2.3)
At the moment, a workaround is to install Prodigy first, then force spaCy==2.3, ignoring the incompatibility warning.
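For anyone else hitting this, the workaround looks roughly like this (the wheel filename is just a placeholder for however you normally install Prodigy):

```shell
# 1. Install Prodigy as usual, e.g. from the downloaded wheel
pip install prodigy-1.10.0-cp38-cp38-win_amd64.whl

# 2. Then force the newer spaCy, ignoring pip's incompatibility warning
pip install "spacy==2.3.0"
```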
Thanks for all your great work on this release. I've been waiting for the before_db callback and I'm delighted it's been implemented. I've spotted a little typo in the code example for it here. It should be .startswith(...):
...
if eg["image"].startwith("data:") and "path" in eg:
# ^ missing an 's' here
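For reference, here's the corrected line in context, as a standalone sketch of a `before_db` callback (the surrounding logic is my paraphrase of the docs example: swap base64 data URIs back to file paths before the examples hit the database):

```python
def before_db(examples):
    # Called with the answered examples right before they're saved.
    for eg in examples:
        # If the image was encoded as a base64 data URI and we still know
        # the original file path, store the (much smaller) path instead.
        if eg["image"].startswith("data:") and "path" in eg:
            eg["image"] = eg["path"]
    return examples
```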
There's one other thing I noticed in the textcat.teach docs that I wasn't so sure about; it might be new, or I just never noticed it before. There's a --long-text flag in the example, but no description of it, and it doesn't exist in the prodigy-recipes repo.
Thanks, appreciate the attention to detail! Fixed and should be live in a second.
Ah, sorry for the confusion, I think it ended up this way because we weren't sure whether we'd want to deprecate that feature or not. It's always been a bit experimental and only available in the binary workflow.
The idea is that if you have very long texts, you're often still training your model on shorter fragments like sentences, and then averaging over the predictions to get the score for a whole document. So there's often no real benefit in annotating whole long documents at once: it just takes much longer, and you only get one label per document.

The idea of the --long-text mode is to show you one highlighted sentence at a time and collect feedback on that, focusing on the most uncertain scores. If you're training with Prodigy and setting the --binary flag, you should be able to update the model from those annotations. However, I'm not sure it's actually better than a more transparent approach where you split up the sentences beforehand.