Edit (2021-08-12): Prodigy v1.11 is out now! See the release notes here: Changelog · Prodigy · An annotation tool for AI, Machine Learning & NLP
As mentioned in this thread, now that spaCy v3.0 is out, we can start testing the new version of Prodigy that integrates with it.
If you want to be among the first to test new cutting-edge features, you can now join the Prodigy nightly program! It's open to all users of the latest version, v1.10.x. The download is set up through our online shop, so once you're added, you'll receive a free "order" of the nightly release. Whenever a new nightly is available, you'll receive an email notification. Feel free to post any questions in this thread.
Apply for the nightly program here
Disclaimer: Keep in mind that it's a pre-release, so it can include breaking changes and is not designed for production use. Even though none of the changes affect the database, it's always a good idea to back up your work. Also, don't forget to use a fresh virtual environment!
New features included in the release
New `train`, `train-curve` and `data-to-spacy` commands for spaCy v3
All training and data conversion workflows have been updated to support the new spaCy v3.0. The training commands can now train multiple components at the same time and will take care of merging annotations on the same input data. You can now also specify different evaluation datasets per task using the `eval:` prefix – for instance, `--ner my_train_set,eval:my_eval_set` will train the named entity recognizer on `my_train_set` and evaluate it on `my_eval_set`. If no evaluation dataset is provided, a percentage of the examples (for the given component) is held back automatically.
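For example, training two components in one run with a dedicated evaluation set for NER might look like this (the dataset names are placeholders, not real datasets):

```bash
# Train an NER component and a text classifier together.
# my_train_set, my_eval_set and my_textcat_set are hypothetical dataset names.
prodigy train ./output_dir \
  --ner my_train_set,eval:my_eval_set \
  --textcat my_textcat_set
```

Because no `eval:` dataset is given for `--textcat`, a percentage of those examples is held back for evaluation automatically.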
`data-to-spacy` now takes an output directory and generates all the data you need to train a pipeline with spaCy, including the annotations (in spaCy's efficient binary format), a config and even the data to initialize the labels, which can significantly speed up the training process. So once you're ready to scale up your training, you can export everything once and run `spacy train` directly on the generated files.
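As a sketch, exporting and then training with spaCy directly could look like this (dataset names are placeholders, and the exact file names in the output directory may differ):

```bash
# Export annotations, config and label data to a directory ...
prodigy data-to-spacy ./corpus --ner my_train_set,eval:my_eval_set

# ... then hand everything over to spaCy for training
python -m spacy train ./corpus/config.cfg \
  --paths.train ./corpus/train.spacy \
  --paths.dev ./corpus/dev.spacy
```

This is handy for running training on a remote machine that only has spaCy installed, since the exported directory is self-contained.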
The `train-curve` command now also supports multiple components and lets you set `--show-plot` to print a visual representation of the curve in your terminal (requires the `plotext` library to be installed).
Under the hood, Prodigy now includes custom corpus readers for loading and merging annotations from Prodigy datasets. These are added to the training config when you train with Prodigy, which makes it really easy to run quick experiments without having to export your data. The `prodigy spacy-config` command generates this config, so you can also use it as a standalone command if you want to. (Pro tip: setting the output to `-` will write to stdout, and `spacy train` supports reading configs from stdin. So you can also just pipe the config forward to spaCy if you want!)
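Putting that pro tip together, the pipe could look something like this (the dataset names are made up, and the exact `spacy-config` arguments are an assumption – check `prodigy spacy-config --help` for the real signature):

```bash
# Write the generated config to stdout and have spaCy read it from stdin
prodigy spacy-config - --ner my_train_set,eval:my_eval_set | python -m spacy train -
```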
For documentation and available arguments, run the command name with `--help`, e.g. `prodigy train --help`.
New `spans.manual` recipe and UI for annotating overlapping and nested spans
We've also shipped a preview of the new span annotation UI that lets you label any number of potentially overlapping and nested spans. You can use it via the `spans.manual` recipe. (It's kept separate from the NER labelling workflows because the data you create with it can't be used to train a regular named entity recognizer – those model implementations typically predict a single token-based tag per token. But in the future, spaCy will provide a `SpanCategorizer` component for predicting arbitrary spans!)
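A call to the recipe might look like this (the dataset name, source file and labels below are invented for illustration):

```bash
# Start the span annotation UI with a blank English tokenizer,
# a local JSONL source file and a set of example labels
prodigy spans.manual my_spans_dataset blank:en ./news_headlines.jsonl --label PERSON,EVENT,DATE
```

Unlike `ner.manual`, the UI here lets highlighted spans overlap and nest freely.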
Future updates and todos
- Include per-label scores in training logs. This is no problem in spaCy v3 because the logging is fully customizable – the main question is whether this feature should live in spaCy or Prodigy.
- Some cool new workflows using beam search for NER and transformer-based pipelines – this is all much easier in spaCy v3, so there's a lot to explore. Some ideas: visualize multiple layers of the beam with heatmap-style colours so you can see the possible scored predictions, or automatically include high-confidence predictions in the dataset (with occasional checks to see if the threshold is okay). Maybe you have some cool ideas as well!