✨ Prodigy nightly: spaCy v3 support, UI for overlapping spans, improved feeds & more

As mentioned in this thread, now that spaCy v3.0 is out, we can start testing the new version of Prodigy that integrates with it :tada:

If you want to be among the first to test new cutting-edge features, you can now join the Prodigy nightly program! It's open to all users of the latest version, v1.10.x. The download is set up through our online shop, so once you're added, you'll receive a free "order" of the nightly release. Whenever a new nightly is available, you'll receive an email notification. Feel free to post any questions in this thread.

:point_right: Apply for the nightly program here :point_left:

Disclaimer: Keep in mind that it's a pre-release, so it can include breaking changes and is not designed for production use. Even though none of the changes affect the database, it's always a good idea to back up your work. Also, don't forget to use a fresh virtual environment!

New features included in the release

New train, train-curve and data-to-spacy commands for spaCy v3

All training and data conversion workflows have been updated to support the new spaCy v3.0. The training commands can now train multiple components at the same time and will take care of merging annotations on the same input data. You can now also specify different evaluation datasets per task using the eval: prefix – for instance --ner my_train_set,eval:my_eval_set will train the named entity recognizer on my_train_set and evaluate it on my_eval_set. If no evaluation dataset is provided, a percentage of the examples (for the given component) is held back automatically.
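To make the eval: prefix and the automatic hold-back more concrete, here's a minimal Python sketch of the idea. It's not Prodigy's actual implementation: the function names split_dataset_arg and hold_back are made up for illustration, and the 20% split and fixed seed are assumptions rather than documented defaults.

```python
import random
from typing import List, Sequence, Tuple


def split_dataset_arg(value: str) -> Tuple[List[str], List[str]]:
    """Split a value like 'a,b,eval:c' into training and evaluation dataset names.

    Hypothetical helper that only illustrates the syntax described above.
    """
    train_sets, eval_sets = [], []
    for name in value.split(","):
        name = name.strip()
        if not name:
            continue
        if name.startswith("eval:"):
            eval_sets.append(name[len("eval:"):])
        else:
            train_sets.append(name)
    return train_sets, eval_sets


def hold_back(examples: Sequence, eval_split: float = 0.2, seed: int = 0) -> Tuple[list, list]:
    """Shuffle the examples and hold back a fraction of them for evaluation.

    The 0.2 default is an assumption made for this sketch.
    """
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    n_eval = int(len(shuffled) * eval_split)
    return shuffled[n_eval:], shuffled[:n_eval]


train_sets, eval_sets = split_dataset_arg("my_train_set,eval:my_eval_set")
print(train_sets, eval_sets)

# No eval: dataset given, so part of the training examples is held back instead
train_examples, eval_examples = hold_back(range(10))
print(len(train_examples), len(eval_examples))
```

If a component gets an eval: dataset, that dataset is used for evaluation; otherwise something along the lines of hold_back happens for that component's examples.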

data-to-spacy now takes an output directory and generates everything you need to train a pipeline with spaCy: the data (in spaCy's efficient binary format), a config and even the data to initialize the labels, which can significantly speed up the training process. So once you're ready to scale up your training, you can export your data this way and train with spaCy directly.

The train-curve command now also supports multiple components and lets you set --show-plot to print a visual representation of the curve in your terminal (requires the plotext library to be installed).

Under the hood, Prodigy now includes custom corpus readers for loading and merging annotations from Prodigy datasets. Those will be added to the training config when you train with Prodigy, which makes it really easy to run quick experiments, without having to export your data. The prodigy spacy-config command generates this config, so you can also use it as a standalone command if you want to. (Pro tip: setting the output to - will write to stdout, and spacy train supports reading configs from stdin. So you can also just pipe the config forward to spaCy if you want!)
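If you'd rather script the piping from Python than from the shell, a rough sketch could look like this. The dataset name is a placeholder, and the --ner argument on spacy-config is an assumption based on the training commands above, so check prodigy spacy-config --help for the exact arguments in your nightly.

```python
import subprocess
import sys
from typing import List, Tuple


def build_pipe_commands(dataset: str) -> Tuple[List[str], List[str]]:
    """Build the two commands: one writes the config to stdout, one reads it from stdin."""
    # "-" as the output writes the generated config to stdout.
    # Assumption: spacy-config accepts --ner like the train command does.
    make_config = [sys.executable, "-m", "prodigy", "spacy-config", "-", "--ner", dataset]
    # "-" as the config path makes spaCy read the config from stdin.
    train = [sys.executable, "-m", "spacy", "train", "-"]
    return make_config, train


def run_pipe(dataset: str) -> int:
    """Pipe the generated config straight into spacy train and return its exit code."""
    make_config, train = build_pipe_commands(dataset)
    producer = subprocess.Popen(make_config, stdout=subprocess.PIPE)
    consumer = subprocess.Popen(train, stdin=producer.stdout)
    # Close our copy of the pipe so the producer sees a broken pipe if training exits early
    producer.stdout.close()
    consumer.wait()
    producer.wait()
    return consumer.returncode


if __name__ == "__main__":
    # "my_train_set" is a placeholder dataset name
    raise SystemExit(run_pipe("my_train_set"))
```

Since the generated config refers to Prodigy's corpus readers, this needs to run in the same environment that Prodigy is installed in.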

:open_book: For documentation and available arguments, run the command name with --help, e.g. prodigy train --help.

spans.manual and UI for annotating overlapping and nested spans

We've also shipped a preview of the new span annotation UI that lets you label any number of potentially overlapping and nested spans. You can use it via the spans.manual recipe. (It's separate from the NER labelling workflows because the data you create with it can't be used to train a regular named entity recognizer: those model implementations typically predict a single tag per token. But in the future, spaCy will provide a SpanCategorizer component for predicting arbitrary spans!)
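To illustrate why nested spans don't fit a regular named entity recognizer, here's a small Python sketch. The example text and labels are made up, and the span dictionaries follow the general shape of Prodigy's span annotations (character offsets plus inclusive token indices), so treat the exact keys as an assumption for this preview. The helper functions are hypothetical and only show the problem.

```python
from typing import Dict, List, Tuple

# A made-up example with one span nested inside another
task = {
    "text": "New York City Marathon",
    "tokens": ["New", "York", "City", "Marathon"],
    "spans": [
        {"start": 0, "end": 13, "token_start": 0, "token_end": 2, "label": "GPE"},
        {"start": 0, "end": 22, "token_start": 0, "token_end": 3, "label": "EVENT"},
    ],
}


def overlapping_pairs(spans: List[Dict]) -> List[Tuple[str, str]]:
    """Return the label pairs of all spans that share at least one token."""
    pairs = []
    for i, a in enumerate(spans):
        for b in spans[i + 1:]:
            if a["token_start"] <= b["token_end"] and b["token_start"] <= a["token_end"]:
                pairs.append((a["label"], b["label"]))
    return pairs


def to_bio_tags(n_tokens: int, spans: List[Dict]) -> List[str]:
    """Encode spans as one BIO tag per token, failing if a token is claimed twice."""
    tags = ["O"] * n_tokens
    for span in spans:
        for i in range(span["token_start"], span["token_end"] + 1):
            if tags[i] != "O":
                raise ValueError(f"Token {i} already has tag {tags[i]}")
            prefix = "B-" if i == span["token_start"] else "I-"
            tags[i] = prefix + span["label"]
    return tags


print(overlapping_pairs(task["spans"]))

# A single span can be encoded as per-token tags without problems
print(to_bio_tags(len(task["tokens"]), task["spans"][:1]))

# Both spans together can't, because "New", "York" and "City" would each need two tags
try:
    to_bio_tags(len(task["tokens"]), task["spans"])
except ValueError as err:
    print("Can't encode as single tags:", err)
```

A tagger that predicts one tag per token has no way to represent the second span, which is why this kind of data needs a component that scores spans directly.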

New feed implementation to better support multi-user sessions

This is mostly internals, but the nightly also ships with a new implementation of the internal logic that orchestrates the streams of examples, and improves support for multi-user sessions. It's enabled by default (and can be disabled by setting PRODIGY_LEGACY=1 if you end up having problems). So if you're running workflows with multi-user sessions, give it a go and let us know whether everything works as expected.

Future updates and todos

  • Include per-label scores in training logs. This is no problem in spaCy v3 because the logging is fully customizable – the main question is whether this feature should live in spaCy or Prodigy.
  • Some cool new workflows using beam search for NER and transformer-based pipelines – this is all much easier in spaCy v3, so there's a lot to explore. Some ideas include: visualizing multiple layers of the beam with heatmap-style colours so you can see the possible scored predictions, or automatically including high-confidence predictions in the dataset (with occasional checks to see if the threshold is okay)... Maybe you have some cool ideas as well! :smiley: