✨ Prodigy nightly: spaCy v3 support, UI for overlapping spans & more

Edit (2021-08-12): Prodigy v1.11 is out now :raised_hands: See the release notes here: Changelog · Prodigy · An annotation tool for AI, Machine Learning & NLP

As mentioned in this thread, now that spaCy v3.0 is out we can start testing the new version of Prodigy that integrates with it :tada:

If you want to be among the first to test new cutting-edge features, you can now join the Prodigy nightly program! It's open to all users of the latest version, v1.10.x. The download is set up through our online shop, so once you're added, you'll receive a free "order" of the nightly release. Whenever a new nightly is available, you'll receive an email notification. Feel free to post any questions in this thread.

:point_right: Apply for the nightly program here :point_left:

Disclaimer: Keep in mind that it's a pre-release, so it can include breaking changes and is not designed for production use. Even though none of the changes affect the database, it's always a good idea to back up your work. Also, don't forget to use a fresh virtual environment!

New features included in the release

New train, train-curve and data-to-spacy commands for spaCy v3

All training and data conversion workflows have been updated to support the new spaCy v3.0. The training commands can now train multiple components at the same time and will take care of merging annotations on the same input data. You can now also specify different evaluation datasets per task using the eval: prefix – for instance --ner my_train_set,eval:my_eval_set will train the named entity recognizer on my_train_set and evaluate it on my_eval_set. If no evaluation dataset is provided, a percentage of the examples (for the given component) is held back automatically.
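
For example, a nightly training run could look roughly like this (the dataset names are made up, and the output directory is the positional argument):

# train an NER component with a dedicated evaluation set (placeholder dataset names)
prodigy train ./output_dir --ner my_train_set,eval:my_eval_set

# train two components at once and hold back 20% of the examples for evaluation
prodigy train ./output_dir --ner my_train_set --textcat my_textcat_set --eval-split 0.2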

data-to-spacy now takes an output directory and generates all data you need to train a pipeline with spaCy, including the data (in spaCy's efficient binary format), a config and even the data to initialize the labels, which can significantly speed up the training process. So once you're ready to scale up your training, you can hand everything straight over to spacy train.
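
For example, exporting and then training with spaCy directly might look roughly like this (the file and directory names are assumptions – check the exported directory and the [paths] section of the generated config):

# export merged training and evaluation data plus a ready-to-use config (placeholder dataset names)
prodigy data-to-spacy ./corpus --ner my_train_set,eval:my_eval_set

# train with spaCy directly, using the generated config
python -m spacy train ./corpus/config.cfg --output ./output_dir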

The train-curve command now also supports multiple components and lets you set --show-plot to print a visual representation of the curve in your terminal (requires the plotext library to be installed).
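
For instance, with a hypothetical dataset my_train_set:

pip install plotext
prodigy train-curve --ner my_train_set --show-plot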

Under the hood, Prodigy now includes custom corpus readers for loading and merging annotations from Prodigy datasets. Those will be added to the training config when you train with Prodigy, which makes it really easy to run quick experiments, without having to export your data. The prodigy spacy-config command generates this config, so you can also use it as a standalone command if you want to. (Pro tip: setting the output to - will write to stdout, and spacy train supports reading configs from stdin. So you can also just pipe the config forward to spaCy if you want!)
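
A minimal sketch of that piping trick (the dataset name is hypothetical, and I'm assuming the output path is the first positional argument, as with the other commands):

# write the generated config to stdout and feed it straight into spacy train
prodigy spacy-config - --ner my_train_set | python -m spacy train - --output ./output_dir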

:open_book: For documentation and available arguments, run the command name with --help, e.g. prodigy train --help.

spans.manual and UI for annotating overlapping and nested spans

We've also shipped a preview of the new span annotation UI that lets you label any number of potentially overlapping and nested spans. You can use it via the spans.manual recipe. (It's separate from the NER labelling workflows because the data you create with it can't be used to train a regular named entity recognizer: those model implementations typically predict single, token-based tags. But in the future, spaCy will provide a SpanCategorizer component for predicting arbitrary spans!)

Future updates and todos

  • Include per-label scores in training logs. This is no problem in spaCy v3 because the logging is fully customizable – the main question is whether this feature should live in spaCy or Prodigy.
  • Some cool new workflows using beam search for NER and transformer-based pipelines – this is all much easier in spaCy v3, so there's a lot to explore. Some ideas include: visualize multiple layers of the beam with heatmap-style colours so you can see the possible scored predictions, automatically include high-confidence predictions in the dataset (with occasional checks to see if the threshold is okay)... Maybe you have some cool ideas as well! :smiley:
12 Likes

Such an amazing tool!!! I'm in love after having tried several others. The plan now is to automate adding content to the training set and have it running on a juicier Linux box – how do I request the nightly Linux build? I just received the Mac build.

Thanks, very excited to get started with this. For anyone else getting started: if you are encountering any errors, you may first want to install spaCy with their handy setup wizard:

In my case this solved an error I ran into.

2 Likes

Thanks, that's nice to hear! :blush: The Linux build should be available via your download link. If you're downloading the latest stable version, it's a separate download button you can click. If you're downloading the nightly version, it should be included in the zip you download. If you're having problems, let us know – maybe there's an issue with our online store provider.

Is it at all possible to also export the annotations in the (old) JSON format? Right now it works great with the binary output, but I would like to be able to see the data, which the JSON format allowed me to do.

Hello! Quick bit of feedback as I'm transitioning a model from a spaCy 2-generation Prodigy training test to a spaCy 3-generation "big kid" model, config and all: it wasn't clear how the configs generated by the spaCy quickstart, the spacy init fill-config CLI mechanism, and prodigy data-to-spacy fit together. To be honest, it was a valuable learning experience to dive into the config system, so perhaps that's by design! But some guidance in the documentation for the next version of Prodigy on a recommended workflow for config generation would be helpful! Mine was: 1. download a base config from the spaCy quickstart, 2. fill the config from the CLI, 3. compare the data-to-spacy output config with that filled config and adjust where necessary.
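
For anyone following a similar route, the spaCy side of that workflow was roughly this (file names are placeholders):

# base_config.cfg comes from the quickstart widget on the spaCy website
python -m spacy init fill-config base_config.cfg config.cfg
# then diff config.cfg against the config produced by prodigy data-to-spacy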

At the moment, data-to-spacy only exports the binary format and we do want to phase out the old JSON format, since it's pretty specific and not really used for anything else. But I definitely see your point about viewing the data.

What you can do at the moment is use spaCy's DocBin to load the binary data and then either call the docs_to_json helper to produce the old JSON format, or just use the Doc objects directly and output the data you need, or visualize them.

import spacy
from spacy.tokens import DocBin
from spacy.training import docs_to_json

# Any pipeline works here – we only need its vocab to deserialize the docs
nlp = spacy.blank("en")

# Load the binary training data exported by data-to-spacy
doc_bin = DocBin().from_disk("./train.spacy")
docs = list(doc_bin.get_docs(nlp.vocab))

# Inspect the annotations as Doc objects
print(docs[0].ents)

# Generate v2-style JSON data
json_data = docs_to_json(docs)

We'll definitely have more detailed docs on this for the stable release! The idea behind data-to-spacy is that it auto-generates you a starter config that you can train with straight away if you want to. It does nothing special or Prodigy-specific – it just sets up the components that Prodigy already knows you want to train (since you're converting data from it). Alternatively, you can also just use the exported .spacy files with your existing config, e.g. by providing the --paths.train and --paths.dev overrides on the CLI.
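
For example, with your own config (all paths here are placeholders):

# reuse an existing config and point it at the exported .spacy files via overrides
python -m spacy train ./your_config.cfg --output ./output_dir --paths.train ./corpus/train.spacy --paths.dev ./corpus/dev.spacy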

Thanks! All figure-outable, and it was good to dive into the config and demystify things. Thanks for the great stuff.

Using the new nightly build and following along with the video tutorial on food ingredients NER, I see that --init-tok2vec is no longer an available parameter. Is there another way to specify the tok2vec weights?

Running prodigy train --ner food_data en_vectors_web_lg --init-tok2vec ./tok2vec_cd8_model289.bin --output ./tmp_model --eval-split 0.2

Thank you!

Yes, you can now specify this all in your config when training your spaCy model: Data formats · spaCy API Documentation. Since you can override config settings via the CLI, you can simply do --paths.init_tok2vec /some/path :slightly_smiling_face:
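
Applied to the command above, that might look roughly like this (leaving out the vectors model for simplicity, and with the override placed after the regular arguments):

prodigy train ./tmp_model --ner food_data --eval-split 0.2 --paths.init_tok2vec ./tok2vec_cd8_model289.bin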

Ah yes, a simple prodigy train --help goes a long way! I see the closer integration between the spaCy config and Prodigy training. What was throwing me off was the positional argument for the output path. The overrides needed to go at the end. Thank you! :smile:

2 Likes

Hi,

I just wanted to share that I'm getting an error when trying to install the beta on a Mac, which is up to date.

pip install ./prodigy*.whl
ERROR: prodigy-1.10.7-cp36.cp37.cp38-cp36m.cp37m.cp38-macosx_10_14_x86_64.whl is not a supported wheel on this platform.

I know there are workarounds for this, but this may be useful to know.

I'm running macOS 11.2.3

Best,

Jacobo

I ran into a similar issue, and upgrading pip and setuptools within a new venv fixed it for me. I'm on the same macOS version as you:


python -m venv .env
source .env/bin/activate

pip install --upgrade pip
pip install -U pip setuptools wheel

2 Likes

Hi,

I'm getting No such file or directory: 'output_model/model-last' when running prodigy train -l en -n my_dataset -m blank:en output_model.

It feels like it is expecting the directory to already exist. After creating output_model/model-last manually it does work, but I suppose users will expect the folder to be created for them.

Thanks!

Hmm, I can't reproduce this :thinking: It saves out fine for me, even if the top-level directory doesn't exist. (Just tried it with --training.max_epochs 1 --training.max_steps 1 so it saves after the first step/epoch). Which version of spaCy are you running? Maybe this was actually an issue in an older version of spaCy v3?

I can reproduce it by deleting the folder. I'm using prodigy-1.11.0a4 and spacy==3.0.5 in a virtualenv on a Mac.

Checking the spaCy code, it looks like the to_disk function calls mkdir on a Path object with the value model/model-last without setting parents=True, which defaults to False (pathlib — Object-oriented filesystem paths — Python 3.9.2 documentation). I think that might be it.

Hi @jorgebastida!

I can actually reproduce your issue with an older version of Prodigy, but it seems to have been fixed as of prodigy-1.11.0a5 :slight_smile:

Hi Ines,

I have spaCy 3.0.5 with prodigy-1.10.7 and I am not able to use Prodigy, because I am getting the error "ModuleNotFoundError: No module named 'spacy.gold'". I have applied for Prodigy nightly, but I did not get a download link yet. Could you clarify whether prodigy-1.10.7 works with spaCy 3.0.5?
Also, could you let me know how long it takes to generate the download link for Prodigy nightly?

FYI:
I had spaCy 2.3.5 installed on my system. When I installed Prodigy 1.10.7, spaCy 2.3.5 was automatically uninstalled and 3.0.5 was installed instead.

Thanks,
Debo

The latest stable version of Prodigy (v1.10.x) requires spaCy v2.x and will not work with spaCy v3.x. Prodigy nightly requires spaCy v3.x.

That's strange and definitely shouldn't happen – Prodigy pins its requirements very explicitly. Maybe you ended up in a weird environment state? Sometimes the best solution is to just remove the virtual environment and install again from scratch. Also, which package manager do you use, pip or conda? If it's conda, I wonder if its dependency resolver somehow pulled in spaCy v3 because of something else you had already installed in your environment. Maybe also check your install logs to see if it indicates what caused it to upgrade the version.
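
For example, a from-scratch reinstall might look something like this (reusing the wheel from your download link; the environment name is arbitrary):

# create and activate a fresh virtual environment
python -m venv fresh_env
source fresh_env/bin/activate

# upgrade packaging tools, then install the downloaded wheel
pip install -U pip setuptools wheel
pip install ./prodigy*.whl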