Training multiple spaCy models in Prodigy

Hi everyone, I've been using spaCy for a year and just started with Prodigy, but I need some advice on how to label data for multiple models that I ultimately want to combine into a single spaCy pipeline.

I'm labelling long texts for several components:

  - span categorization to classify paragraphs
  - text classification to classify sentences within paragraphs
  - a second span categorization to classify phrases within sentences
  - NER to identify custom entities within phrases
  - relations to link some of the phrases
  - and finally a second relations component to link some of the entities

Can I just train each of these models separately on the same texts with Prodigy? Will I be able to combine the trained models later into a single spaCy pipeline, or is there something specific I need to do now, during training, to avoid problems when moving to spaCy?

Hi! During development, I'd definitely recommend running some experiments for the components separately, because you might want to try out different label schemes and iterate on the data for each component until you're happy. That's a lot easier if you're focusing on one component at a time.

Later on, you can then train all components together or export data for all components. The prodigy train and data-to-spacy commands will take care of this for you: you'll be able to provide datasets for the different components and Prodigy will merge all annotations on the same input text into one example: https://prodi.gy/docs/recipes#data-to-spacy
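For example, something along these lines (a rough sketch, with placeholder output directory and dataset names):

prodigy data-to-spacy ./corpus --ner ner_dataset --textcat textcat_dataset --spancat spans_dataset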

One thing to keep in mind for the span categorizer: if you're training multiple spancat components in the same pipeline, you probably want to name them differently and use a different spans_key for each in the config, so the annotations predicted by the components are stored under different keys in doc.spans. You can customise this in the config by setting the spans_key on the component: https://spacy.io/api/spancategorizer#config In this case, you probably want to run data-to-spacy twice with a different --config that specifies a different key for each component:

[components.spancat]
spans_key = "key1"
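For the second data-to-spacy run, you'd then use a config that sets the other key ("key1" and "key2" are just placeholder names here):

[components.spancat]
spans_key = "key2"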

Alternatively, you can also train separate models for each component and then combine them later by sourcing them in your config. The spacy assemble command lets you put it all together into a single pipeline: https://spacy.io/api/cli#assemble

[components.ner]
source = "/path/to/your_trained_ner_model"

[components.spancat1]
source = "/path/to/your_spancat_model1"
component = "spancat"

[components.spancat2]
source = "/path/to/your_spancat_model2"
component = "spancat"

[components.textcat]
source = "/path/to/your_textcat_model"

Hi Ines

Thank you for your thorough response and happy new year! Very helpful info.

I just finished labeling my first dataset and trained a spaCy model on it. I've started writing my prediction script to see how the model performs on real data, and while doing that I found the "sc" key - that looks really useful. [Side note: one of the things I love about spaCy (and Prodigy so far) is how relevant the API is without becoming cumbersome.]

I've been looking for a way to test my model predictions on new data using Prodigy, because the interface would be so much easier than reading JSON output for hundreds of spans. It looks like the teach recipe doesn't work for span categorization. Is there a way to use the Prodigy UI to look at Spacy model output?

Hi @mechanicdj :slight_smile:

First you need to convert your model's output into a JSONL file that follows the spans annotation format (see the Span Categorization docs on prodi.gy). Double-check the token IDs and span offsets to avoid errors.
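A rough sketch of what that conversion could look like (the model path, spans key and file names here are placeholders, so adjust them to your pipeline and double-check the output against the format in the docs):

import spacy
import srsly

nlp = spacy.load("./my_spancat_model")  # path to your trained pipeline
spans_key = "sc"  # the key your span categorizer writes to in doc.spans

def to_prodigy_task(doc):
    # Prodigy-style tokens, so the spans line up with a known tokenization
    tokens = [
        {"text": t.text, "start": t.idx, "end": t.idx + len(t.text), "id": t.i, "ws": bool(t.whitespace_)}
        for t in doc
    ]
    spans = [
        {
            "start": span.start_char,
            "end": span.end_char,
            "token_start": span.start,
            "token_end": span.end - 1,  # token_end is inclusive in Prodigy
            "label": span.label_,
        }
        for span in doc.spans.get(spans_key, [])
    ]
    return {"text": doc.text, "tokens": tokens, "spans": spans}

texts = [eg["text"] for eg in srsly.read_jsonl("new_texts.jsonl")]  # your input texts
srsly.write_jsonl("my_model_output.jsonl", (to_prodigy_task(doc) for doc in nlp.pipe(texts)))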

Once you have a JSONL file, say my_model_output.jsonl, you can now use the db-in command to load your "annotations" and view them from Prodigy:

prodigy db-in new_dataset path/to/my_model_output.jsonl

Then just run Prodigy while referencing new_dataset.

To add to this, you can also use the spans.correct workflow to view the model's predictions in the Prodigy UI: https://prodi.gy/docs/recipes#spans-correct

This also supports a model with a span categorizer trained on multiple span categories, which are then displayed as separate labels. The workflow is typically intended to correct a model's predictions, but you can also just use it to view them and skip through them in the browser :slight_smile:
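For example (the dataset name, model path and source file are placeholders):

prodigy spans.correct spans_eval ./my_spancat_model ./new_texts.jsonl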

Hi LJV and Ines

Thank you both for the suggestions. I had seen the spans.correct recipe, but the description didn't click for me. I thought it was for creating gold-standard data (hence it seemed risky to play with), but now I understand :slight_smile:

The model output/db import option also seems very simple and doable.

Hi Ines

I've tried this approach and it works fine. Please see below for a few feedback points and questions; I'd be interested in your thoughts if you don't mind.

  1. The default whitespace tokenization isn't correct for my domain, so I created and saved a simple model consisting of a custom tokenizer on top of blank:en. I passed the model to spans.correct (running on a VM) but got errors that Prodigy couldn't find the custom tokenizer. I thought it would be sufficient to pass in the saved model; what else do I need to do on the VM so Prodigy can load my tokenizer model?

  2. I included a patterns file with optionals, e.g. {"ORTH": "ABC"}, {"ORTH": "DEF", "OP": "?"} to match ABC or ABC DEF, but it matched both, so I had to remove the label from ABC everywhere it found ABC DEF. Is there a setting to force Prodigy to use non-overlapping spans, or to select the longest match from the rules?

  3. While getting familiar with the data with spans.manual or spans.correct, a few times I realized after a few dozen examples that I had to go back and fix things I'd already accepted (no longer in the history list). Is there a way to search accepted examples containing e.g. "TOP" so I can update my spans to "ON TOP" and "TOP OF"? A search function (similar to the history list) would help me get a list of examples and go through them to fix something specific without restarting the entire labelling.

  4. I used spans.manual to label 1000 examples and trained a model. When I used spans.correct with the model, Prodigy started with the same 1000 examples. This seems unnecessary as they'd all be correct (the model was trained on them), so I wasted 30 minutes just clicking past them. Then I did an additional 1000 examples (so 2000 in total), and again when I used spans.correct it started with the same 2000 (this time it took an hour just to click through the 2000 before I got to "new predictions"). Can the data be shuffled so I'm looking at new predictions, or should I do it myself (i.e. use a different data file each time)?

  5. Correcting short sentences (1-2 lines) which have a common structure (label A on the left, label B in the middle, label C on the right) was tedious to do sentence by sentence. It would be 10x faster if I could glance over 5-10 sentences on the screen (making it a more visual task rather than a reading task) and accept them all at once, or click on one to edit/reject it. This might be a useful feature to speed up labeling for simple cases.

If your tokenizer is implemented using custom code, you'll also need to provide the path to the code to execute. In Prodigy you can do this using the -F flag, e.g. -F tokenizer.py.
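As a sketch of what that could look like, assuming your tokenizer is registered via spaCy's tokenizers registry (the registry name and the extra hyphen rule are just placeholders for however you actually built yours):

import spacy
from spacy.tokenizer import Tokenizer
from spacy.util import compile_prefix_regex, compile_suffix_regex, compile_infix_regex

@spacy.registry.tokenizers("my_custom_tokenizer")
def create_custom_tokenizer():
    def create_tokenizer(nlp):
        # example: also split on hyphens, on top of the default rules
        infixes = list(nlp.Defaults.infixes) + [r"-"]
        return Tokenizer(
            nlp.vocab,
            rules=nlp.Defaults.tokenizer_exceptions,
            prefix_search=compile_prefix_regex(nlp.Defaults.prefixes).search,
            suffix_search=compile_suffix_regex(nlp.Defaults.suffixes).search,
            infix_finditer=compile_infix_regex(infixes).finditer,
        )
    return create_tokenizer

If your saved model's config.cfg points [nlp.tokenizer] at that registered name, passing the file with -F (for example, prodigy spans.correct my_dataset ./my_model ./data.jsonl -F tokenizer.py) makes the registered function available when the model is loaded on the VM.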

By default, the spans.manual recipe will show you all matches, since it supports overlapping spans. That said, you could make a small modification to the recipe so the pattern matching behaves like it does for non-overlapping entities: in that case, the (first) longest span will be preferred.

You can find the recipe in recipes/spans.py in your Prodigy installation (run prodigy stats to see the path to your installation). You can then look for allow_overlap=True and set it to False in the call to the PatternMatcher.
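The relevant call looks roughly like this (the exact arguments may differ between Prodigy versions, so adapt rather than copy):

# in recipes/spans.py, where the pattern matcher is created
matcher = PatternMatcher(nlp, allow_overlap=False)  # changed from allow_overlap=True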

One thing to consider here is that Prodigy keeps the latest batch of examples on the client before sending the answers to the server and saving them to the database, to allow easy undoing without having to reconcile multiple conflicting annotations in the database. So you can only go back one batch, because the other examples have already been sent back to the server, saved in the database (or, if you're annotating with a model in the loop, used to update the model).

If you find that you often want to go back further, you can set a larger batch_size and history_size – just keep in mind that those examples have not been sent back to the server yet.
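Both are regular config settings, so you could for example put them in your prodigy.json (the values here are just illustrative):

{
  "batch_size": 20,
  "history_size": 20
}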

Are you saving the annotations to the same dataset? If you're saving to the same dataset or are using the same dataset with --exclude, you should only be seeing texts that haven't been annotated yet.
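For example, if your earlier annotations live in spans_dataset, something like this should skip the texts that are already annotated (the names are placeholders):

prodigy spans.correct spans_dataset ./my_spancat_model ./texts.jsonl --exclude spans_dataset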

You could just make your input multi-sentence documents and maybe separate them with a newline token? You can always retain the original character offsets so it's easy to later split them into sentences again if you need to.
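A minimal sketch of that idea, assuming the sentences come from a JSONL file with one {"text": ...} entry per line (file names and group size are placeholders):

import srsly

sentences = list(srsly.read_jsonl("sentences.jsonl"))
group_size = 8  # how many sentences to show per annotation task

tasks = []
for i in range(0, len(sentences), group_size):
    group = [eg["text"] for eg in sentences[i:i + group_size]]
    # record each sentence's character offsets so the doc can be split again later
    offsets, pos = [], 0
    for sent in group:
        offsets.append({"start": pos, "end": pos + len(sent)})
        pos += len(sent) + 1  # +1 for the newline separator
    tasks.append({"text": "\n".join(group), "meta": {"sent_offsets": offsets}})

srsly.write_jsonl("multi_sentence_input.jsonl", tasks)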