Migration from spaCy 2.3 to 3.x + Annotating data in Prodigy

I am finally migrating my project away from spaCy 2.3 to spaCy 3.x, and I want to clear up some of my thoughts before getting started. Sorry about the long post, and thank you for taking the time to read.

spaCy 2.3 Model:

I used the en_core_web_lg model as a base model and, using the prodigy.correct recipe, added a few other labels to the model. The model was trained on the annotated data using prodigy.train. This model is used in the pipeline below.

Note: the text was tokenized with the default tokenizer that came with the en_core_web_lg model, and not with the custom tokenizer mentioned below.

Note: I have about 2k annotated examples.

The Current Pipeline:

  • custom tokenizer: The custom tokenizer is implemented similarly to how @adriane mentioned here. I added a check for the type of input to handle the cases appropriately. I needed a custom tokenizer for two purposes:
    1. add metadata to the doc object using a doc extension (doc._.meta) so that the metadata can be used by the pipeline components downstream. The custom tokenizer accepts both str and dict as inputs and returns a doc object with or without the metadata extension applied, depending on the given input.
    2. add a few custom rules to tokenize the data given our use case.
  • tagger and parser: same as the en_core_web_lg model. This was not trained during prodigy.train, only the ner part was trained.
  • metadata: uses the doc._.meta attribute (a dict of terms) and a PhraseMatcher to match given texts and assign entity labels to them. The reason for putting this component before ner is based on the discussion with @ines here.
  • ner: this was the component of en_core_web_lg that was trained with prodigy.train.
  • entity ruler: this component adds support for patterns loaded from .jsonl files. It does not overwrite the labels set by ner and is mostly used as a fallback in case ner misses something.
  • other custom components: there are a few other custom components that use regex, matcher and phrase matcher based matching to label other entities.

This model (trained with only 2k) has worked well (even though the model wasn't trained using the same tokenization rules as used in the pipeline).

Questions with regards to migration

Steps I think I will need to successfully migrate from spaCy 2.3 to 3.x:

  1. Ensure data integrity
    • From what I understand, the input data format for Prodigy is the same as before, so I should be able to reuse the annotated data I used to train the spaCy 2.3 model with prodigy.train to train a spaCy 3.x model with prodigy.train?
    • As I have a custom tokenizer now, how do I update/correct the tokenization of the previously annotated texts? As the tokenization rules have changed, would I need to reannotate all the texts?
      • How would I incorporate the custom tokenizer into the Prodigy workflow? (@adriane mentioned here that serialization of the custom tokenizer that I have implemented is not going to work.)
  2. Annotating the data
    • If I have to reannotate the text, should I start with a base model like "en_core_web_trf" with my custom tokenizer (I am assuming I have to load the model, add the custom tokenizer, save it to disk and then reload the model when using prodigy.correct) to get benefits of model in the loop?
  3. Training the model
    • Once I have the annotated data with my entities of choice and the tokens split based on my custom tokenization rules, I can train the ner component using prodigy.train. Should I also be able to train the tagger and parser with the same data? Should I train on top of an existing model or a blank model?
  4. Plug the model into the pipeline above, and add spaCy 3.x specific configurations.

Note: I have about 25k samples ready to be annotated.

Thank you for your time. I would appreciate it if you could help clarify the questions I have above. Super excited to complete this migration.

Hi! It should hopefully be pretty easy to make the switch. If you haven't seen it yet, the migration guide has an overview of the most relevant changes from spaCy v2.x to 3.x, which in your case will mostly be adjusting how your custom pipeline components are registered: https://spacy.io/usage/v3#migrating
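To give a rough idea of what registering a custom component looks like in v3, here is a minimal sketch of a metadata-based matcher. It assumes that doc._.meta maps an entity label to a list of terms, which may not be how your extension is structured; the names MetadataMatcher and "metadata_matcher" are made up for illustration.

```python
import spacy
from spacy.language import Language
from spacy.matcher import PhraseMatcher
from spacy.tokens import Doc, Span
from spacy.util import filter_spans

# Assumption: doc._.meta is a dict of {entity label: [terms]}.
if not Doc.has_extension("meta"):
    Doc.set_extension("meta", default=None)


class MetadataMatcher:
    """Hypothetical component: label terms listed in doc._.meta as entities."""

    def __init__(self, nlp):
        self.nlp = nlp

    def __call__(self, doc):
        meta = doc._.meta or {}
        if not meta:
            return doc
        # The terms differ per document, so the matcher is built per call.
        matcher = PhraseMatcher(doc.vocab, attr="LOWER")
        for label, terms in meta.items():
            matcher.add(label, [self.nlp.make_doc(term) for term in terms])
        spans = [Span(doc, start, end, label=match_id)
                 for match_id, start, end in matcher(doc)]
        # Merge with entities that are already set and drop overlaps.
        doc.ents = filter_spans(list(doc.ents) + spans)
        return doc


# In v3, components are registered under a string name ...
@Language.factory("metadata_matcher")
def create_metadata_matcher(nlp, name):
    return MetadataMatcher(nlp)


nlp = spacy.blank("en")
# ... and added to the pipeline by that name rather than as an object.
nlp.add_pipe("metadata_matcher")

doc = nlp.make_doc("We annotate with Prodigy and train with spaCy.")
doc._.meta = {"PRODUCT": ["prodigy", "spacy"]}
doc = nlp.get_pipe("metadata_matcher")(doc)
ents = [(ent.text, ent.label_) for ent in doc.ents]
print(ents)
```

Because the component is referenced by its string name, the same name can later be used in the training config, as long as the file with the registration is importable at that point.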

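Yes, that's correct: the input data format is the same, so you can reuse your existing annotations to train a spaCy 3.x model.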

Whether you need to re-annotate really depends on the types of changes and whether the new tokenization rules actually produce invalid annotations (e.g. entity spans). If your custom tokenizer changes the token boundaries in a way that your entity spans don't map to valid start/end tokens anymore, then yes, you'd probably need to re-annotate, but that's also pretty unlikely. You can easily test this by running your new tokenizer over your annotations and calling doc.char_span(start, end) with the "start" and "end" offsets of the annotated spans. If it returns None, the span doesn't map to valid token boundaries. Otherwise, the span is compatible with your tokenization. If there are only one or two cases that are incompatible, you can easily fix those by hand or exclude them.
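The check described above could look roughly like this. It's only a sketch: spacy.blank("en") stands in for the pipeline with your custom tokenizer, the two examples are made up, and find_misaligned_spans is a hypothetical helper name. The examples follow the usual Prodigy shape with "text" and "spans".

```python
import spacy


def find_misaligned_spans(nlp, examples):
    """Collect (text, span) pairs whose offsets don't match token boundaries."""
    misaligned = []
    for eg in examples:
        doc = nlp.make_doc(eg["text"])  # runs only the tokenizer
        for span in eg.get("spans", []):
            # char_span returns None if start/end fall inside a token
            if doc.char_span(span["start"], span["end"]) is None:
                misaligned.append((eg["text"], span))
    return misaligned


# Stand-in for the pipeline that uses the new tokenizer
nlp = spacy.blank("en")

# Made-up examples; in practice these would come from your dataset
examples = [
    {"text": "Apple opened an office in Berlin.",
     "spans": [{"start": 0, "end": 5, "label": "ORG"},
               {"start": 26, "end": 32, "label": "GPE"}]},
    {"text": "Apple opened an office in Berlin.",
     "spans": [{"start": 0, "end": 3, "label": "ORG"}]},  # ends mid-token
]

bad = find_misaligned_spans(nlp, examples)
print(len(bad), "span(s) need a second look")
```

If the list comes back empty for your 2k examples, the existing annotations should be safe to reuse with the new tokenizer.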

The normal workflow would be to create a spaCy pipeline with your custom tokenizer, package it with spacy package, install it in your Prodigy environment and then use its name as the input model in Prodigy.

During annotation, you could also edit the recipe script to hack in your custom tokenizer, but this easily gets messy because you also want to make sure that you have the same tokenizer available during training. So looking at the thread, it might just be easiest to add some simple serialization methods to your custom tokenizer and then add it to your config: https://spacy.io/usage/linguistic-features#custom-tokenizer-training
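As a rough sketch of what "simple serialization methods" could mean here: the wrapper below accepts a str or a dict, delegates the actual tokenization to spaCy's default tokenizer, and forwards to_bytes/from_bytes/to_disk/from_disk to it. The class name MetaTokenizer, the registered name "meta_tokenizer.v1" and the {"text": ..., "meta": ...} input shape are assumptions for illustration, not something taken from your code.

```python
import spacy
from spacy.tokens import Doc

if not Doc.has_extension("meta"):
    Doc.set_extension("meta", default=None)


class MetaTokenizer:
    """Hypothetical wrapper: accepts str or dict input, sets doc._.meta."""

    def __init__(self, tokenizer):
        self.tokenizer = tokenizer  # the wrapped spaCy Tokenizer

    def __call__(self, inp):
        if isinstance(inp, dict):
            doc = self.tokenizer(inp["text"])
            doc._.meta = inp.get("meta")
            return doc
        return self.tokenizer(inp)

    # Serialization is delegated to the wrapped tokenizer, so the
    # pipeline can be saved, packaged and loaded as usual.
    def to_bytes(self, **kwargs):
        return self.tokenizer.to_bytes(**kwargs)

    def from_bytes(self, data, **kwargs):
        self.tokenizer.from_bytes(data, **kwargs)
        return self

    def to_disk(self, path, **kwargs):
        self.tokenizer.to_disk(path, **kwargs)

    def from_disk(self, path, **kwargs):
        self.tokenizer.from_disk(path, **kwargs)
        return self


@spacy.registry.tokenizers("meta_tokenizer.v1")
def create_meta_tokenizer():
    def create_tokenizer(nlp):
        # Build spaCy's default tokenizer for this language and wrap it;
        # custom tokenization rules would be added to it here.
        default_factory = spacy.registry.tokenizers.get("spacy.Tokenizer.v1")()
        return MetaTokenizer(default_factory(nlp))
    return create_tokenizer


nlp = spacy.blank("en")
nlp.tokenizer = create_meta_tokenizer()(nlp)

doc = nlp.tokenizer({"text": "Hello world", "meta": {"ORG": ["acme"]}})
plain = nlp.tokenizer("Hello world")
data = nlp.tokenizer.to_bytes(exclude=["vocab"])
restored = nlp.tokenizer.from_bytes(data, exclude=["vocab"])
```

With the function registered, the config can point to it by name in the [nlp.tokenizer] block, and the same file can be provided during training so the tokenizer is available there as well.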

Whether to start from a base model depends on what your categories are: if the en_core_web models already predict some of them, then yes, you could use them to help you pre-annotate the data so you have less work to do. And you should also use the same tokenizer during annotation, because otherwise, you may end up creating annotations that the model can't learn from (because it never produces these tokens during training or at runtime). But as I said above, double-check your annotations first because you might not need to re-annotate anything at all.

Are you sure you actually want to train the tagger and parser? You'd need a lot of annotated data for that so unless there's a lot you want to improve on your custom data, I'd probably leave the tagger and parser alone.

Whether to train on top of an existing model or a blank one depends on what you need. If you just want to train a new entity recognizer, you can use a blank model. If you want to keep the previous components, you can train from an existing model. However, when updating an existing entity recognizer, make sure that your labels don't clash with what the model already predicts, otherwise you'll easily get worse results.

When you run prodigy train and prodigy data-to-spacy, Prodigy will auto-generate a training config for you that you can inspect (and edit if you need to). You can also use the quickstart generator to create a config, update it with your custom components and then provide it to Prodigy during training using the --config option.

For the use case and pipeline you describe, you probably want your config to specify the tokenizer via your custom function, the tagger and parser sourced from an existing pipeline like en_core_web_lg or en_core_web_trf and frozen (so they're not updated) + your custom components and the ner component (either sourced from an existing pipeline or initialised from scratch).
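Before writing this into a config, the component order can be prototyped in Python. The sketch below uses a blank pipeline so it runs without any installed model package; the patterns and the example sentence are made up, and the entity ruler is set up as a fallback that doesn't overwrite existing entities. With a trained pipeline installed, the tagger and parser could be sourced via nlp.add_pipe("tagger", source=source_nlp) and so on, along with whatever shared embedding component they depend on.

```python
import spacy
from spacy.tokens import Span

nlp = spacy.blank("en")

# New entity recognizer, initialised from scratch (untrained in this sketch)
nlp.add_pipe("ner")

# Fallback ruler after ner that keeps entities that are already set
ruler = nlp.add_pipe("entity_ruler", after="ner",
                     config={"overwrite_ents": False})
# Made-up token patterns; yours would be loaded from the .jsonl files
ruler.add_patterns([
    {"label": "PRODUCT", "pattern": [{"LOWER": "acme"}]},
    {"label": "PERSON", "pattern": [{"LOWER": "jane"}, {"LOWER": "doe"}]},
])
print(nlp.pipe_names)

# Simulate ner having predicted "Acme Corp" and apply only the ruler
doc = nlp.make_doc("Acme Corp hired Jane Doe.")
doc.ents = [Span(doc, 0, 2, label="ORG")]
doc = ruler(doc)
ents = [(ent.text, ent.label_) for ent in doc.ents]
print(ents)
```

The "Acme" pattern is skipped because those tokens already carry an entity label, while "Jane Doe" is added since nothing was predicted there.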
