I am finally migrating my project from spaCy 2.3 to spaCy 3.x, and I want to clear up a few things before getting started. Sorry about the long post, and thank you for taking the time to read.
spaCy 2.3 Model:
I used the `en_core_web_lg` model as a base model and, using the `ner.correct` recipe, added a few other labels to it. The model was trained on the annotated data using `prodigy train`. This model is used in the pipeline below.
Note: the text was tokenized with the default tokenizer that came with the `en_core_web_lg` model, not with the custom tokenizer mentioned below.
Note: I have about 2k annotated examples.
The Current Pipeline:
- custom tokenizer: implemented similarly to what @adriane described here, with an added check on the type of input to handle each case appropriately. I needed a custom tokenizer for two purposes (a stripped-down sketch follows after this list):
  - add metadata to the doc object via a doc extension (`doc._.meta`) so that the metadata can be used by downstream pipeline components. The tokenizer accepts both `str` and `dict` as input and returns a doc object with or without the metadata extension set, depending on the input.
  - add a few custom rules to tokenize the data for our use case.
- tagger and parser: same as the `en_core_web_lg` model. These were not trained during `prodigy train`; only the `ner` component was.
- metadata: uses the `doc._.meta` attribute (a dict of terms) and a PhraseMatcher to match the given texts and assign entity labels to them. The reason for putting this component before `ner` is based on the discussion with @ines here. (A migration sketch of this component follows below.)
- ner: the component of `en_core_web_lg` that was trained with `prodigy train`.
- entity ruler: adds support for patterns loaded from `.jsonl` files. It does not overwrite labels set by `ner` and is mostly used as a fallback in case `ner` misses something.
- other custom components: a few other components that use regex-, matcher-, and phrase-matcher-based matching to label other entities.
This model (trained on only 2k examples) has worked well, even though it wasn't trained with the same tokenization rules used in the pipeline.
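For reference, porting the metadata component to spaCy 3.x should look roughly like the sketch below. It assumes `doc._.meta` maps a label to a list of phrases, which is close to, but not exactly, what my component does:

```python
from spacy.language import Language
from spacy.matcher import PhraseMatcher
from spacy.tokens import Span
from spacy.util import filter_spans

@Language.factory("meta_matcher")
class MetaMatcher:
    """Match phrases from doc._.meta and set them as entities."""

    def __init__(self, nlp, name):
        self.nlp = nlp
        self.name = name

    def __call__(self, doc):
        if not doc._.meta:
            return doc
        matcher = PhraseMatcher(doc.vocab, attr="LOWER")
        for label, phrases in doc._.meta.items():
            matcher.add(label, [self.nlp.make_doc(p) for p in phrases])
        spans = [Span(doc, start, end, label=match_id)
                 for match_id, start, end in matcher(doc)]
        # keep the longest non-overlapping spans, preserving existing entities
        doc.ents = filter_spans(list(doc.ents) + spans)
        return doc

# added before ner so the trained component sees these entities:
# nlp.add_pipe("meta_matcher", before="ner")
```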
Questions regarding the migration
Steps I think I will need to successfully migrate from spaCy 2.3 to 3.x:
- Ensure data integrity
  - From what I understand, the input data format for Prodigy is the same as before, so I should be able to reuse the annotated data that trained the spaCy 2.3 model to train a spaCy 3.x model with `prodigy train`?
  - As I now have a custom tokenizer, how do I update/correct the tokenization of the previously annotated texts? Since the tokenization rules have changed, would I need to re-annotate all the texts? (I sketch an alignment check below.)
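For the second point, rather than guessing, I plan to measure how many existing spans still align under the new rules with a quick script. File names are placeholders, and the data is assumed to be a Prodigy-style JSONL export with `text` and `spans`:

```python
import spacy
import srsly

# blank English pipeline carrying only the new tokenization rules
nlp = spacy.blank("en")
nlp.tokenizer = MetaTokenizer(nlp.tokenizer)  # wrapper sketched above

misaligned = 0
total = 0
for eg in srsly.read_jsonl("annotations.jsonl"):
    doc = nlp.make_doc(eg["text"])
    for span in eg.get("spans", []):
        total += 1
        # char_span() returns None when the character offsets no longer
        # fall on token boundaries under the new tokenization rules
        if doc.char_span(span["start"], span["end"]) is None:
            misaligned += 1

print(f"{misaligned}/{total} spans no longer align to token boundaries")
```

If only a small fraction is misaligned, fixing those examples might beat re-annotating everything.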
- Annotate the data
  - If I have to re-annotate the texts, should I start with a base model like "en_core_web_trf" plus my custom tokenizer to get the benefits of having the model in the loop? (I am assuming I have to load the model, add the custom tokenizer, save it to disk, and then reload the model when using `ner.correct`; a sketch follows below.)
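Concretely, I assume the flow would be roughly this (paths are illustrative; serialization of the wrapper is the part I am least sure about):

```python
import spacy

nlp = spacy.load("en_core_web_trf")
nlp.tokenizer = MetaTokenizer(nlp.tokenizer)  # the wrapper sketched earlier

# For nlp.to_disk()/spacy.load() to round-trip, the wrapper also needs
# to_disk/from_disk (delegating to the wrapped tokenizer should do), and
# the saved config has to point at a registered tokenizer factory -- see
# the registration sketch at the end of this post.
nlp.to_disk("./trf_with_custom_tokenizer")
```

and then something like `prodigy ner.correct my_new_dataset ./trf_with_custom_tokenizer source.jsonl --label ...` (dataset and file names are placeholders).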
- Train the model
  - Now that I have the annotated data with my entities of choice, and the tokens split according to my custom tokenization rules, I can train the `ner` component using `prodigy train`. Should I also be able to train the `tagger` and `parser` with the same data? Should I train on top of an existing model or a blank one? (See the config sketch below.)
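On the existing-vs-blank question, my understanding is that spaCy 3.x makes this explicit in the training config: I can source the `tagger` and `parser` from the existing model and freeze them, while `ner` trains on my annotations. An abridged excerpt of what I have in mind:

```ini
[nlp]
lang = "en"
pipeline = ["tok2vec", "tagger", "parser", "ner"]

# reuse the already-trained components from the existing model...
[components.tagger]
source = "en_core_web_lg"

[components.parser]
source = "en_core_web_lg"

[training]
# ...and leave them untouched while ner is updated on the annotations
frozen_components = ["tagger", "parser"]
```

I gather that sourced components listening to a shared `tok2vec` may also need `replace_listeners = ["model.tok2vec"]`, but I would double-check that part.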
- Plug the model into the pipeline above and add the spaCy 3.x-specific configuration.
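For that last step, my understanding is that the custom tokenizer has to be exposed through the `@tokenizers` registry so a saved pipeline can be reloaded from its config. Something like the sketch below; chaining to `"spacy.Tokenizer.v1"` to build the default tokenizer before wrapping it is my assumption from reading the docs:

```python
from spacy.util import registry

@registry.tokenizers("meta_tokenizer")
def create_meta_tokenizer():
    def make_tokenizer(nlp):
        # build the stock tokenizer for this language, then wrap it
        default = registry.tokenizers.get("spacy.Tokenizer.v1")()(nlp)
        return MetaTokenizer(default)
    return make_tokenizer
```

with the pipeline's config pointing at it:

```ini
[nlp.tokenizer]
@tokenizers = "meta_tokenizer"
```

and the registering module passed in via `--code` wherever the pipeline is trained or loaded (e.g. `spacy train config.cfg --code custom_tokenizer.py`), if I understand correctly.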
Note: I have about 25k samples ready to be annotated.
Thank you for your time; I would appreciate it if you could help clarify the questions above. Super excited to complete this migration.