Gotcha. Then I think the following steps are what you need.
Step 1: Annotation
Label some data. In my case I've made this example dataset:
{"text": "hi my name is Vincent"}
{"text": "people call me Noa"}
And I've annotated these two examples in Prodigy via:
python -m prodigy ner.manual issue-5923 blank:en examples.jsonl --label name
This saves the NER annotations in a dataset named issue-5923.
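In case it's useful, you could also generate that examples.jsonl file from Python instead of writing it by hand. This is just a minimal sketch that reproduces the two demo examples above; the filename and contents simply mirror this example.

import json

# Two tiny demo examples, matching the dataset shown above.
examples = [
    {"text": "hi my name is Vincent"},
    {"text": "people call me Noa"},
]

# Prodigy expects newline-delimited JSON: one example per line.
with open("examples.jsonl", "w", encoding="utf-8") as f:
    for example in examples:
        f.write(json.dumps(example) + "\n")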
Step 2: Convert
Next, let's prepare the data so that spaCy can use it.
python -m prodigy data-to-spacy ./corpus --ner issue-5923 --eval-split 0.5
This creates a folder that contains a train.spacy file and a dev.spacy file. These are files in the binary format that spaCy expects (which also keeps things consistent and lightweight).
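If you're curious what ended up inside these binary files, you can peek at them with spaCy's DocBin class. This is purely an optional sanity check, and the path below assumes the ./corpus folder from the command above:

import spacy
from spacy.tokens import DocBin

# A blank English pipeline is enough to deserialize the annotated docs.
nlp = spacy.blank("en")
doc_bin = DocBin().from_disk("./corpus/train.spacy")
for doc in doc_bin.get_docs(nlp.vocab):
    print(doc.text, [(ent.text, ent.label_) for ent in doc.ents])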
Step 3: Config
Before running the train command we need to tell spaCy that we want to run a transformer model. A simple way to set that up is to go to the quickstart section of the spaCy docs and generate a config with transformers.
This generated the following config for me:
# This is an auto-generated partial config. To use it with 'spacy train'
# you can run spacy init fill-config to auto-fill all default settings:
# python -m spacy init fill-config ./base_config.cfg ./config.cfg
[paths]
train = null
dev = null
vectors = null
[system]
gpu_allocator = "pytorch"
[nlp]
lang = "en"
pipeline = ["transformer","ner"]
batch_size = 128
[components]
[components.transformer]
factory = "transformer"
[components.transformer.model]
@architectures = "spacy-transformers.TransformerModel.v3"
name = "roberta-base"
tokenizer_config = {"use_fast": true}
[components.transformer.model.get_spans]
@span_getters = "spacy-transformers.strided_spans.v1"
window = 128
stride = 96
[components.ner]
factory = "ner"
[components.ner.model]
@architectures = "spacy.TransitionBasedParser.v2"
state_type = "ner"
extra_state_tokens = false
hidden_width = 64
maxout_pieces = 2
use_upper = false
nO = null
[components.ner.model.tok2vec]
@architectures = "spacy-transformers.TransformerListener.v1"
grad_factor = 1.0
[components.ner.model.tok2vec.pooling]
@layers = "reduce_mean.v1"
[corpora]
[corpora.train]
@readers = "spacy.Corpus.v1"
path = ${paths.train}
max_length = 0
[corpora.dev]
@readers = "spacy.Corpus.v1"
path = ${paths.dev}
max_length = 0
[training]
accumulate_gradient = 3
dev_corpus = "corpora.dev"
train_corpus = "corpora.train"
[training.optimizer]
@optimizers = "Adam.v1"
[training.optimizer.learn_rate]
@schedules = "warmup_linear.v1"
warmup_steps = 250
total_steps = 20000
initial_rate = 5e-5
[training.batcher]
@batchers = "spacy.batch_by_padded.v1"
discard_oversize = true
size = 2000
buffer = 256
[initialize]
vectors = ${paths.vectors}
I've saved this configuration into a file called partial.cfg. As the comments at the top of the file confirm, this is only a partial configuration: it only contains the parts that define the transformer and the NER output. To turn it into a complete configuration you'll want to run the following command:
python -m spacy init fill-config partial.cfg full-config.cfg
This will generate a full-config.cfg file that has the missing pieces filled in.
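Optionally, before training you can also sanity-check the filled-in config against your data with spaCy's debug data command. It reports things like label counts and warnings about your training data:

python -m spacy debug data full-config.cfg --paths.train ./corpus/train.spacy --paths.dev ./corpus/dev.spacy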
Step 4: Train
Finally, you'll need to run spaCy's train command to train the transformer model.
python -m spacy train full-config.cfg --paths.train ./corpus/train.spacy --paths.dev ./corpus/dev.spacy
Note that I'm manually specifying the --paths.train and --paths.dev parameters here. You should be able to connect these parameters with the settings in the full-config.cfg file. Here's what the output looked like on my machine:
ℹ No output directory provided
ℹ Using CPU
=========================== Initializing pipeline ===========================
[2022-09-13 10:39:51,656] [INFO] Set up nlp object from config
[2022-09-13 10:39:51,664] [INFO] Pipeline: ['transformer', 'ner']
[2022-09-13 10:39:51,667] [INFO] Created vocabulary
[2022-09-13 10:39:51,668] [INFO] Finished initializing nlp object
/home/vincent/Development/prodigy-demos/venv/lib/python3.8/site-packages/torch/amp/autocast_mode.py:198: UserWarning: User provided device_type of 'cuda', but CUDA is not available. Disabling
warnings.warn('User provided device_type of \'cuda\', but CUDA is not available. Disabling')
[2022-09-13 10:40:01,957] [INFO] Initialized pipeline components: ['transformer', 'ner']
✔ Initialized pipeline
============================= Training pipeline =============================
ℹ Pipeline: ['transformer', 'ner']
ℹ Initial learn rate: 0.0
E # LOSS TRANS... LOSS NER ENTS_F ENTS_P ENTS_R SCORE
--- ------ ------------- -------- ------ ------ ------ ------
0 0 7.57 3.52 0.00 0.00 0.00 0.00
200 200 164.92 130.18 100.00 100.00 100.00 1.00
...
You'll notice that I'm getting warnings about the lack of a GPU. That's the main downside of transformer models: without heavy compute resources they take a really long time to train (which is why I personally try to avoid them when possible).
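Two related notes on the train command. First, I didn't pass an --output directory above, so the trained pipeline isn't saved to disk; in practice you'll want to add one. Second, if you do have a GPU, you can pass --gpu-id to use it. A sketch of what that could look like (the ./training path is just an example):

python -m spacy train full-config.cfg --paths.train ./corpus/train.spacy --paths.dev ./corpus/dev.spacy --output ./training --gpu-id 0

After training, spaCy writes a model-best folder into the output directory, which you can load like any other pipeline:

import spacy

# Load the best checkpoint from the (example) output directory.
nlp = spacy.load("./training/model-best")
doc = nlp("hi my name is Vincent")
print([(ent.text, ent.label_) for ent in doc.ents])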
Final Notes
You might notice that there are a bunch of commands here. To keep everything consistent, it's recommended to use the spaCy projects feature. You can learn more about it in our docs or in this YouTube video.
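To give an impression, here's a minimal sketch of what a project.yml for this workflow could look like. The names and paths are just placeholders, not an official template:

title: "Transformer NER demo"
vars:
  config: "full-config.cfg"
commands:
  - name: "train"
    help: "Train the transformer NER pipeline"
    script:
      - "python -m spacy train ${vars.config} --paths.train corpus/train.spacy --paths.dev corpus/dev.spacy --output training"

With that in place you'd run the step via python -m spacy project run train.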