Transform annotations to match tokenization required for SpanBERT/BERT

Gotcha. In that case, I think the following steps are what you need.

Step 1: Annotation

Label some data. In my case I've made this example dataset:

{"text": "hi my name is Vincent"}
{"text": "people call me Noa"}

And I've annotated these two examples in Prodigy via:

python -m prodigy ner.manual issue-5923 blank:en examples.jsonl --label name

This saves the NER annotations in a dataset named issue-5923.
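If you want to double-check what ended up in that dataset, you can read it back from Python. This is just a sketch using Prodigy's database API and the dataset name from above:

from prodigy.components.db import connect

# Read the annotated examples back out of the "issue-5923" dataset created
# by the ner.manual session above.
db = connect()
examples = db.get_dataset("issue-5923")
for eg in examples:
    # accepted examples carry the raw text plus character-offset "spans"
    print(eg["text"], eg.get("spans", []))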

Step 2: Convert

Next, let's prepare the data so that spaCy can use it.

python -m prodigy data-to-spacy ./corpus --ner issue-5923 --eval-split 0.5

This creates a folder that contains a train.spacy and a dev.spacy file. These are in the binary .spacy format that spaCy expects for training (it also keeps things consistent and lightweight).
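If you're curious what ended up in those files, you can read them back with spaCy's DocBin. Here's a minimal sketch, assuming the ./corpus folder from the command above:

import spacy
from spacy.tokens import DocBin

# Load the converted training data and print the entity annotations, which
# have now been aligned to spaCy's tokenization.
nlp = spacy.blank("en")
doc_bin = DocBin().from_disk("./corpus/train.spacy")
for doc in doc_bin.get_docs(nlp.vocab):
    print(doc.text, [(ent.text, ent.label_) for ent in doc.ents])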

Step 3: Config

Before running the train command we need to tell spaCy that we want to use a transformer model. A simple way to set that up is to go to the quickstart section of the spaCy docs and generate a config with the transformer option enabled.

This generated the following config for me:

# This is an auto-generated partial config. To use it with 'spacy train'
# you can run spacy init fill-config to auto-fill all default settings:
# python -m spacy init fill-config ./base_config.cfg ./config.cfg
[paths]
train = null
dev = null
vectors = null
[system]
gpu_allocator = "pytorch"

[nlp]
lang = "en"
pipeline = ["transformer","ner"]
batch_size = 128

[components]

[components.transformer]
factory = "transformer"

[components.transformer.model]
@architectures = "spacy-transformers.TransformerModel.v3"
name = "roberta-base"
tokenizer_config = {"use_fast": true}

[components.transformer.model.get_spans]
@span_getters = "spacy-transformers.strided_spans.v1"
window = 128
stride = 96

[components.ner]
factory = "ner"

[components.ner.model]
@architectures = "spacy.TransitionBasedParser.v2"
state_type = "ner"
extra_state_tokens = false
hidden_width = 64
maxout_pieces = 2
use_upper = false
nO = null

[components.ner.model.tok2vec]
@architectures = "spacy-transformers.TransformerListener.v1"
grad_factor = 1.0

[components.ner.model.tok2vec.pooling]
@layers = "reduce_mean.v1"

[corpora]

[corpora.train]
@readers = "spacy.Corpus.v1"
path = ${paths.train}
max_length = 0

[corpora.dev]
@readers = "spacy.Corpus.v1"
path = ${paths.dev}
max_length = 0

[training]
accumulate_gradient = 3
dev_corpus = "corpora.dev"
train_corpus = "corpora.train"

[training.optimizer]
@optimizers = "Adam.v1"

[training.optimizer.learn_rate]
@schedules = "warmup_linear.v1"
warmup_steps = 250
total_steps = 20000
initial_rate = 5e-5

[training.batcher]
@batchers = "spacy.batch_by_padded.v1"
discard_oversize = true
size = 2000
buffer = 256

[initialize]
vectors = ${paths.vectors}

I've saved this configuration into a file called partial.cfg. As the comments at the top of the file mention, this is only a partial configuration: it only contains the parts that define the transformer and the NER output. To turn it into a complete configuration, you'll want to run the following command:

python -m spacy init fill-config partial.cfg full-config.cfg

This will generate a full-config.cfg file that has the missing pieces filled in.
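One more thing before training: the generated config uses roberta-base, but since you're after SpanBERT/BERT you can point the transformer at any model from the Hugging Face hub by changing the name setting under [components.transformer.model]. Here's a small sketch that builds the pipeline from the filled config, just to check everything is wired up; the "bert-base-cased" override is only an example model name, not something from the steps above:

import spacy

# Build the nlp object from the filled config to confirm the transformer +
# ner pipeline loads. "bert-base-cased" is an example Hugging Face model name;
# the generated config above uses "roberta-base".
config = spacy.util.load_config(
    "full-config.cfg",
    overrides={"components.transformer.model.name": "bert-base-cased"},
)
nlp = spacy.util.load_model_from_config(config, auto_fill=True)
print(nlp.pipe_names)  # ['transformer', 'ner']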

Step 4: Train

Finally, you'll need to run the train command from spaCy to train the transformer model.

python -m spacy train full-config.cfg --paths.train ./corpus/train.spacy --paths.dev ./corpus/dev.spacy

Note that I'm manually specifying the --paths.train and --paths.dev parameters here. These fill in the train and dev values under [paths] in full-config.cfg, which are still set to null. Here's what the output looked like on my machine:

ℹ No output directory provided
ℹ Using CPU

=========================== Initializing pipeline ===========================
[2022-09-13 10:39:51,656] [INFO] Set up nlp object from config
[2022-09-13 10:39:51,664] [INFO] Pipeline: ['transformer', 'ner']
[2022-09-13 10:39:51,667] [INFO] Created vocabulary
[2022-09-13 10:39:51,668] [INFO] Finished initializing nlp object
/home/vincent/Development/prodigy-demos/venv/lib/python3.8/site-packages/torch/amp/autocast_mode.py:198: UserWarning: User provided device_type of 'cuda', but CUDA is not available. Disabling
  warnings.warn('User provided device_type of \'cuda\', but CUDA is not available. Disabling')
[2022-09-13 10:40:01,957] [INFO] Initialized pipeline components: ['transformer', 'ner']
✔ Initialized pipeline

============================= Training pipeline =============================
ℹ Pipeline: ['transformer', 'ner']
ℹ Initial learn rate: 0.0
E    #       LOSS TRANS...  LOSS NER  ENTS_F  ENTS_P  ENTS_R  SCORE 
---  ------  -------------  --------  ------  ------  ------  ------
  0       0           7.57      3.52    0.00    0.00    0.00    0.00
200     200         164.92    130.18  100.00  100.00  100.00    1.00
...

You'll notice that I'm getting warnings about the lack of a GPU. That's the main downside of transformer models: without heavy compute resources they really take a long time to train (which is why I personally try to omit them when possible).
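As a final step, you'll probably want to use the trained pipeline. The run above didn't set an output directory, but if you add --output ./training to the train command, spaCy will save a model-best and model-last pipeline that you can load like any other model. A small sketch, assuming that output path:

import spacy

# Load the best pipeline saved during training (assumes the train command was
# run with --output ./training) and run it over new text.
nlp = spacy.load("./training/model-best")
doc = nlp("hi my name is Vincent")
print([(ent.text, ent.label_) for ent in doc.ents])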

Final Notes

You might notice that there are a bunch of commands here. To keep everything consistent, I'd recommend using the spaCy projects feature. You can learn more about it in our docs or in this YouTube video.
