Training a relation extraction component

Hi @stella,

Can we take a few steps back?

The goal is to reproduce Sofie's rel_component with joint training of both ner and relation extraction, correct?

If so, I was able to reproduce it without any issues.

Here are my steps:

Set up project and virtual environment

$ git clone https://github.com/explosion/projects
$ cd projects
# ensure python 3.9 or 3.10 as Prodigy doesn't have wheels for 3.11 yet
$ python3.9 -m venv venv 
$ source venv/bin/activate
(venv) $ which python3
./projects/venv/bin/python3
(venv) $ python3 --version
Python 3.9.16

When creating the virtual environment, you may not have set up the python3.9 alias. You can try python3 instead, but the key is to use either Python 3.9 or Python 3.10 (not, say, Python 3.11): Prodigy doesn't have wheels for Python 3.11 yet.

If you're not familiar with setting up Python aliases, you can find lots of material online.

This confirms that the python3 alias points to my virtual environment: ./projects/venv/bin/python3.
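
If you'd like to double-check from inside the interpreter itself, here's a quick sanity check using only the standard library:

import sys

# Both of these should point into ./projects/venv/
print(sys.executable)  # e.g., ./projects/venv/bin/python3
print(sys.version)     # should start with 3.9 or 3.10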

In the next step, I'll double-check that spaCy and Prodigy are installed into the same venv.

Install Prodigy and check spaCy / Prodigy versions

(venv) $ pip install prodigy -f https://xxxx-xxxx-xxxx-xxxx@download.prodi.gy

[skipping output details]

(venv) $ python3 -m spacy info

============================== Info about spaCy ==============================

spaCy version    3.5.2                         
Location         ./projects/venv/lib/python3.9/site-packages/spacy
Platform         macOS-13.2.1-x86_64-i386-64bit
Python version   3.9.16                        
Pipelines                                      

(venv) $ python3 -m spacy download en_core_web_sm
Collecting en-core-web-sm==3.5.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.5.0/en_core_web_sm-3.5.0-py3-none-any.whl (12.8 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 12.8/12.8 MB 
...
Installing collected packages: en-core-web-sm
Successfully installed en-core-web-sm-3.5.0

[notice] A new release of pip available: 22.3.1 -> 23.0.1
[notice] To update, run: pip install --upgrade pip
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')
(venv) $ python3 -m spacy info                   

============================== Info about spaCy ==============================

spaCy version    3.5.2                         
Location         ./projects/venv/lib/python3.9/site-packages/spacy
Platform         macOS-13.2.1-x86_64-i386-64bit
Python version   3.9.16                        
Pipelines        en_core_web_sm (3.5.0) 

(venv) $ python3 -m prodigy stats

============================== ✨  Prodigy Stats ==============================

Version          1.11.11                       
Location         ./projects/venv/lib/python3.9/site-packages/prodigy
Prodigy Home     ~/.prodigy  
Platform         macOS-13.2.1-x86_64-i386-64bit
Python Version   3.9.16                        
Database Name    SQLite                        
Database Id      sqlite 

As you can see, both spacy and prodigy are installed in ./projects/venv/. I also installed the en_core_web_sm model; it isn't needed to run rel_component, but installing it now confirms that it goes into the same spaCy installation and that its version (3.5.0) is consistent.
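
You can also verify this from Python. Here's a minimal sketch that uses importlib.metadata (standard library) to read the installed versions, so it makes no assumptions about either package's internals:

import importlib.metadata

import prodigy
import spacy

# Both packages should resolve to ./projects/venv/lib/python3.9/site-packages/
print(spacy.__file__)
print(prodigy.__file__)

# Versions as reported by the package metadata (3.5.2 and 1.11.11 in my run)
print(importlib.metadata.version("spacy"))
print(importlib.metadata.version("prodigy"))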

Ensure data/assets are available

(venv) $ cd tutorials/rel_component
(venv) $ python3 -m spacy project assets
ℹ Fetching 1 asset(s)
✔ Asset already exists: ./projects/tutorials/rel_component/assets/annotations.jsonl
(venv) $ python3 -m spacy project run data

==================================== data ====================================
Running command: './projects/venv/bin/python3' ./scripts/parse_data.py assets/annotations.jsonl data/train.spacy data/dev.spacy data/test.spacy
ℹ 102 training sentences from 43 articles, 209/2346 pos instances.
ℹ 27 dev sentences from 5 articles, 56/710 pos instances.
ℹ 20 test sentences from 6 articles, 30/340 pos instances.

You can skip the python3 -m spacy project assets step as the asset is already there.
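
If you want to sanity-check the binary corpora that parse_data.py wrote out, you can load them with spaCy's DocBin. A minimal sketch, run from the rel_component folder:

import spacy
from spacy.tokens import DocBin

nlp = spacy.blank("en")
doc_bin = DocBin().from_disk("data/train.spacy")
docs = list(doc_bin.get_docs(nlp.vocab))

# Compare against the counts reported by the data command above
print(len(docs))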

Update project.yml with the new config file

Now, you'll need to manually add this file, rel_joint.cfg, into your projects/tutorials/rel_component/configs folder:

[paths]
train = null
dev = null
vectors = null
init_tok2vec = null

[system]
seed = 342
gpu_allocator = null

[nlp]
lang = "en"
pipeline = ["tok2vec","ner","relation_extractor"]
disabled = []
before_creation = null
after_creation = null
after_pipeline_creation = null
tokenizer = {"@tokenizers":"spacy.Tokenizer.v1"}
batch_size = 1000

[components]

[components.ner]
factory = "ner"
incorrect_spans_key = "incorrect_spans"
moves = null
scorer = {"@scorers":"spacy.ner_scorer.v1"}
update_with_oracle_cut_size = 100

[components.ner.model]
@architectures = "spacy.TransitionBasedParser.v2"
state_type = "ner"
extra_state_tokens = false
hidden_width = 64
maxout_pieces = 2
use_upper = true
nO = null

[components.ner.model.tok2vec]
@architectures = "spacy.Tok2VecListener.v1"
width = 96
upstream = "*"

[components.tok2vec]
factory = "tok2vec"

[components.tok2vec.model]
@architectures = "spacy.HashEmbedCNN.v1"
pretrained_vectors = null
width = 96
depth = 2
embed_size = 2000
window_size = 1
maxout_pieces = 3
subword_features = true

[components.relation_extractor]
factory = "relation_extractor"
threshold = 0.5

[components.relation_extractor.model]
@architectures = "rel_model.v1"

[components.relation_extractor.model.create_instance_tensor]
@architectures = "rel_instance_tensor.v1"

[components.relation_extractor.model.create_instance_tensor.tok2vec]
@architectures = "spacy.Tok2VecListener.v1"
width = ${components.tok2vec.model.width}

[components.relation_extractor.model.create_instance_tensor.pooling]
@layers = "reduce_mean.v1"

[components.relation_extractor.model.create_instance_tensor.get_instances]
@misc = "rel_instance_generator.v1"
max_length = 100

[components.relation_extractor.model.classification_layer]
@architectures = "rel_classification_layer.v1"
nI = null
nO = null

[corpora]

[corpora.dev]
@readers = "spacy.Corpus.v1"
path = ${paths.dev}
max_length = 0
gold_preproc = false
limit = 0
augmenter = null

[corpora.train]
@readers = "spacy.Corpus.v1"
path = ${paths.train}
max_length = 0
gold_preproc = false
limit = 0
augmenter = null

[training]
seed = ${system.seed}
gpu_allocator = ${system.gpu_allocator}
dropout = 0.1
accumulate_gradient = 1
patience = 1600000
max_epochs = 0
max_steps = 10000
eval_frequency = 500
frozen_components = []
dev_corpus = "corpora.dev"
train_corpus = "corpora.train"
before_to_disk = null
annotating_components = ["ner"]
logger = {"@loggers":"spacy.ConsoleLogger.v1"}

[training.batcher]
@batchers = "spacy.batch_by_words.v1"
discard_oversize = false
tolerance = 0.2
get_length = null

[training.batcher.size]
@schedules = "compounding.v1"
start = 100
stop = 1000
compound = 1.001

[training.optimizer]
@optimizers = "Adam.v1"
beta1 = 0.9
beta2 = 0.999
L2_is_weight_decay = true
L2 = 0.01
grad_clip = 1.0
use_averages = false
eps = 0.00000001
learn_rate = 0.001

[training.score_weights]
rel_micro_p = 0.0
rel_micro_r = 0.0
rel_micro_f = 1.0

[initialize]
vectors = ${paths.vectors}
init_tok2vec = ${paths.init_tok2vec}
vocab_data = null
lookups = null
before_init = null
after_init = null

[initialize.components]

[initialize.tokenizer]
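
Before wiring it into the project, you can check that the file at least parses. A small sketch; note that spacy.util.load_config only parses the config and doesn't resolve the custom rel_* registry entries (those live in ./scripts and are only needed at training time):

from spacy import util

config = util.load_config("configs/rel_joint.cfg")
print(config["nlp"]["pipeline"])  # ['tok2vec', 'ner', 'relation_extractor']

For a full validation that also resolves the custom functions, python -m spacy debug config configs/rel_joint.cfg -c ./scripts/custom_functions.py should work as well.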

Then, you need to update your rel_component/project.yml file to include commands for joint training:

title: "Example project of creating a novel nlp component to do relation extraction from scratch."
description: "This example project shows how to implement a spaCy component with a custom Machine Learning model, how to train it with and without a transformer, and how to apply it on an evaluation dataset."

# Variables can be referenced across the project.yml using ${vars.var_name}
vars:
  annotations: "assets/annotations.jsonl"
  tok2vec_config: "configs/rel_tok2vec.cfg"
  trf_config: "configs/rel_trf.cfg"
  joint_config: "configs/rel_joint.cfg"
  train_file: "data/train.spacy"
  dev_file: "data/dev.spacy"
  test_file: "data/test.spacy"
  trained_model: "training/model-best"

# These are the directories that the project needs. The project CLI will make
# sure that they always exist.
directories: ["scripts", "configs", "assets", "data", "training"]

# Assets that should be downloaded or available in the directory. You can replace
# this with your own input data.
assets:
    - dest: ${vars.annotations}
      description: "Gold-standard REL annotations created with Prodigy"

workflows:
  all:
    - data
    - train_cpu
    - evaluate
  all_gpu:
    - data
    - train_gpu
    - evaluate

# Project commands, specified in a style similar to CI config files (e.g. Azure
# pipelines). The name is the command name that lets you trigger the command
# via "spacy project run [command] [path]". The help message is optional and
# shown when executing "spacy project run [optional command] [path] --help".
commands:
  - name: "data"
    help: "Parse the gold-standard annotations from the Prodigy annotations."
    script:
      - "python ./scripts/parse_data.py ${vars.annotations} ${vars.train_file} ${vars.dev_file} ${vars.test_file}"
    deps:
      - ${vars.annotations}
    outputs:
      - ${vars.train_file}
      - ${vars.dev_file}
      - ${vars.test_file}

  - name: "train_cpu"
    help: "Train the REL model on the CPU and evaluate on the dev corpus."
    script:
      - "python -m spacy train ${vars.tok2vec_config} --output training --paths.train ${vars.train_file} --paths.dev ${vars.dev_file} -c ./scripts/custom_functions.py"
    deps:
      - ${vars.train_file}
      - ${vars.dev_file}
    outputs:
      - ${vars.trained_model}

  - name: "train_joint_cpu"
    help: "Jointly train the NER and REL model on the CPU and evaluate on the dev corpus."
    script:
      - "python -m spacy train ${vars.joint_config} --output training --paths.train ${vars.train_file} --paths.dev ${vars.dev_file} -c ./scripts/custom_functions.py"
    deps:
      - ${vars.train_file}
      - ${vars.dev_file}
    outputs:
      - ${vars.trained_model}

  - name: "train_gpu"
    help: "Train the REL model with a Transformer on a GPU and evaluate on the dev corpus."
    script:
      - "python -m spacy train ${vars.trf_config} --output training --paths.train ${vars.train_file} --paths.dev ${vars.dev_file} -c ./scripts/custom_functions.py --gpu-id 0"
    deps:
      - ${vars.train_file}
      - ${vars.dev_file}
    outputs:
      - ${vars.trained_model}

  - name: "evaluate"
    help: "Apply the best model to new, unseen text, and measure accuracy at different thresholds."
    script:
      - "python ./scripts/evaluate.py ${vars.trained_model} ${vars.test_file} False"
    deps:
      - ${vars.trained_model}
      - ${vars.test_file}


  - name: "clean"
    help: "Remove intermediate files to start data preparation and training from a clean slate."
    script:
      - "rm -rf data/*"
      - "rm -rf training/*"

The two additions were (1) adding joint_config: "configs/rel_joint.cfg" under vars and (2) adding the train_joint_cpu command.

Run joint training

Note the Running command: './projects/venv/bin/python3' -m spacy train line in the output below; it confirms that training runs on the venv we set up.

(venv) $ python3 -m spacy project run train_joint_cpu

============================== train_joint_cpu ==============================
Running command: './projects/venv/bin/python3' -m spacy train configs/rel_joint.cfg --output training --paths.train data/train.spacy --paths.dev data/dev.spacy -c ./scripts/custom_functions.py
ℹ Saving to output directory: training
ℹ Using CPU

=========================== Initializing pipeline ===========================
[2023-04-13 07:55:58,478] [INFO] Set up nlp object from config
[2023-04-13 07:55:58,486] [INFO] Pipeline: ['tok2vec', 'ner', 'relation_extractor']
[2023-04-13 07:55:58,488] [INFO] Created vocabulary
[2023-04-13 07:55:58,489] [INFO] Finished initializing nlp object
[2023-04-13 07:55:58,685] [INFO] Initialized pipeline components: ['tok2vec', 'ner', 'relation_extractor']
✔ Initialized pipeline

============================= Training pipeline =============================
ℹ Pipeline: ['tok2vec', 'ner', 'relation_extractor']
ℹ Set annotations on update for: ['ner']
ℹ Initial learn rate: 0.001
E    #       LOSS TOK2VEC  LOSS NER  LOSS RELAT...  ENTS_F  ENTS_P  ENTS_R  REL_MICRO_P  REL_MICRO_R  REL_MICRO_F  SCORE 
---  ------  ------------  --------  -------------  ------  ------  ------  -----------  -----------  -----------  ------
ℹ Could not determine any instances in doc.
  0       0          0.00     37.00           0.00    0.00    0.00    0.00         0.00         0.00         0.00    0.00
ℹ Could not determine any instances in doc.
ℹ Could not determine any instances in doc.
.
[skipping a lot of output]
.
ℹ Could not determine any instances in doc.
2176   10000      50322.12   6468.84           0.00   61.31   91.30   46.15        22.22        50.00        30.77    0.46
✔ Saved pipeline to output directory
training/model-last

As I discussed earlier, you can ignore the Could not determine any instances in doc. messages, as those come from docs that contain no trainable relation instances.
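
For intuition, the instance generator (rel_instance_generator.v1, defined in ./scripts) builds candidate entity pairs per doc, roughly like this simplified sketch (not the exact tutorial code):

def get_candidate_instances(doc, max_length=100):
    # Pair up distinct entities that are at most max_length tokens apart.
    # A doc with fewer than two entities yields no pairs, which is what
    # triggers the "Could not determine any instances in doc." message.
    instances = []
    for ent1 in doc.ents:
        for ent2 in doc.ents:
            if ent1 != ent2 and abs(ent1.start - ent2.start) <= max_length:
                instances.append((ent1, ent2))
    return instances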

I'm skipping the evaluate command because it was written for the relations only. I'll leave it as an exercise for you to modify it, or to create a separate script, for evaluating the ner component; a sketch follows below.
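
If you want a starting point, here's a minimal sketch for scoring just the entity predictions on the test set. It assumes the same custom-code imports as the next section; you may need to adapt it if the relation scorer complains about missing gold annotations:

import spacy
from spacy.tokens import DocBin
from spacy.training import Example

# Import the custom code so spacy.load can resolve the relation_extractor factory
import scripts.rel_model
from scripts.rel_pipe import make_relation_extractor

nlp = spacy.load("training/model-best")
doc_bin = DocBin().from_disk("data/test.spacy")
examples = [
    Example(nlp.make_doc(gold.text), gold)
    for gold in doc_bin.get_docs(nlp.vocab)
]

scores = nlp.evaluate(examples)
print({k: v for k, v in scores.items() if k.startswith("ents_")})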

Using your new model

(venv) $ python3
Python 3.9.16 (main, Mar 29 2023, 14:53:38) 
[Clang 14.0.0 (clang-1400.0.29.202)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import spacy
>>> import scripts.rel_model
>>> from scripts.rel_pipe import make_relation_extractor
>>> nlp = spacy.load("training/model-best")
>>> doc = nlp("However, BMP-6 did not induce significant changes in the protein expression of Id2 and Id3.")
>>> doc._.rel
{(2, 13): {'Regulates': 0.645058, 'Binds': 0.93420774}, (2, 15): {'Regulates': 0.1442759, 'Binds': 0.07510549}, (13, 2): {'Regulates': 0.9587375, 'Binds': 0.8925074}, (13, 15): {'Regulates': 0.25181648, 'Binds': 0.08782529}, (15, 2): {'Regulates': 0.96757656, 'Binds': 0.4590491}, (15, 13): {'Regulates': 0.8233059, 'Binds': 0.63244414}}
>>> doc.ents
(BMP-6, Id2, Id3)
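
The keys in doc._.rel are the start-token offsets of the two entities in each candidate pair (here, BMP-6 starts at token 2, Id2 at 13, Id3 at 15). A small sketch to print only the relations above the 0.5 threshold, with the entity texts filled in, continuing from the session above:

# Map start-token offsets back to the entity spans
ents_by_start = {ent.start: ent for ent in doc.ents}
for (head, tail), preds in doc._.rel.items():
    for label, score in preds.items():
        if score >= 0.5:  # same threshold as in the config
            print(f"{ents_by_start[head].text} -[{label}]-> {ents_by_start[tail].text} ({score:.2f})")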

:tada:

For your last question: I had previously misinterpreted it. I thought you were asking why rel performance decreased when ner was added for joint training.

I see a bit of what you're saying: using the config above and running only ner (i.e., changing pipeline = ["tok2vec","ner","relation_extractor"] to pipeline = ["tok2vec","ner"] so that only ner trains) does seem to improve ner's performance.

Can you post that question on spaCy's GitHub discussions, since it's a spaCy-specific question? I think you need their expertise, as it could be something small I missed or something I need to optimize in my config file.

This forum is really for Prodigy questions, and a lot of your questions are spaCy problems. I've tried my best to answer them so you don't have to go back and forth, but I think this would be a good point to hand off so the spaCy team can focus on that specific problem.

Hope this helps!
