Hi @stella,
Can we take a few steps back?
The goal is to reproduce Sofie's rel_component with joint training of both ner and relations, correct?
If so, I was able to reproduce it without any issues.
Here are my steps:
Setup project and virtual environment
$ git clone https://github.com/explosion/projects
$ cd projects
# ensure python 3.9 or 3.10 as Prodigy doesn't have wheels for 3.11 yet
$ python3.9 -m venv venv
$ source venv/bin/activate
(venv) $ which python3
./projects/venv/bin/python3
(venv) $ python3 --version
Python 3.9.16
When creating the virtual environment, you may not have the python3.9 alias set up. You can try python3 instead, but the key is to make sure you're using either Python 3.9 or Python 3.10 (not, say, Python 3.11). Prodigy doesn't have wheels for Python 3.11 yet.
If you're not familiar with setting up Python aliases, you can find lots of material online (a quick search will turn up plenty).
The which python3 output above confirms that the python3 alias points to my virtual environment, ./projects/venv/bin/python3.
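If you'd rather check this from inside Python, here's a minimal sketch using only the standard library. The helper names (is_supported, in_virtualenv) are mine, and the set of supported versions is just the 3.9 / 3.10 constraint mentioned above, not something read from Prodigy itself.

```python
import sys

# Versions this walkthrough assumes Prodigy has wheels for (see the note above).
SUPPORTED_MINORS = {(3, 9), (3, 10)}


def is_supported(version_info=sys.version_info):
    """Return True if the interpreter is Python 3.9 or 3.10."""
    return (version_info[0], version_info[1]) in SUPPORTED_MINORS


def in_virtualenv(prefix=sys.prefix, base_prefix=sys.base_prefix):
    """Inside a venv, sys.prefix points at the venv while base_prefix stays at the base install."""
    return prefix != base_prefix


if __name__ == "__main__":
    print("python:", sys.version.split()[0], "| supported:", is_supported())
    print("prefix:", sys.prefix, "| in venv:", in_virtualenv())
```

Run it with the same python3 you plan to use for training; if either check comes back False, it's worth recreating the venv before going further.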
In the next step, I'll double-check that spaCy and Prodigy are both installed in the same venv.
Install Prodigy and check spaCy / Prodigy versions
(venv) $ pip install prodigy -f https://xxxx-xxxx-xxxx-xxxx@download.prodi.gy
[skipping output details]
(venv) $ python3 -m spacy info
============================== Info about spaCy ==============================
spaCy version 3.5.2
Location ./projects/venv/lib/python3.9/site-packages/spacy
Platform macOS-13.2.1-x86_64-i386-64bit
Python version 3.9.16
Pipelines
(venv) $ python3 -m spacy download en_core_web_sm
Collecting en-core-web-sm==3.5.0
Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.5.0/en_core_web_sm-3.5.0-py3-none-any.whl (12.8 MB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 12.8/12.8 MB
...
Installing collected packages: en-core-web-sm
Successfully installed en-core-web-sm-3.5.0
[notice] A new release of pip available: 22.3.1 -> 23.0.1
[notice] To update, run: pip install --upgrade pip
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')
(venv) $ python3 -m spacy info
============================== Info about spaCy ==============================
spaCy version 3.5.2
Location ./projects/venv/lib/python3.9/site-packages/spacy
Platform macOS-13.2.1-x86_64-i386-64bit
Python version 3.9.16
Pipelines en_core_web_sm (3.5.0)
(venv) $ python3 -m prodigy stats
============================== ✨ Prodigy Stats ==============================
Version 1.11.11
Location ./projects/venv/lib/python3.9/site-packages/prodigy
Prodigy Home ~/.prodigy
Platform macOS-13.2.1-x86_64-i386-64bit
Python Version 3.9.16
Database Name SQLite
Database Id sqlite
As you can see, spacy and prodigy are both installed in ./projects/venv/. I also installed the en_core_web_sm model. It isn't needed to run rel_component, but I'm doing it here to confirm that it goes into the same spacy installation and that its version (3.5.0) is compatible with the installed spaCy (3.5.2).
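The same check can be scripted. This is a rough sketch, assuming both packages are importable from the interpreter you run it with; package_dir and all_under are hypothetical helper names, and it only compares install folders against sys.prefix.

```python
import importlib.util
import sys
from pathlib import Path


def package_dir(name):
    """Return the install folder of a top-level package, or None if it isn't installed."""
    spec = importlib.util.find_spec(name)
    if spec is None or spec.origin is None:
        return None
    return Path(spec.origin).resolve().parent


def all_under(prefix, paths):
    """True if every path is known (not None) and located inside `prefix`."""
    prefix = Path(prefix).resolve()
    return all(p is not None and prefix in Path(p).resolve().parents for p in paths)


if __name__ == "__main__":
    dirs = {name: package_dir(name) for name in ("spacy", "prodigy")}
    for name, path in dirs.items():
        print(f"{name}: {path}")
    print("same environment:", all_under(sys.prefix, dirs.values()))
```

If the last line prints False, one of the two packages is probably being picked up from a different Python installation.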
Ensure data/assets are available
(venv) $ cd tutorials/rel_component
(venv) $ python3 -m spacy project assets
ℹ Fetching 1 asset(s)
✔ Asset already exists: ./projects/tutorials/rel_component/assets/annotations.jsonl
(venv) $ python3 -m spacy project run data
==================================== data ====================================
Running command: './projects/venv/bin/python3' ./scripts/parse_data.py assets/annotations.jsonl data/train.spacy data/dev.spacy data/test.spacy
ℹ 102 training sentences from 43 articles, 209/2346 pos instances.
ℹ 27 dev sentences from 5 articles, 56/710 pos instances.
ℹ 20 test sentences from 6 articles, 30/340 pos instances.
You can skip the python3 -m spacy project assets step as the asset is already there.
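As a side note, the "pos instances" counts in that output give a feel for how sparse the relation labels are. Here's a small sketch of the arithmetic; I'm assuming each pair of numbers means positive instances out of all candidate instances, which is my reading of the log rather than something I checked in parse_data.py.

```python
# (positive instances, total instances) copied from the log output above.
splits = {
    "train": (209, 2346),
    "dev": (56, 710),
    "test": (30, 340),
}


def positive_rate(positives, total):
    """Fraction of instances that are marked positive."""
    return positives / total


rates = {name: positive_rate(*counts) for name, counts in splits.items()}

for name, rate in rates.items():
    print(f"{name}: {rate:.1%} positive")
```

All three splits come out at roughly 8 to 9 percent positive, so the splits look similar to each other in that respect.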
Add rel_joint.cfg and update project.yml
Now, you'll need to manually add this file, rel_joint.cfg, to your projects/tutorials/rel_component/configs folder:
[paths]
train = null
dev = null
vectors = null
init_tok2vec = null
[system]
seed = 342
gpu_allocator = null
[nlp]
lang = "en"
pipeline = ["tok2vec","ner","relation_extractor"]
disabled = []
before_creation = null
after_creation = null
after_pipeline_creation = null
tokenizer = {"@tokenizers":"spacy.Tokenizer.v1"}
batch_size = 1000
[components]
[components.ner]
factory = "ner"
incorrect_spans_key = "incorrect_spans"
moves = null
scorer = {"@scorers":"spacy.ner_scorer.v1"}
update_with_oracle_cut_size = 100
[components.ner.model]
@architectures = "spacy.TransitionBasedParser.v2"
state_type = "ner"
extra_state_tokens = false
hidden_width = 64
maxout_pieces = 2
use_upper = true
nO = null
[components.ner.model.tok2vec]
@architectures = "spacy.Tok2VecListener.v1"
width = 96
upstream = "*"
[components.tok2vec]
factory = "tok2vec"
[components.tok2vec.model]
@architectures = "spacy.HashEmbedCNN.v1"
pretrained_vectors = null
width = 96
depth = 2
embed_size = 2000
window_size = 1
maxout_pieces = 3
subword_features = true
[components.relation_extractor]
factory = "relation_extractor"
threshold = 0.5
[components.relation_extractor.model]
@architectures = "rel_model.v1"
[components.relation_extractor.model.create_instance_tensor]
@architectures = "rel_instance_tensor.v1"
[components.relation_extractor.model.create_instance_tensor.tok2vec]
@architectures = "spacy.Tok2VecListener.v1"
width = ${components.tok2vec.model.width}
[components.relation_extractor.model.create_instance_tensor.pooling]
@layers = "reduce_mean.v1"
[components.relation_extractor.model.create_instance_tensor.get_instances]
@misc = "rel_instance_generator.v1"
max_length = 100
[components.relation_extractor.model.classification_layer]
@architectures = "rel_classification_layer.v1"
nI = null
nO = null
[corpora]
[corpora.dev]
@readers = "spacy.Corpus.v1"
path = ${paths.dev}
max_length = 0
gold_preproc = false
limit = 0
augmenter = null
[corpora.train]
@readers = "spacy.Corpus.v1"
path = ${paths.train}
max_length = 0
gold_preproc = false
limit = 0
augmenter = null
[training]
seed = ${system.seed}
gpu_allocator = ${system.gpu_allocator}
dropout = 0.1
accumulate_gradient = 1
patience = 1600000
max_epochs = 0
max_steps = 10000
eval_frequency = 500
frozen_components = []
dev_corpus = "corpora.dev"
train_corpus = "corpora.train"
before_to_disk = null
annotating_components = ["ner"]
logger = {"@loggers":"spacy.ConsoleLogger.v1"}
[training.batcher]
@batchers = "spacy.batch_by_words.v1"
discard_oversize = false
tolerance = 0.2
get_length = null
[training.batcher.size]
@schedules = "compounding.v1"
start = 100
stop = 1000
compound = 1.001
[training.optimizer]
@optimizers = "Adam.v1"
beta1 = 0.9
beta2 = 0.999
L2_is_weight_decay = true
L2 = 0.01
grad_clip = 1.0
use_averages = false
eps = 0.00000001
learn_rate = 0.001
[training.score_weights]
rel_micro_p = 0.0
rel_micro_r = 0.0
rel_micro_f = 1.0
[initialize]
vectors = ${paths.vectors}
init_tok2vec = ${paths.init_tok2vec}
vocab_data = null
lookups = null
before_init = null
after_init = null
[initialize.components]
[initialize.tokenizer]
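Two things in this config are easy to get wrong when copying it by hand: every name in pipeline needs a matching [components.<name>] section, and annotating_components should only list components that are in the pipeline (that setting is what lets ner set its predictions on the docs during training). Here's a rough sanity check using only the standard library. It's a sketch on a trimmed excerpt, not how spaCy itself loads configs, and check_pipeline is a name I made up.

```python
import configparser
import json

# A trimmed excerpt of rel_joint.cfg, just enough for the checks below.
CFG_EXCERPT = """
[nlp]
lang = "en"
pipeline = ["tok2vec","ner","relation_extractor"]

[components]

[components.tok2vec]
factory = "tok2vec"

[components.ner]
factory = "ner"

[components.relation_extractor]
factory = "relation_extractor"
threshold = 0.5

[training]
annotating_components = ["ner"]
"""


def check_pipeline(cfg_text):
    """Return a list of problems found; an empty list means the checks passed."""
    # interpolation=None keeps values such as ${paths.train} as plain text.
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(cfg_text)
    # The list values in the config happen to be valid JSON.
    pipeline = json.loads(parser["nlp"]["pipeline"])
    annotating = json.loads(parser["training"]["annotating_components"])
    problems = []
    for name in pipeline:
        if not parser.has_section(f"components.{name}"):
            problems.append(f"no [components.{name}] section")
    for name in annotating:
        if name not in pipeline:
            problems.append(f"annotating component {name!r} is not in the pipeline")
    return problems


print(check_pipeline(CFG_EXCERPT))
```

Passing the text of the real configs/rel_joint.cfg to check_pipeline should work the same way, but spaCy's own config loading is the authority on what's actually valid.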
Then, you need to update your rel_component/project.yml file to include commands for joint training:
title: "Example project of creating a novel nlp component to do relation extraction from scratch."
description: "This example project shows how to implement a spaCy component with a custom Machine Learning model, how to train it with and without a transformer, and how to apply it on an evaluation dataset."
# Variables can be referenced across the project.yml using ${vars.var_name}
vars:
  annotations: "assets/annotations.jsonl"
  tok2vec_config: "configs/rel_tok2vec.cfg"
  trf_config: "configs/rel_trf.cfg"
  joint_config: "configs/rel_joint.cfg"
  train_file: "data/train.spacy"
  dev_file: "data/dev.spacy"
  test_file: "data/test.spacy"
  trained_model: "training/model-best"
# These are the directories that the project needs. The project CLI will make
# sure that they always exist.
directories: ["scripts", "configs", "assets", "data", "training"]
# Assets that should be downloaded or available in the directory. You can replace
# this with your own input data.
assets:
  - dest: ${vars.annotations}
    description: "Gold-standard REL annotations created with Prodigy"
workflows:
  all:
    - data
    - train_cpu
    - evaluate
  all_gpu:
    - data
    - train_gpu
    - evaluate
# Project commands, specified in a style similar to CI config files (e.g. Azure
# pipelines). The name is the command name that lets you trigger the command
# via "spacy project run [command] [path]". The help message is optional and
# shown when executing "spacy project run [optional command] [path] --help".
commands:
  - name: "data"
    help: "Parse the gold-standard annotations from the Prodigy annotations."
    script:
      - "python ./scripts/parse_data.py ${vars.annotations} ${vars.train_file} ${vars.dev_file} ${vars.test_file}"
    deps:
      - ${vars.annotations}
    outputs:
      - ${vars.train_file}
      - ${vars.dev_file}
      - ${vars.test_file}
  - name: "train_cpu"
    help: "Train the REL model on the CPU and evaluate on the dev corpus."
    script:
      - "python -m spacy train ${vars.tok2vec_config} --output training --paths.train ${vars.train_file} --paths.dev ${vars.dev_file} -c ./scripts/custom_functions.py"
    deps:
      - ${vars.train_file}
      - ${vars.dev_file}
    outputs:
      - ${vars.trained_model}
  - name: "train_joint_cpu"
    help: "Jointly train the NER and REL model on the CPU and evaluate on the dev corpus."
    script:
      - "python -m spacy train ${vars.joint_config} --output training --paths.train ${vars.train_file} --paths.dev ${vars.dev_file} -c ./scripts/custom_functions.py"
    deps:
      - ${vars.train_file}
      - ${vars.dev_file}
    outputs:
      - ${vars.trained_model}
  - name: "train_gpu"
    help: "Train the REL model with a Transformer on a GPU and evaluate on the dev corpus."
    script:
      - "python -m spacy train ${vars.trf_config} --output training --paths.train ${vars.train_file} --paths.dev ${vars.dev_file} -c ./scripts/custom_functions.py --gpu-id 0"
    deps:
      - ${vars.train_file}
      - ${vars.dev_file}
    outputs:
      - ${vars.trained_model}
  - name: "evaluate"
    help: "Apply the best model to new, unseen text, and measure accuracy at different thresholds."
    script:
      - "python ./scripts/evaluate.py ${vars.trained_model} ${vars.test_file} False"
    deps:
      - ${vars.trained_model}
      - ${vars.test_file}
  - name: "clean"
    help: "Remove intermediate files to start data preparation and training from a clean slate."
    script:
      - "rm -rf data/*"
      - "rm -rf training/*"
The two additions were (1) adding joint_config: "configs/rel_joint.cfg" in vars and (2) adding the name: "train_joint_cpu" command.
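To see how the two additions fit together, here's a small illustration of how the ${vars.…} placeholders in the new command's script line resolve against the vars block. It's a sketch with a plain regex substitution, not spaCy's actual implementation, and fill_vars is a made-up helper name.

```python
import re

# The relevant entries from the `vars` block of project.yml.
project_vars = {
    "joint_config": "configs/rel_joint.cfg",
    "train_file": "data/train.spacy",
    "dev_file": "data/dev.spacy",
}

# The script line of the new train_joint_cpu command.
script = (
    "python -m spacy train ${vars.joint_config} --output training "
    "--paths.train ${vars.train_file} --paths.dev ${vars.dev_file} "
    "-c ./scripts/custom_functions.py"
)


def fill_vars(template, values):
    """Replace each ${vars.name} placeholder with its value (KeyError if undefined)."""
    return re.sub(r"\$\{vars\.(\w+)\}", lambda m: values[m.group(1)], template)


command = fill_vars(script, project_vars)
print(command)
```

If joint_config were missing from vars, the lookup in this sketch would fail with a KeyError, which is why both additions are needed.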
Run joint training
Note the line near the top of the output, Running command: './projects/venv/bin/python3' -m spacy train, which shows that training is running in the venv we set up.
(venv) $ python3 -m spacy project run train_joint_cpu
============================== train_joint_cpu ==============================
Running command: './projects/venv/bin/python3' -m spacy train configs/rel_joint.cfg --output training --paths.train data/train.spacy --paths.dev data/dev.spacy -c ./scripts/custom_functions.py
ℹ Saving to output directory: training
ℹ Using CPU
=========================== Initializing pipeline ===========================
[2023-04-13 07:55:58,478] [INFO] Set up nlp object from config
[2023-04-13 07:55:58,486] [INFO] Pipeline: ['tok2vec', 'ner', 'relation_extractor']
[2023-04-13 07:55:58,488] [INFO] Created vocabulary
[2023-04-13 07:55:58,489] [INFO] Finished initializing nlp object
[2023-04-13 07:55:58,685] [INFO] Initialized pipeline components: ['tok2vec', 'ner', 'relation_extractor']
✔ Initialized pipeline
============================= Training pipeline =============================
ℹ Pipeline: ['tok2vec', 'ner', 'relation_extractor']
ℹ Set annotations on update for: ['ner']
ℹ Initial learn rate: 0.001
E # LOSS TOK2VEC LOSS NER LOSS RELAT... ENTS_F ENTS_P ENTS_R REL_MICRO_P REL_MICRO_R REL_MICRO_F SCORE
--- ------ ------------ -------- ------------- ------ ------ ------ ----------- ----------- ----------- ------
ℹ Could not determine any instances in doc.
0 0 0.00 37.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
ℹ Could not determine any instances in doc.
ℹ Could not determine any instances in doc.
.
[skipping a lot of output]
.
ℹ Could not determine any instances in doc.
2176 10000 50322.12 6468.84 0.00 61.31 91.30 46.15 22.22 50.00 30.77 0.46
✔ Saved pipeline to output directory
training/model-last
As I discussed earlier, you can ignore the Could not determine any instances in doc. messages, as those come from docs that don't contain any trainable examples.
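If you want to sanity-check the numbers in the table, the F columns are the harmonic mean of the matching P and R columns. Here's a quick sketch that recomputes them from the final row; f_score is just a helper name for the standard F1 formula.

```python
def f_score(precision, recall):
    """Harmonic mean of precision and recall (F1); returns 0.0 when both are 0."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision / recall values from the final row (step 10000) of the table above.
ents_f = f_score(91.30, 46.15)
rel_micro_f = f_score(22.22, 50.00)

print(f"ENTS_F      = {ents_f:.2f}")
print(f"REL_MICRO_F = {rel_micro_f:.2f}")
```

Both values round to what the table reports (61.31 and 30.77).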
I'm skipping the evaluate command because it was written for the relations only. I'll leave it as an exercise for you to work out how to modify it, or to create a separate script for the ner part.
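As a starting point for that exercise, here's a minimal sketch of exact-match entity scoring. It assumes you've already collected gold and predicted entities as (start, end, label) tuples; the spans and the "GENE" label below are made up for illustration, and spaCy's own scoring would probably be the better choice in a real script.

```python
def ner_prf(gold_spans, pred_spans):
    """Exact-match precision, recall and F1 over sets of (start, end, label) spans."""
    gold, pred = set(gold_spans), set(pred_spans)
    true_pos = len(gold & pred)
    precision = true_pos / len(pred) if pred else 0.0
    recall = true_pos / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Toy spans (hypothetical): token offsets plus an entity label.
gold = [(2, 3, "GENE"), (13, 14, "GENE"), (15, 16, "GENE")]
pred = [(2, 3, "GENE"), (13, 14, "GENE"), (8, 9, "GENE")]

p, r, f = ner_prf(gold, pred)
print(f"P={p:.2f} R={r:.2f} F={f:.2f}")
```

With a loaded pipeline, the predicted side could be built from doc.ents, e.g. {(e.start, e.end, e.label_) for e in doc.ents}, and the gold side from the docs in data/test.spacy.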
Using your new model
(venv) $ python3
Python 3.9.16 (main, Mar 29 2023, 14:53:38)
[Clang 14.0.0 (clang-1400.0.29.202)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import spacy
>>> import scripts.rel_model
>>> from scripts.rel_pipe import make_relation_extractor
>>> nlp = spacy.load("training/model-best")
>>> doc = nlp("However, BMP-6 did not induce significant changes in the protein expression of Id2 and Id3.")
>>> doc._.rel
{(2, 13): {'Regulates': 0.645058, 'Binds': 0.93420774}, (2, 15): {'Regulates': 0.1442759, 'Binds': 0.07510549}, (13, 2): {'Regulates': 0.9587375, 'Binds': 0.8925074}, (13, 15): {'Regulates': 0.25181648, 'Binds': 0.08782529}, (15, 2): {'Regulates': 0.96757656, 'Binds': 0.4590491}, (15, 13): {'Regulates': 0.8233059, 'Binds': 0.63244414}}
>>> doc.ents
(BMP-6, Id2, Id3)
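The doc._.rel dictionary holds a score for every label on every entity pair, so you'll usually want to filter it. Here's a small sketch that keeps only scores at or above 0.5 (the threshold value from rel_joint.cfg). The keys look like pairs of token indices for the two entities (2, 13 and 15 line up with BMP-6, Id2 and Id3 if you count tokens), but treat that reading as my assumption; relations_above is a made-up helper name.

```python
# The doc._.rel output from the session above, copied verbatim.
rel = {
    (2, 13): {"Regulates": 0.645058, "Binds": 0.93420774},
    (2, 15): {"Regulates": 0.1442759, "Binds": 0.07510549},
    (13, 2): {"Regulates": 0.9587375, "Binds": 0.8925074},
    (13, 15): {"Regulates": 0.25181648, "Binds": 0.08782529},
    (15, 2): {"Regulates": 0.96757656, "Binds": 0.4590491},
    (15, 13): {"Regulates": 0.8233059, "Binds": 0.63244414},
}


def relations_above(rel_scores, threshold=0.5):
    """Return (head, tail, label, score) tuples scoring at or above `threshold`, best first."""
    hits = [
        (head, tail, label, score)
        for (head, tail), labels in rel_scores.items()
        for label, score in labels.items()
        if score >= threshold
    ]
    return sorted(hits, key=lambda hit: hit[3], reverse=True)


for head, tail, label, score in relations_above(rel):
    print(f"{head:>2} -> {tail:<2} {label:<9} {score:.3f}")
```

In a live session you'd pass doc._.rel directly instead of the copied dictionary, and you can raise the threshold if you only want the most confident pairs.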
On your last question: I had previously misinterpreted it. I thought you were asking why rel performance decreased when adding ner joint training.
I see a bit of what you're saying: using the config above and only running ner (i.e., changing pipeline = ["tok2vec","ner","relation_extractor"] to pipeline = ["tok2vec","ner"] so only ner trains) does seem to improve the performance of ner.
Can you post that question on spaCy's GitHub discussions, since it's a spaCy-specific question? I think you need their expertise, as it could be something small I missed or need to optimize in my config file.
This forum is really for Prodigy questions, and a lot of your questions are spaCy problems. I've tried my best to answer them so you don't have to go back and forth, but I think this would be a good point to hand over so they can focus on that specific problem.
Hope this helps!