Training a relation extraction component

Okay so I obtain exactly the same error with the sourcing method on Sofie's annotations.

It's:

/.local/lib/python3.10/site-packages/thinc/layers/reduce_mean.py", line 19, in forward
    Y = model.ops.reduce_mean(cast(Floats2d, Xr.data), Xr.lengths)
  File "thinc/backends/numpy_ops.pyx", line 318, in thinc.backends.numpy_ops.NumpyOps.reduce_mean
AssertionError

Same error with spaCy 3.5.1.

Oh, and, when installing numpy :

ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
prodigy 1.11.9 requires spacy<3.5.0,>=3.1.1, but you have spacy 3.5.1 which is incompatible.
fr-core-news-sm 3.4.0 requires spacy<3.5.0,>=3.4.0, but you have spacy 3.5.1 which is incompatible.
en-core-web-sm 3.4.1 requires spacy<3.5.0,>=3.4.0, but you have spacy 3.5.1 which is incompatible.

There seem to be incompatibilities between the Prodigy, spaCy and thinc versions? I can't resolve this on my own.

I forgot an important update. I can't upgrade on my own because prodigy is incompatible with spacy 3.5.

When updating prodigy, spacy 3.4 is reinstalled by default.

Any advice ? It doesn't seem to work, actually.

hi @stella,

What version of Prodigy are you running?

You can run python -m prodigy stats.

Prodigy v1.11.10 was released to be compatible with spaCy 3.5.

I was able to run the config file I created with Prodigy v1.11.11 and spaCy 3.5.1 without any issues.

It sounds like you've been dealing with several spaCy installs and uninstalls. As a good first step, consider creating a fresh venv with the most recent version of Prodigy so you can install spaCy 3.5.
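Before rebuilding the environment, it can help to print what is actually installed in the interpreter you are using. The following is a minimal sketch using only the standard library; the helper name `installed_versions` is made up for illustration, and the package names are simply the ones mentioned in this thread.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_versions(names):
    """Return {distribution name: version string, or None if not installed}."""
    found = {}
    for name in names:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


if __name__ == "__main__":
    # Packages discussed in this thread; adjust the list as needed.
    for name, ver in installed_versions(["spacy", "thinc", "prodigy"]).items():
        print(f"{name}: {ver or 'not installed'}")
```

Run it with the same `python` or `python3` you use for training, so the answer reflects that environment.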

It's 1.11.9.

Actually, I'm working on a 6-month-old PC, with just one clean install. It looks like 1.11.10 was released just one or two months after my install and was meant to resolve this issue, so no luck for me. :smile:

I'll update everything and see how it works.

Okay, it works !

Weirdly, it trained the whole pipeline, with the ner component, and the performance is way behind my first results.

Could you explain how to deal with doc._.rel ? I obtain something like :
(67, 69)
(69, 67)
But how can you know which relation label and which named entities are represented ?

Thanks

I forgot : I obtained this warning : UserWarning: [W095] Model 'en_pipeline' (0.0.0) was trained with spaCy v3.5 and may not be 100% compatible with the current version (3.4.4). If you see errors or degraded performance, download a newer compatible model or retrain your custom model with the current spaCy version. For more details and available updates, run: python -m spacy validate

Could it be the reason why the performance is not good ? Could it be enhanced ?

Hi @stella,

That's great. I'm not sure what you mean by "way behind". As Sofie's video describes near the end, relation extraction is a hard task, which justifies the need for transformers to improve performance to acceptable levels. I think that training ner and relations jointly makes the task even harder, as you're not just predicting relations: good relation predictions are also contingent on good ner predictions.

As keys, we represent an instance pair by the start offsets of the two entities, which is a unique key within one document.

So those instance pairs (67, 69) are keys that represent the location of the two entities in your relation. Each key then refers to another dictionary that maps each relation label (e.g., "Live", "Visit", "Unrelated") to a score between 0 and 1.

In the example shown above, "Live": 1.0 is the score for the key (0, 6), as it's a directional relationship from token 0 ("Laura") to token 6 ("Boston").
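To make the structure concrete, here is a small sketch in plain Python (no spaCy needed). The dictionary mirrors the "Laura"/"Boston" example; the 0.0 scores for "Visit" and "Unrelated" and the helper name `describe_pair` are assumptions for illustration only. With a real doc, the start-to-text mapping could be built from `doc.ents` using each entity's `start` attribute.

```python
# Keys are (start token of first entity, start token of second entity).
rel = {
    (0, 6): {"Live": 1.0, "Visit": 0.0, "Unrelated": 0.0},  # 0.0 scores are assumed
}

# With a real doc this would be {ent.start: ent.text for ent in doc.ents}.
entity_by_start = {0: "Laura", 6: "Boston"}


def describe_pair(key, scores, entity_by_start):
    """Turn one doc._.rel entry into (first text, best label, score, second text)."""
    first_start, second_start = key
    label = max(scores, key=scores.get)  # label with the highest score
    return (
        entity_by_start[first_start],
        label,
        scores[label],
        entity_by_start[second_start],
    )


for key, scores in rel.items():
    print(describe_pair(key, scores, entity_by_start))
```

Note that the keys are token offsets, not character offsets, so they will not line up with character positions in the raw text.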

Make sure to also update your pretrained spaCy models to 3.5. You can run python -m spacy download en_core_web_xx, modifying xx depending on which model you want to install (see spaCy docs).

As I mentioned before, I think performance would be lower because you're asking for a more difficult task: jointly training ner and relations. Following Sofie's recommendations, you can use transformers, as she shows near the end of her video, but as we've already shown, this will require some customization on your end. I think that it would be a good exercise to try to take our joint config for training and see if you can create a transformers version based on rel_trf.cfg.

Hi Ryan,

It is strange because : I've already updated the base language models. Maybe there is something I'm missing. It looks like the NER model was correctly generated, but the spacy version used in test_project_rel file is different and still the previous one.

What I meant by "way behind" concerns the NER model performance. Before changing the spaCy version, I used to recognize many named entities and most of them were correct (the label was appropriate, etc.). Now, I only recognize two named entities on the same text and the labels are wrong. Is it normal that it affects even the named entity recognition? I understand that relation extraction is a hard task, but is it okay if it affects the NER task?

The strangest thing is that the offsets for the two relations extracted do not correspond to my recognized named entities. The named entities are at characters 338-350 and 352-355. Also, I had to remove the source model: an NER model was generated during the training of the relation model, and I really don't understand why.

I'll try to debug it, but if you have any clue, don't hesitate to let me know.

I still have a strange error :

UserWarning: [W095] Model 'en_pipeline' (0.0.0) was trained with spaCy v3.5 and may not be 100% compatible with the current version (3.5.0). If you see errors or degraded performance, download a newer compatible model or retrain your custom model with the current spaCy version. For more details and available updates, run: python -m spacy validate
  warnings.warn(warn_msg)

I've re-downloaded spaCy's language models and retrained my NER model, and I get this warning when loading the NER model. Is there something I'm missing?

Thanks

I'm not sure I can add more information than the custom warning is already giving :wink:
It looks like the model was trained with a spaCy version that is inconsistent with the one you're loading it with.

Did you run python -m spacy validate?

I would recommend running spacy info as well. This will show your spaCy version and any pretrained models you have installed (and their versions).

Yes, the warning is pretty explicit, but the reason why it's triggered is less. :sweat_smile:

Here is the output :

================= Installed pipeline packages (spaCy v3.5.1) =================
ℹ spaCy installation:
/home//.local/lib/python3.10/site-packages/spacy

NAME              SPACY            VERSION                            
en_core_web_sm    >=3.5.0,<3.6.0   3.5.0   ✔
fr_core_news_sm   >=3.5.0,<3.6.0   3.5.0   ✔

(venv) spacy info

============================== Info about spaCy ==============================

spaCy version    3.5.1                         
Location         /home//.local/lib/python3.10/site-packages/spacy
Platform         Linux-5.19.0-38-generic-x86_64-with-glibc2.35
Python version   3.10.6                        
Pipelines        en_core_web_sm (3.5.0), fr_core_news_sm (3.5.0)

I think there may be an inconsistency between the spaCy I'm using to train the NER model (within a script) and maybe the spaCy listed in the requirements.txt file, which may be used by the test file when loading the model? I may be saying very stupid things.

By the way, what is the difference between v3.5 and 3.5.0 ? :sweat_smile:

Yes, I think you're right. Your spacy info output looks fine.

You may want to add python -m or python3 -m to prefix all of your commands. In these weird situations, you may have different virtual environments/Python versions, and you don't realize that you're pointing to different ones at different times.

You may want to run which python and which python3. Do you notice a different path for either of these? You want the path that points to your virtual environment, namely venv/bin/python. You may find that both point to your virtual environment. This is good. That means you can specify either and they'll both point to the same environment.

If that's the case, can you try everything but with the same consistent python -m prefix?

For example:

python -m spacy info
python -m prodigy stats
python -m spacy project run all
python -m spacy train ...
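A quick way to double-check which interpreter is actually running, independent of what the shell prompt shows, is to ask Python itself. This is a minimal sketch using only the standard library; `in_virtualenv` is a made-up helper name.

```python
import sys


def in_virtualenv():
    """True when the running interpreter belongs to a venv-style virtual environment."""
    # Inside a venv, sys.prefix points at the environment directory, while
    # sys.base_prefix still points at the base Python installation.
    return sys.prefix != sys.base_prefix


if __name__ == "__main__":
    print("executable:", sys.executable)
    print("prefix:    ", sys.prefix)
    print("in venv:   ", in_virtualenv())
```

If this reports False even though the prompt shows (venv), the commands are not running inside that environment.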

That's a good question. I was wondering the same for your outputs. I'll ping someone on the spaCy core team. My suspicion is that when it says v3.5, it really means v3.5.x, which means any patch release of 3.5.
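If that suspicion is right, the comparison would only look at the major and minor parts of the version. The sketch below illustrates that idea in plain Python; it is not spaCy's actual check, and `major_minor` and `same_minor` are hypothetical helpers.

```python
def major_minor(version):
    """Extract (major, minor) from strings like 'v3.5', '3.5.0' or '3.5.1'."""
    parts = version.lstrip("v").split(".")
    return int(parts[0]), int(parts[1])


def same_minor(trained_with, current):
    """True when both versions belong to the same major.minor series."""
    return major_minor(trained_with) == major_minor(current)


print(same_minor("v3.5", "3.5.0"))
print(same_minor("v3.5", "3.4.4"))
```

Under that reading, "v3.5" and "3.5.0" belong to the same series, while "v3.5" and "3.4.4" (the versions in the first warning in this thread) do not.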

When running which python or python3 :

(venv) @:~/$ which python
/usr/bin/python

(venv) @:~$ which python3
/usr/bin/python3

Is it correct ? So even if venv is activated (as (venv) is on the beginning of the line), it is not used ?

For Python3

============================== Info about spaCy ==============================

spaCy version    3.5.1                         
Location         /home//.local/lib/python3.10/site-packages/spacy
Platform         Linux-5.19.0-38-generic-x86_64-with-glibc2.35
Python version   3.10.6                        
Pipelines        en_core_web_sm (3.5.0), fr_core_news_sm (3.5.0)

Version          1.11.11                       
Location         /home//.local/lib/python3.10/site-packages/prodigy
Prodigy Home     /home//.prodigy         
Platform         Linux-5.19.0-38-generic-x86_64-with-glibc2.35
Python Version   3.10.6                        
Database Name    SQLite                        
Database Id      sqlite                        
Total Datasets   4                             
Total Sessions   25  

For Python :

============================== Info about spaCy ==============================

spaCy version    3.5.1                         
Location         /home//.local/lib/python3.10/site-packages/spacy
Platform         Linux-5.19.0-38-generic-x86_64-with-glibc2.35
Python version   3.10.6                        
Pipelines        en_core_web_sm (3.5.0), fr_core_news_sm (3.5.0)

============================== ✨  Prodigy Stats ==============================

Version          1.11.11                       
Location         /home//.local/lib/python3.10/site-packages/prodigy
Prodigy Home     /home//.prodigy         
Platform         Linux-5.19.0-38-generic-x86_64-with-glibc2.35
Python Version   3.10.6                        
Database Name    SQLite                        
Database Id      sqlite                        
Total Datasets   4                             
Total Sessions   25    

So it looks like they are exactly the same.

To be sure, I've trained a model for both prefixes (consistent usage of the prefix in each model, naturally).

First, only the NER model. The training is done from a bash script that contains for example :

#!/bin/bash
python3 -m prodigy train ...

(not showing the specific details, the command line is correct and works)

Then I've tried to load them (only the NER model first, then as source for the rel component)

By the way, importing a language in the test file works without triggering anything.

Both NER models work really well (very nice performance) but already trigger this warning:

UserWarning: [W095] Model 'en_pipeline' (0.0.0) was trained with spaCy v3.5 and may not be 100% compatible with the current version (3.5.0). If you see errors or degraded performance, download a newer compatible model or retrain your custom model with the current spaCy version. For more details and available updates, run: python -m spacy validate
  warnings.warn(warn_msg)

The output of python3 -m spacy validate :

python3 -m spacy validate
✔ Loaded compatibility table

================= Installed pipeline packages (spaCy v3.5.1) =================
ℹ spaCy installation:
/home//.local/lib/python3.10/site-packages/spacy

NAME              SPACY            VERSION                            
en_core_web_sm    >=3.5.0,<3.6.0   3.5.0   ✔
fr_core_news_sm   >=3.5.0,<3.6.0   3.5.0   ✔

The output of python -m spacy validate :

(venv) @:~/$ python -m spacy validate
✔ Loaded compatibility table

================= Installed pipeline packages (spaCy v3.5.1) =================
ℹ spaCy installation:
/home//.local/lib/python3.10/site-packages/spacy

NAME              SPACY            VERSION                            
en_core_web_sm    >=3.5.0,<3.6.0   3.5.0   ✔
fr_core_news_sm   >=3.5.0,<3.6.0   3.5.0   ✔

Let's say the warning is triggered for no reason (as v3.5 should include 3.5.0). Now I'm still trying to work with the rel component, hoping the warning is harmless.

I'm trying to source both NER models separately to see how it works.

I'm using a bash script to train the rel model. It contains :

#!/bin/bash
python3 -m spacy project run clean
python3 -m spacy project run data
python3 -m spacy project run train_cpu
python3 -m spacy project run evaluate

After deleting the contents of the data and training folders and then executing the script, an error is triggered with both prefixes:

  File "thinc/backends/numpy_ops.pyx", line 318, in thinc.backends.numpy_ops.NumpyOps.reduce_mean
AssertionError

✘ Missing dependency specified by command 'evaluate':
training/model-best
Maybe you forgot to run the 'project assets' command or a previous step?

We've already seen this error (the reduce_mean one). A model-last is generated but no model-best. Previously, this error was due to the model version, right? But didn't I solve this issue?

At some point, training both relation models ran fine once (and only once), except for a warning that appeared multiple times with python3:

 Could not determine any instances in doc. 

which is weird, as both look identical according to the prodigy stats and spacy info commands?

But I can't reproduce this situation despite many attempts.

I'll continue describing the working situation. Finally, when I loaded the rel model alone, it worked. But it had its own ner pipeline, and the NER performance here is very low quality compared to the NER model alone.
I couldn't source the NER model because the ner pipeline already exists in the rel model.

I was wondering if removing anything related to ner in rel_tok2vec.cfg file would do the trick and enable to use the sourcing method. Unfortunately I can't test it because I have that error that I can't remove anymore when executing the script (I've tried to run each command line separately, the train_cpu one is always failing, with both prefixes).

There is still something I don't understand. Why, when training the rel component together with the ner pipeline, is the ner performance (in the rel pipeline) way behind the ner performance of the NER model alone? I understand that relation extraction is a hard task, and that doing both is harder still, but in that case, and if it could work, would the sourcing method improve the performance of the ner task (since the NER model is trained alone)?

Any advice? I have very little time left to make this work. :sweat_smile:
Sorry about that.

hi @stella,

Can we take a few steps back?

The goal is to reproduce Sofie's rel_component with a joint training of both ner and relations, correct?

If so, I was able to reproduce without any issues.

Here are my steps:

Setup project and virtual environment

$ git clone https://github.com/explosion/projects
$ cd projects
# ensure python 3.9 or 3.10 as Prodigy doesn't have wheels for 3.11 yet
$ python3.9 -m venv venv 
$ source venv/bin/activate
(venv) $ which python3
./projects/venv/bin/python3
(venv) $ python3 --version
Python 3.9.16

When creating the virtual environment, you may not have set up the alias python3.9. You can try python3, but the key is to make sure you use either Python 3.9 or Python 3.10 (not, say, Python 3.11). Prodigy doesn't have wheels for Python 3.11 yet.

If you're not familiar with setting up Python aliases, you can find lots of material online.
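If you'd rather check the interpreter version from inside Python, a minimal sketch could look like this. The set of supported versions only reflects what is stated in this thread at the time of writing (an assumption that will go out of date with newer Prodigy releases), and `is_supported` is a made-up helper name.

```python
import sys

# Versions named in this thread as supported at the time of writing.
# Assumption: this set changes with newer Prodigy releases.
SUPPORTED = {(3, 9), (3, 10)}


def is_supported(version_info=sys.version_info):
    """True when the interpreter's (major, minor) is in SUPPORTED."""
    return tuple(version_info[:2]) in SUPPORTED


if __name__ == "__main__":
    print(sys.version.split()[0], "supported:", is_supported())
```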

The which python3 output above confirms that the python3 alias points to my virtual environment, ./projects/venv/bin/python3.

In the next step, I'll double check that spaCy and Prodigy are pointing to the same venv.

Install Prodigy and check spaCy / Prodigy versions

(venv) $ pip install prodigy -f https://xxxx-xxxx-xxxx-xxxx@download.prodi.gy

[skipping output details]

(venv) $ python3 -m spacy info

============================== Info about spaCy ==============================

spaCy version    3.5.2                         
Location         ./projects/venv/lib/python3.9/site-packages/spacy
Platform         macOS-13.2.1-x86_64-i386-64bit
Python version   3.9.16                        
Pipelines                                      

(venv) $ python3 -m spacy download en_core_web_sm
Collecting en-core-web-sm==3.5.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.5.0/en_core_web_sm-3.5.0-py3-none-any.whl (12.8 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 12.8/12.8 MB 
...
Installing collected packages: en-core-web-sm
Successfully installed en-core-web-sm-3.5.0

[notice] A new release of pip available: 22.3.1 -> 23.0.1
[notice] To update, run: pip install --upgrade pip
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')
(venv) $ python3 -m spacy info                   

============================== Info about spaCy ==============================

spaCy version    3.5.2                         
Location         ./projects/venv/lib/python3.9/site-packages/spacy
Platform         macOS-13.2.1-x86_64-i386-64bit
Python version   3.9.16                        
Pipelines        en_core_web_sm (3.5.0) 

(venv) $ python3 -m prodigy stats

============================== ✨  Prodigy Stats ==============================

Version          1.11.11                       
Location         ./projects/venv/lib/python3.9/site-packages/prodigy
Prodigy Home     ~/.prodigy  
Platform         macOS-13.2.1-x86_64-i386-64bit
Python Version   3.9.16                        
Database Name    SQLite                        
Database Id      sqlite 

As you can see, spacy and prodigy are both installed in ./projects/venv/. I also installed an en_core_web_sm model, which isn't needed to run rel_component, but I'm doing it now to confirm that it goes into the same spacy package and that its version (3.5.0+) is consistent.
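The same check can be scripted. This is a minimal sketch using only the standard library; `package_location` and `lives_in_current_env` are made-up helper names, and the check simply tests whether a package's import location sits under `sys.prefix`.

```python
import importlib.util
import sys
from pathlib import Path


def package_location(name):
    """Return the path a top-level package would be imported from, or None."""
    spec = importlib.util.find_spec(name)
    if spec is None or spec.origin is None:
        return None
    return Path(spec.origin).resolve()


def lives_in_current_env(name):
    """True when the package is found under sys.prefix (the active environment)."""
    location = package_location(name)
    if location is None:
        return False
    return Path(sys.prefix).resolve() in location.parents


if __name__ == "__main__":
    for name in ("spacy", "prodigy"):
        print(name, package_location(name), lives_in_current_env(name))
```

Inside an activated venv, a package installed under ~/.local/lib/... (as in the outputs posted earlier in this thread) would be reported as not living in the current environment.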

Ensure data/assets are available

(venv) $ cd tutorials/rel_component
(venv) $ python3 -m spacy project assets
ℹ Fetching 1 asset(s)
✔ Asset already exists: ./projects/tutorials/rel_component/assets/annotations.jsonl
(venv) $ python3 -m spacy project run data

==================================== data ====================================
Running command: './projects/venv/bin/python3' ./scripts/parse_data.py assets/annotations.jsonl data/train.spacy data/dev.spacy data/test.spacy
ℹ 102 training sentences from 43 articles, 209/2346 pos instances.
ℹ 27 dev sentences from 5 articles, 56/710 pos instances.
ℹ 20 test sentences from 6 articles, 30/340 pos instances.

You can skip the python3 -m spacy project assets step as the asset is already there.

Update project.yml with this file

Now, you'll need to manually add this file, rel_joint.cfg, to your projects/tutorials/rel_component/configs folder:

[paths]
train = null
dev = null
vectors = null
init_tok2vec = null

[system]
seed = 342
gpu_allocator = null

[nlp]
lang = "en"
pipeline = ["tok2vec","ner","relation_extractor"]
disabled = []
before_creation = null
after_creation = null
after_pipeline_creation = null
tokenizer = {"@tokenizers":"spacy.Tokenizer.v1"}
batch_size = 1000

[components]

[components.ner]
factory = "ner"
incorrect_spans_key = "incorrect_spans"
moves = null
scorer = {"@scorers":"spacy.ner_scorer.v1"}
update_with_oracle_cut_size = 100

[components.ner.model]
@architectures = "spacy.TransitionBasedParser.v2"
state_type = "ner"
extra_state_tokens = false
hidden_width = 64
maxout_pieces = 2
use_upper = true
nO = null

[components.ner.model.tok2vec]
@architectures = "spacy.Tok2VecListener.v1"
width = 96
upstream = "*"

[components.tok2vec]
factory = "tok2vec"

[components.tok2vec.model]
@architectures = "spacy.HashEmbedCNN.v1"
pretrained_vectors = null
width = 96
depth = 2
embed_size = 2000
window_size = 1
maxout_pieces = 3
subword_features = true

[components.relation_extractor]
factory = "relation_extractor"
threshold = 0.5

[components.relation_extractor.model]
@architectures = "rel_model.v1"

[components.relation_extractor.model.create_instance_tensor]
@architectures = "rel_instance_tensor.v1"

[components.relation_extractor.model.create_instance_tensor.tok2vec]
@architectures = "spacy.Tok2VecListener.v1"
width = ${components.tok2vec.model.width}

[components.relation_extractor.model.create_instance_tensor.pooling]
@layers = "reduce_mean.v1"

[components.relation_extractor.model.create_instance_tensor.get_instances]
@misc = "rel_instance_generator.v1"
max_length = 100

[components.relation_extractor.model.classification_layer]
@architectures = "rel_classification_layer.v1"
nI = null
nO = null

[corpora]

[corpora.dev]
@readers = "spacy.Corpus.v1"
path = ${paths.dev}
max_length = 0
gold_preproc = false
limit = 0
augmenter = null

[corpora.train]
@readers = "spacy.Corpus.v1"
path = ${paths.train}
max_length = 0
gold_preproc = false
limit = 0
augmenter = null

[training]
seed = ${system.seed}
gpu_allocator = ${system.gpu_allocator}
dropout = 0.1
accumulate_gradient = 1
patience = 1600000
max_epochs = 0
max_steps = 10000
eval_frequency = 500
frozen_components = []
dev_corpus = "corpora.dev"
train_corpus = "corpora.train"
before_to_disk = null
annotating_components = ["ner"]
logger = {"@loggers":"spacy.ConsoleLogger.v1"}

[training.batcher]
@batchers = "spacy.batch_by_words.v1"
discard_oversize = false
tolerance = 0.2
get_length = null

[training.batcher.size]
@schedules = "compounding.v1"
start = 100
stop = 1000
compound = 1.001

[training.optimizer]
@optimizers = "Adam.v1"
beta1 = 0.9
beta2 = 0.999
L2_is_weight_decay = true
L2 = 0.01
grad_clip = 1.0
use_averages = false
eps = 0.00000001
learn_rate = 0.001

[training.score_weights]
rel_micro_p = 0.0
rel_micro_r = 0.0
rel_micro_f = 1.0

[initialize]
vectors = ${paths.vectors}
init_tok2vec = ${paths.init_tok2vec}
vocab_data = null
lookups = null
before_init = null
after_init = null

[initialize.components]

[initialize.tokenizer]
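After saving the file, a quick sanity check of the two settings that make the training joint can be sketched with the standard library. spaCy parses configs with its own loader, so the stdlib `configparser` here is only an approximation for reading simple values; `pipeline_summary` and the inline `SAMPLE` text are made up for illustration.

```python
import json
from configparser import ConfigParser


def pipeline_summary(cfg_text):
    """Read pipeline-related settings from spaCy-style config text."""
    # interpolation=None keeps ${...} references as plain text.
    parser = ConfigParser(interpolation=None, delimiters=("=",))
    parser.read_string(cfg_text)
    return {
        "pipeline": json.loads(parser["nlp"]["pipeline"]),
        "annotating_components": json.loads(
            parser["training"]["annotating_components"]
        ),
    }


# Trimmed-down excerpt of the config above.
SAMPLE = """
[nlp]
pipeline = ["tok2vec","ner","relation_extractor"]

[training]
annotating_components = ["ner"]
"""

summary = pipeline_summary(SAMPLE)
print(summary)
```

Listing "ner" under annotating_components is what lets the relation extractor see predicted entities during training; it corresponds to the "Set annotations on update for: ['ner']" line in the training log.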

Then, you need to update your rel_component/project.yml file to include commands for joint training:

title: "Example project of creating a novel nlp component to do relation extraction from scratch."
description: "This example project shows how to implement a spaCy component with a custom Machine Learning model, how to train it with and without a transformer, and how to apply it on an evaluation dataset."

# Variables can be referenced across the project.yml using ${vars.var_name}
vars:
  annotations: "assets/annotations.jsonl"
  tok2vec_config: "configs/rel_tok2vec.cfg"
  trf_config: "configs/rel_trf.cfg"
  joint_config: "configs/rel_joint.cfg"
  train_file: "data/train.spacy"
  dev_file: "data/dev.spacy"
  test_file: "data/test.spacy"
  trained_model: "training/model-best"

# These are the directories that the project needs. The project CLI will make
# sure that they always exist.
directories: ["scripts", "configs", "assets", "data", "training"]

# Assets that should be downloaded or available in the directory. You can replace
# this with your own input data.
assets:
    - dest: ${vars.annotations}
      description: "Gold-standard REL annotations created with Prodigy"

workflows:
  all:
    - data
    - train_cpu
    - evaluate
  all_gpu:
    - data
    - train_gpu
    - evaluate

# Project commands, specified in a style similar to CI config files (e.g. Azure
# pipelines). The name is the command name that lets you trigger the command
# via "spacy project run [command] [path]". The help message is optional and
# shown when executing "spacy project run [optional command] [path] --help".
commands:
  - name: "data"
    help: "Parse the gold-standard annotations from the Prodigy annotations."
    script:
      - "python ./scripts/parse_data.py ${vars.annotations} ${vars.train_file} ${vars.dev_file} ${vars.test_file}"
    deps:
      - ${vars.annotations}
    outputs:
      - ${vars.train_file}
      - ${vars.dev_file}
      - ${vars.test_file}

  - name: "train_cpu"
    help: "Train the REL model on the CPU and evaluate on the dev corpus."
    script:
      - "python -m spacy train ${vars.tok2vec_config} --output training --paths.train ${vars.train_file} --paths.dev ${vars.dev_file} -c ./scripts/custom_functions.py"
    deps:
      - ${vars.train_file}
      - ${vars.dev_file}
    outputs:
      - ${vars.trained_model}

  - name: "train_joint_cpu"
    help: "Jointly train the NER and REL model on the CPU and evaluate on the dev corpus."
    script:
      - "python -m spacy train ${vars.joint_config} --output training --paths.train ${vars.train_file} --paths.dev ${vars.dev_file} -c ./scripts/custom_functions.py"
    deps:
      - ${vars.train_file}
      - ${vars.dev_file}
    outputs:
      - ${vars.trained_model}

  - name: "train_gpu"
    help: "Train the REL model with a Transformer on a GPU and evaluate on the dev corpus."
    script:
      - "python -m spacy train ${vars.trf_config} --output training --paths.train ${vars.train_file} --paths.dev ${vars.dev_file} -c ./scripts/custom_functions.py --gpu-id 0"
    deps:
      - ${vars.train_file}
      - ${vars.dev_file}
    outputs:
      - ${vars.trained_model}

  - name: "evaluate"
    help: "Apply the best model to new, unseen text, and measure accuracy at different thresholds."
    script:
      - "python ./scripts/evaluate.py ${vars.trained_model} ${vars.test_file} False"
    deps:
      - ${vars.trained_model}
      - ${vars.test_file}


  - name: "clean"
    help: "Remove intermediate files to start data preparation and training from a clean slate."
    script:
      - "rm -rf data/*"
      - "rm -rf training/*"

The two additions were (1) adding joint_config: "configs/rel_joint.cfg" in vars and (2) adding name: "train_joint_cpu" command.

Run joint training

Note the line at the top of the output, Running command: './projects/venv/bin/python3' -m spacy train, which shows that it's running in the venv we set up.

(venv) $ python3 -m spacy project run train_joint_cpu

============================== train_joint_cpu ==============================
Running command: './projects/venv/bin/python3' -m spacy train configs/rel_joint.cfg --output training --paths.train data/train.spacy --paths.dev data/dev.spacy -c ./scripts/custom_functions.py
ℹ Saving to output directory: training
ℹ Using CPU

=========================== Initializing pipeline ===========================
[2023-04-13 07:55:58,478] [INFO] Set up nlp object from config
[2023-04-13 07:55:58,486] [INFO] Pipeline: ['tok2vec', 'ner', 'relation_extractor']
[2023-04-13 07:55:58,488] [INFO] Created vocabulary
[2023-04-13 07:55:58,489] [INFO] Finished initializing nlp object
[2023-04-13 07:55:58,685] [INFO] Initialized pipeline components: ['tok2vec', 'ner', 'relation_extractor']
✔ Initialized pipeline

============================= Training pipeline =============================
ℹ Pipeline: ['tok2vec', 'ner', 'relation_extractor']
ℹ Set annotations on update for: ['ner']
ℹ Initial learn rate: 0.001
E    #       LOSS TOK2VEC  LOSS NER  LOSS RELAT...  ENTS_F  ENTS_P  ENTS_R  REL_MICRO_P  REL_MICRO_R  REL_MICRO_F  SCORE 
---  ------  ------------  --------  -------------  ------  ------  ------  -----------  -----------  -----------  ------
ℹ Could not determine any instances in doc.
  0       0          0.00     37.00           0.00    0.00    0.00    0.00         0.00         0.00         0.00    0.00
ℹ Could not determine any instances in doc.
ℹ Could not determine any instances in doc.
.
[skipping a lot of output]
.
ℹ Could not determine any instances in doc.
2176   10000      50322.12   6468.84           0.00   61.31   91.30   46.15        22.22        50.00        30.77    0.46
✔ Saved pipeline to output directory
training/model-last

As I discussed earlier, you can ignore the Could not determine any instances in doc. messages, as those come from docs that have no trainable examples.
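To see why some docs yield no instances, here is a simplified sketch of how candidate entity pairs might be generated. It is loosely modeled on the idea behind the rel_instance_generator.v1 function referenced in the config (with max_length = 100); the function name `candidate_pairs` and the use of bare token offsets are assumptions, since the tutorial's own code works on doc.ents.

```python
def candidate_pairs(entity_starts, max_length=100):
    """Ordered pairs of distinct entity start tokens at most max_length tokens apart."""
    pairs = []
    for first in entity_starts:
        for second in entity_starts:
            if first != second and abs(second - first) <= max_length:
                pairs.append((first, second))
    return pairs


print(candidate_pairs([]))           # no entities predicted
print(candidate_pairs([5]))          # only one entity
print(candidate_pairs([2, 13, 15]))  # three entities
```

A doc with zero or one predicted entity produces no pairs at all, which is the situation the message describes. Early in joint training the ner component predicts few entities, so the message shows up often.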

I'm skipping the evaluate command because that was created for only the relations. I'll leave that as an exercise for you to consider how to modify that or create a separate script for the ner training.

Using your new model

(venv) $ python3
Python 3.9.16 (main, Mar 29 2023, 14:53:38) 
[Clang 14.0.0 (clang-1400.0.29.202)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import spacy
>>> import scripts.rel_model
>>> from scripts.rel_pipe import make_relation_extractor
>>> nlp = spacy.load("training/model-best")
>>> doc = nlp("However, BMP-6 did not induce significant changes in the protein expression of Id2 and Id3.")
>>> doc._.rel
{(2, 13): {'Regulates': 0.645058, 'Binds': 0.93420774}, (2, 15): {'Regulates': 0.1442759, 'Binds': 0.07510549}, (13, 2): {'Regulates': 0.9587375, 'Binds': 0.8925074}, (13, 15): {'Regulates': 0.25181648, 'Binds': 0.08782529}, (15, 2): {'Regulates': 0.96757656, 'Binds': 0.4590491}, (15, 13): {'Regulates': 0.8233059, 'Binds': 0.63244414}}
>>> doc.ents
(BMP-6, Id2, Id3)

:tada:
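To read that output more easily, you can map each key back to its entities and keep only the confident labels. The sketch below uses plain Python on the dictionary copied from the session above. The start-to-text mapping is inferred from the keys and doc.ents, the 0.5 cutoff mirrors the threshold value in the config but is applied by hand here, and `relations_above` is a made-up helper name.

```python
# Output of doc._.rel copied from the session above.
rel = {
    (2, 13): {"Regulates": 0.645058, "Binds": 0.93420774},
    (2, 15): {"Regulates": 0.1442759, "Binds": 0.07510549},
    (13, 2): {"Regulates": 0.9587375, "Binds": 0.8925074},
    (13, 15): {"Regulates": 0.25181648, "Binds": 0.08782529},
    (15, 2): {"Regulates": 0.96757656, "Binds": 0.4590491},
    (15, 13): {"Regulates": 0.8233059, "Binds": 0.63244414},
}

# Start token -> entity text, inferred from the keys and doc.ents above.
# With a live doc: {ent.start: ent.text for ent in doc.ents}
entity_by_start = {2: "BMP-6", 13: "Id2", 15: "Id3"}


def relations_above(rel, entity_by_start, threshold=0.5):
    """List (first entity, label, second entity, score) for scores >= threshold."""
    found = []
    for (first, second), scores in rel.items():
        for label, score in scores.items():
            if score >= threshold:
                found.append(
                    (entity_by_start[first], label, entity_by_start[second], score)
                )
    return found


for row in relations_above(rel, entity_by_start):
    print(row)
```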

For your last question: I had previously misinterpreted it. I thought you were asking why rel performance decreased when adding ner joint training.

I see a bit of what you're saying -- using the config above and only running ner (i.e., changing pipeline = ["tok2vec","ner","relation_extractor"] to pipeline = ["tok2vec","ner"] so only ner trains) does seem to improve the performance of ner.

Can you post that question on spaCy's GitHub discussions since that's a specific spaCy question? I think you need their expertise as it could be something small I missed or need to optimize in my config file.

This forum is really for Prodigy questions, and a lot of your questions are spaCy problems. I've tried my best to answer them so you don't have to go back and forth, but I think this would be a good point to hand over so they can focus on that specific problem.

Hope this helps!

Hi Ryan,

Thanks for the very detailed explanation !

On Sofie's data, it works perfectly.

On my data, I had to edit the parse_data.py file as you told me to in early answers on this thread.

Still, there is a remaining issue that I'd need you to answer.

The error :

python3 -m spacy project run data

==================================== data ====================================
Running command: /home///venv/bin/python3 ./scripts/parse_data.py assets/annotations.jsonl data/train.spacy data/dev.spacy data/test.spacy
Traceback (most recent call last):

  File "/home//////rel_component/./scripts/parse_data.py", line 136, in <module>
    typer.run(main)

  File "/home//////rel_component/./scripts/parse_data.py", line 78, in main
    end = span_end_to_start[relation["child"]]

KeyError: 46

The parse data file :

#parse_data.py
import json
import random

import typer
from pathlib import Path

from spacy.tokens import Span, DocBin, Doc
from spacy.vocab import Vocab
from wasabi import Printer

msg = Printer()

SYMM_LABELS = ["1", "3"]
MAP_LABELS = {
    "1": "1",
    "2": "2", 
    "3": "3",
}


def main(json_loc: Path, train_file: Path, dev_file: Path, test_file: Path):
    """Creating the corpus from the Prodigy annotations."""
    random.seed(0)
    Doc.set_extension("rel", default={})
    vocab = Vocab()

    docs = {"train": [], "dev": [], "test": []}
    ids = {"train": set(), "dev": set(), "test": set()}
    count_all = {"train": 0, "dev": 0, "test": 0}
    count_pos = {"train": 0, "dev": 0, "test": 0}

    with json_loc.open("r", encoding="utf8") as jsonfile:
        for line in jsonfile:
            example = json.loads(line)
            span_starts = set()
            if example["answer"] == "accept":
                neg = 0
                pos = 0
                # Parse the tokens
                words = [t["text"] for t in example["tokens"]]
                spaces = [t["ws"] for t in example["tokens"]]
                doc = Doc(vocab, words=words, spaces=spaces)

                # Parse the GGP entities
                spans = example["spans"]
                entities = []
                span_end_to_start = {}
                for span in spans:
                    entity = doc.char_span(
                        span["start"], span["end"], label=span["label"]
                    )
                    span_end_to_start[span["token_end"]] = span["token_start"]
                    entities.append(entity)
                    span_starts.add(span["token_start"])
                doc.ents = entities

                # Parse the relations
                rels = {}
                for x1 in span_starts:
                    for x2 in span_starts:
                        rels[(x1, x2)] = {}
                relations = example["relations"]
                for relation in relations:
                    # the 'head' and 'child' annotations refer to the end token in the span
                    # but we want the first token
                    start = span_end_to_start[relation["head"]]
                    end = span_end_to_start[relation["child"]]
                    label = relation["label"]
                    label = MAP_LABELS[label]
                    if label not in rels[(start, end)]:
                        rels[(start, end)][label] = 1.0
                        pos += 1
                    if label in SYMM_LABELS:
                        if label not in rels[(end, start)]:
                            rels[(end, start)][label] = 1.0
                            pos += 1

                # The annotation is complete, so fill in zero's where the data is missing
                for x1 in span_starts:
                    for x2 in span_starts:
                        for label in MAP_LABELS.values():
                            if label not in rels[(x1, x2)]:
                                neg += 1
                                rels[(x1, x2)][label] = 0.0
                doc._.rel = rels

                # only keeping documents with at least 1 positive case
                if pos > 0:
                    if random.random() < 0.2:
                        docs["test"].append(doc)
                        count_pos["test"] += pos
                        count_all["test"] += pos + neg
                    elif random.random() < 0.5:
                        docs["dev"].append(doc)
                        count_pos["dev"] += pos
                        count_all["dev"] += pos + neg
                    else:
                        docs["train"].append(doc)
                        count_pos["train"] += pos
                        count_all["train"] += pos + neg

    docbin = DocBin(docs=docs["train"], store_user_data=True)
    docbin.to_disk(train_file)
    msg.info(
        f"{len(docs['train'])} training sentences from {len(ids['train'])} articles, "
        f"{count_pos['train']}/{count_all['train']} pos instances."
    )

    docbin = DocBin(docs=docs["dev"], store_user_data=True)
    docbin.to_disk(dev_file)
    msg.info(
        f"{len(docs['dev'])} dev sentences from {len(ids['dev'])} articles, "
        f"{count_pos['dev']}/{count_all['dev']} pos instances."
    )

    docbin = DocBin(docs=docs["test"], store_user_data=True)
    docbin.to_disk(test_file)
    msg.info(
        f"{len(docs['test'])} test sentences from {len(ids['test'])} articles, "
        f"{count_pos['test']}/{count_all['test']} pos instances."
    )


if __name__ == "__main__":
    typer.run(main)

Thank you. We're almost there.

To adapt to your project, you'd need to edit the parse_data_generic.py, not the parse_data.py. That's why we created it earlier.

Then you'd update these three steps:

Could you retry on the parse_data_generic.py?
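While retrying, it may also help to find out which annotations trigger the KeyError. In the script you posted, the lookup span_end_to_start[relation["child"]] fails when a relation's "child" (or "head") token is not the token_end of any span in that example. The sketch below lists such relations; the field names come from your script, while `unmatched_relations` and the tiny example record are made up for illustration.

```python
def unmatched_relations(example):
    """Return relations whose head or child token is not the end token of any span."""
    span_ends = {span["token_end"] for span in example.get("spans", [])}
    return [
        rel
        for rel in example.get("relations", [])
        if rel["head"] not in span_ends or rel["child"] not in span_ends
    ]


# Made-up record in the shape the script reads from annotations.jsonl.
example = {
    "spans": [{"token_start": 40, "token_end": 41, "label": "1"}],
    "relations": [{"head": 41, "child": 46, "label": "2"}],
}

print(unmatched_relations(example))
```

Running this over each line of your annotations file (after json.loads) would show which examples contain relations that point at tokens outside the annotated spans.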