Training a relation extraction component

As per the error, it sounds like you may not be passing the tutorial’s config that creates this component.

Are you running your command like the tutorial does:

python -m spacy train configs/rel_tok2vec.cfg --output training --paths.train train.spacy --paths.dev dev.spacy -c ./scripts/custom_functions.py

I've always run:

# this will run the default parse_data, train_cpu, and evaluate commands
python -m spacy project run all 

Now, in the rel_component folder, I've run:

python -m spacy project run train_cpu

================================= train_cpu =================================
ℹ Skipping 'train_cpu': nothing changed

then

python -m spacy train configs/rel_tok2vec.cfg --output training --paths.train train.spacy --paths.dev dev.spacy -c ./scripts/custom_functions.py

and obtained

FileNotFoundError: [Errno 2] No such file or directory: 'train.spacy'

But isn't this command supposed to be the same as python -m spacy project run train_cpu? I don't get this.

Hi @stella,

You're having a simple file lookup issue. I wasn't sure if you were still running this as a spaCy project or standalone, so I provided a standalone command that assumed your spaCy binary files were in the same folder.

I would recommend keeping this as a spaCy project built from Sofie's tutorial. You would need to modify the project.yml so that its data command references your new parsing script. The spaCy files will then go into the data folder, which is where the training command looks for them.
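For example, the data command's script entry in project.yml would point at your new script, something like this (a sketch; parse_data_generic.py is a placeholder for whatever your script is called, and the variables follow the tutorial's project.yml):

commands:
  - name: "data"
    script:
      - "python ./scripts/parse_data_generic.py ${vars.annotations} ${vars.train_file} ${vars.dev_file} ${vars.test_file}"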

Hi Ryan,

I'm not sure I understand your answer. Is the lookup issue related to using your standalone command? I have no issue when using the following command to generate the model:

# this will run the default parse_data, train_cpu, and evaluate commands
python -m spacy project run all 

I think there is still an issue when trying to load the generated model. I'm trying to load the model-best folder with spacy.load().

Whether I try to load it in the rel_component project repository or in another repository, I still get the error:

ValueError: [E002] Can't find factory for 'relation_extractor' for language English (en). This usually happens when spaCy calls `nlp.create_pipe` with a custom component name that's not registered on the current language class. If you're using a Transformer, make sure to install 'spacy-transformers'. If you're using a custom component, make sure you've added the decorator `@Language.component` (for function components) or `@Language.factory` (for class components).

Available factories: attribute_ruler, tok2vec, merge_noun_chunks, merge_entities, merge_subtokens, token_splitter, doc_cleaner, parser, beam_parser, lemmatizer, trainable_lemmatizer, entity_linker, ner, beam_ner, entity_ruler, tagger, morphologizer, senter, sentencizer, textcat, spancat, spancat_singlelabel, future_entity_ruler, span_ruler, textcat_multilabel, en.lemmatizer

I'd need a clearer explanation to solve this, as for the moment I can't load my generated model. Is the issue related to the generation of the model or to the way I'm trying to load it?

Thank you and sorry if I didn't get what you meant.

Hi Stella!

rel_pipe.py defines the relation_extractor factory, and this code needs to be imported when you want to load the model back in. When you're using custom code to load the generated model, you can for instance add rel_pipe.py to your working folder and import make_relation_extractor from it at the top of your script - this will ensure that spaCy has seen the factory definition and is able to load your model.
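For example, the top of your loading script could look like this (a minimal sketch, assuming rel_pipe.py sits in your working folder and the pipeline was saved to training/model-best):

import spacy

# Importing the factory function ensures spaCy has seen the "relation_extractor" definition
from rel_pipe import make_relation_extractor

nlp = spacy.load("training/model-best")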

In the tutorial, this is accomplished during training by importing custom_functions.py which, as you can see from the link, imports the right bits to make the factory & config work. In the spacy train command, this is done by adding -c custom_functions.py to the command.

Hi Sofie,

Thanks for your quick answer!

Great, so let's say that I'm trying to load the model in test_project_rel.py (in your architecture).

I've added:

from scripts.rel_pipe import make_relation_extractor

and now the error has changed; it is:

catalogue.RegistryError: [E893] Could not find function 'rel_model.v1' in function registry 'architectures'. If you're using a custom function, make sure the code is available. If the function is provided by a third-party package, e.g. spacy-transformers, make sure the package is installed in your environment.

Available names: spacy-legacy.CharacterEmbed.v1, spacy-legacy.EntityLinker.v1, spacy-legacy.HashEmbedCNN.v1, spacy-legacy.MaxoutWindowEncoder.v1, spacy-legacy.MishWindowEncoder.v1, spacy-legacy.MultiHashEmbed.v1, spacy-legacy.Tagger.v1, spacy-legacy.TextCatBOW.v1, spacy-legacy.TextCatCNN.v1, spacy-legacy.TextCatEnsemble.v1, spacy-legacy.Tok2Vec.v1, spacy-legacy.TransitionBasedParser.v1, spacy-transformers.Tok2VecTransformer.v1, spacy-transformers.Tok2VecTransformer.v2, spacy-transformers.Tok2VecTransformer.v3, spacy-transformers.TransformerListener.v1, spacy-transformers.TransformerModel.v1, spacy-transformers.TransformerModel.v2, spacy-transformers.TransformerModel.v3, spacy.CharacterEmbed.v2, spacy.EntityLinker.v2, spacy.HashEmbedCNN.v2, spacy.MaxoutWindowEncoder.v2, spacy.MishWindowEncoder.v2, spacy.MultiHashEmbed.v2, spacy.PretrainCharacters.v1, spacy.PretrainVectors.v1, spacy.SpanCategorizer.v1, spacy.Tagger.v2, spacy.TextCatBOW.v2, spacy.TextCatCNN.v2, spacy.TextCatEnsemble.v2, spacy.TextCatLowData.v1, spacy.Tok2Vec.v2, spacy.Tok2VecListener.v1, spacy.TorchBiLSTMEncoder.v1, spacy.TransitionBasedParser.v2

So I've also added:

import scripts.rel_model

And now there's no error, only a message when executing the code:

ℹ Could not determine any instances in doc - returning doc as is.

I've used the following example from the documentation and tried to do NER on it:

Apple is looking at buying U.K. startup for $1 billion

Does it mean that even though I used:

prodigy rel.manual ner_rels en_core_web_sm ...

I can't do NER with the named entities from en_core_web_sm?

Also, if relations were extracted from the text (and it is not the case here, given the size of my annotations.jsonl file and the given text), I could access them with doc._.rel, correct? Is there any specific information stored in ._.rel, like start, end, label, etc., as for .ents? I'm sure you explained this somewhere but I could not find the information in the video.

Thanks for being so patient with me!

Sorry, but I thought you were dealing with the issue of training in your spaCy project, right?

I'm not sure why you'd need to modify this script for this purpose. Let's go back to where you were:

First, since you've run a few practice trainings in your folder, make sure to first run python -m spacy project run clean to clear out the old outputs.

Also, make sure to remove the old annotations.jsonl data and replace it with your data.

Next, have you updated your project.yml to reference the new parse_data_generic.py instead of the parse_data.py file that is used by default?

If you don't do these steps, running spacy project run all will either not update anything (because you are reusing old files) or keep referencing parse_data.py in the data command as part of the all step.

With those two changes, the data command should export the spaCy binary files into the right place (the data folder), which is where the train_cpu command looks for them.

Instead of running python -m spacy project run all, I would recommend running each step individually so you can check and verify:

# fyi running these three steps is the same as running: spacy project run all
python -m spacy project run data  # check that your binary files are in the data folder
python -m spacy project run train_cpu
python -m spacy project run evaluate

Here's what the custom attribute doc._.rel includes:
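It's a dict keyed by the start-token offsets of the two entities in each candidate pair, mapping each relation label to a predicted score (this is how the tutorial's rel_pipe.py fills it in). You can iterate it roughly like this (a sketch; the 0.5 cutoff is just an example):

for (ent1_start, ent2_start), label_scores in doc._.rel.items():
    # each key is (start token of entity 1, start token of entity 2)
    for label, score in label_scores.items():
        if score >= 0.5:  # example cutoff; the component also has its own threshold setting
            print(ent1_start, ent2_start, label, round(float(score), 2))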

This part of the video (25:30) explains it in more detail.

Hope this helps!

Hi Ryan,

Sorry, maybe I had trouble explaining the issue.

Generating the model runs fine now, thanks to your last answer. There was an issue when loading the model elsewhere (which is why I tried to load it in test_project_rel.py). Sofie explained to me how to load it correctly (you just need to import some of her modules), so now everything is okay.

My last misunderstanding is the following: when you load your model, even if you built it on top of the English model (with prodigy rel.manual ner_rels en_core_web_sm ... etc.), you can't do NER on the simple example from the documentation. It actually looks logical: there is no information in the annotations.jsonl file concerning the standard English model. So why do you have the possibility to build your custom model on top of the standard one? Does it mean you only use the POS tags of the English model? Aren't the named entities imported into your custom model?

Thanks.

Maybe there is still an issue, actually.

When I load any generated model (trained on Sofie's annotations or on mine) and apply it to any text (even text from the training data), trying to do NER and relation extraction on it, no named entity or relation is extracted. The doc is always returned as is.

Is it correct to load the model-best folder with spacy.load()? Am I doing anything wrong?

I can confirm there is still an issue. Please don't leave it unresolved; I really need the relation extraction component to work on my side.

I'll give you more information about my parameters.

I'm working on Sofie's annotations.jsonl file.

I've edited the project.yml file with

    script:
      - "python ./scripts/parse_data_generic.py ${vars.annotations} ${vars.train_file} ${vars.dev_file} ${vars.test_file}"

My parse_data_generic.py contains:

SYMM_LABELS = ["Binds"]
DIRECTED_LABELS = ["Pos-Reg", "Neg-Reg", "No-rel", "Regulates"]

I've run:

python -m spacy project run clean
python -m spacy project run data
python -m spacy project run train_cpu
python -m spacy project run evaluate

This is the output of train_cpu:

================================= train_cpu =================================
Running command: /usr/bin/python -m spacy train configs/rel_tok2vec.cfg --output training --paths.train data/train.spacy --paths.dev data/dev.spacy -c ./scripts/custom_functions.py
ℹ Saving to output directory: training
ℹ Using CPU

=========================== Initializing pipeline ===========================
[2023-03-27 15:34:04,066] [INFO] Set up nlp object from config
[2023-03-27 15:34:04,072] [INFO] Pipeline: ['tok2vec', 'relation_extractor']
[2023-03-27 15:34:04,074] [INFO] Created vocabulary
[2023-03-27 15:34:04,074] [INFO] Finished initializing nlp object
[2023-03-27 15:34:04,103] [INFO] Initialized pipeline components: ['tok2vec', 'relation_extractor']
✔ Initialized pipeline

============================= Training pipeline =============================
ℹ Pipeline: ['tok2vec', 'relation_extractor']
ℹ Initial learn rate: 0.001
E    #       LOSS TOK2VEC  LOSS RELAT...  REL_MICRO_P  REL_MICRO_R  REL_MICRO_F  SCORE 
---  ------  ------------  -------------  -----------  -----------  -----------  ------
  0       0          0.06           0.71        21.88        35.00        26.92    0.27
220     500          0.21           4.58        63.64        35.00        45.16    0.45
625    1000          0.00           0.00        63.64        35.00        45.16    0.45
1125    1500          0.00           0.00        63.64        35.00        45.16    0.45
1625    2000          0.00           0.00        66.67        40.00        50.00    0.50
2125    2500          0.00           0.00        63.64        35.00        45.16    0.45
2625    3000          0.00           0.00        63.64        35.00        45.16    0.45
3125    3500          0.00           0.00        63.64        35.00        45.16    0.45
3625    4000          0.00           0.00        63.64        35.00        45.16    0.45
4125    4500          0.00           0.00        63.64        35.00        45.16    0.45
4625    5000          0.00           0.00        63.64        35.00        45.16    0.45
5125    5500          0.00           0.00        60.00        30.00        40.00    0.40
5625    6000          0.00           0.00        63.64        35.00        45.16    0.45
6125    6500          0.00           0.00        60.00        30.00        40.00    0.40
6625    7000          0.00           0.00        60.00        30.00        40.00    0.40
7125    7500          0.04           0.19        50.00        30.00        37.50    0.37
7625    8000          0.00           0.00        50.00        30.00        37.50    0.37
8125    8500          0.00           0.00        50.00        30.00        37.50    0.37
8625    9000          0.00           0.00        50.00        30.00        37.50    0.37
9125    9500          0.00           0.00        50.00        30.00        37.50    0.37
9625   10000          0.00           0.00        50.00        30.00        37.50    0.37
✔ Saved pipeline to output directory
training/model-last

This is the output of evaluate:

Running command: /usr/bin/python ./scripts/evaluate.py training/model-best data/test.spacy False

Random baseline:
threshold 0.00   {'rel_micro_p': '9.26', 'rel_micro_r': '100.00', 'rel_micro_f': '16.95'}
threshold 0.05   {'rel_micro_p': '9.52', 'rel_micro_r': '100.00', 'rel_micro_f': '17.39'}
threshold 0.10   {'rel_micro_p': '9.28', 'rel_micro_r': '90.00', 'rel_micro_f': '16.82'}
threshold 0.20   {'rel_micro_p': '8.79', 'rel_micro_r': '80.00', 'rel_micro_f': '15.84'}
threshold 0.30   {'rel_micro_p': '9.88', 'rel_micro_r': '80.00', 'rel_micro_f': '17.58'}
threshold 0.40   {'rel_micro_p': '11.94', 'rel_micro_r': '80.00', 'rel_micro_f': '20.78'}
threshold 0.50   {'rel_micro_p': '12.50', 'rel_micro_r': '70.00', 'rel_micro_f': '21.21'}
threshold 0.60   {'rel_micro_p': '6.82', 'rel_micro_r': '30.00', 'rel_micro_f': '11.11'}
threshold 0.70   {'rel_micro_p': '5.88', 'rel_micro_r': '20.00', 'rel_micro_f': '9.09'}
threshold 0.80   {'rel_micro_p': '5.26', 'rel_micro_r': '10.00', 'rel_micro_f': '6.90'}
threshold 0.90   {'rel_micro_p': '0.00', 'rel_micro_r': '0.00', 'rel_micro_f': '0.00'}
threshold 0.99   {'rel_micro_p': '0.00', 'rel_micro_r': '0.00', 'rel_micro_f': '0.00'}
threshold 1.00   {'rel_micro_p': '0.00', 'rel_micro_r': '0.00', 'rel_micro_f': '0.00'}

Results of the trained model:
threshold 0.00   {'rel_micro_p': '9.26', 'rel_micro_r': '100.00', 'rel_micro_f': '16.95'}
threshold 0.05   {'rel_micro_p': '22.86', 'rel_micro_r': '80.00', 'rel_micro_f': '35.56'}
threshold 0.10   {'rel_micro_p': '25.81', 'rel_micro_r': '80.00', 'rel_micro_f': '39.02'}
threshold 0.20   {'rel_micro_p': '34.78', 'rel_micro_r': '80.00', 'rel_micro_f': '48.48'}
threshold 0.30   {'rel_micro_p': '38.10', 'rel_micro_r': '80.00', 'rel_micro_f': '51.61'}
threshold 0.40   {'rel_micro_p': '41.18', 'rel_micro_r': '70.00', 'rel_micro_f': '51.85'}
threshold 0.50   {'rel_micro_p': '40.00', 'rel_micro_r': '60.00', 'rel_micro_f': '48.00'}
threshold 0.60   {'rel_micro_p': '54.55', 'rel_micro_r': '60.00', 'rel_micro_f': '57.14'}
threshold 0.70   {'rel_micro_p': '66.67', 'rel_micro_r': '60.00', 'rel_micro_f': '63.16'}
threshold 0.80   {'rel_micro_p': '75.00', 'rel_micro_r': '60.00', 'rel_micro_f': '66.67'}
threshold 0.90   {'rel_micro_p': '80.00', 'rel_micro_r': '40.00', 'rel_micro_f': '53.33'}
threshold 0.99   {'rel_micro_p': '100.00', 'rel_micro_r': '10.00', 'rel_micro_f': '18.18'}
threshold 1.00   {'rel_micro_p': '0.00', 'rel_micro_r': '0.00', 'rel_micro_f': '0.00'}

So it looks like the model was correctly generated, right?

But now I want to use it, which is why I'm trying to load it in the test_project_rel.py file. I've used a long line of text from the training set to maximize the chances of recognizing something. What I'm doing is:

model = "training/model-best"
    text = "Transcriptional regulation of lysosomal acid lipase in differentiating monocytes is mediated by transcription factors Sp1 and AP-2. \nHuman lysosomal acid lipase (LAL) is a hydrolase required for the cleavage of cholesteryl esters and triglycerides derived from plasma lipoproteins."
    nlp = spacy.load(model)
    doc = nlp(text)
    for ent in doc.ents:
        print(ent.text, ent.start_char, ent.end_char, ent.label_)
    for rel in doc._.rel:
        print(rel)

And I obtain:

ℹ Could not determine any instances in doc - returning doc as is.

Not even named entities are recognized. It is exactly the same with my own annotations.jsonl file (and my own named entity and relation labels). Still, when I train a regular NER model outside the relation component, everything runs fine.

My team really needs this component working. Could you please explain what the remaining issue is?

Thank you very much.

hi @stella,

You are getting this because no relations are predicted in that example. The problem is likely that you're not passing any entities with the example.

The original code Sofie provided only trains a relation component, not a ner component. As Sofie states in her video, her config file only trains the relation component, not the ner component (please watch this part: 19:12).

That is, the default config file only has the tok2vec (or transformer) and the relation_extractor component:

pipeline = ["tok2vec", "relation_extractor"]

Since it doesn't have a ner component, it won't automatically train one. But I think you previously created a separate ner model, which is covered in this spaCy GitHub issue:

if you want to train the ner and the relation_extraction component together, you'll need to set annotating_components = ["ner"] in your config. This will make the NER predictions available to the downstream relation extraction component, so it can use them to predict relations. Alternatively, you could train this in two steps, with two configs: 1 focusing only on the NER, and the second sourcing the trained NER model and then train the relation extraction.

If you want to train both together, make sure to add the ner component itself too (not just annotating_components = ["ner"]).
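Concretely, that means the config should contain something like this (mirroring the combined config I paste further down; the full [components.ner] block also needs to be defined):

[nlp]
pipeline = ["tok2vec","ner","relation_extractor"]

[training]
annotating_components = ["ner"]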

Sofie mentioned in an older post how to approach this:

But as she mentioned, a second option is a two-step training with two configs: the first focusing only on the NER, and the second sourcing the trained NER model and then training the relation extractor. This can be a fallback if you can't get the single config running.
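In that second config, sourcing could look roughly like this (a sketch; "training/ner-model" is a placeholder path to your separately trained NER pipeline):

[components.ner]
source = "training/ner-model"

[training]
frozen_components = ["ner"]
annotating_components = ["ner"]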

Last, have you spent some time looking at the evaluate.py script?

You'll find how to predict relations when you know the entities.
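The core idea there, roughly paraphrased (double-check the actual script): build a fresh Doc, copy the known entities onto it, and only then run the pipeline, so the relation component predicts over those entities.

from spacy.tokens import Doc

def predict_with_known_entities(nlp, gold):
    # hypothetical helper name; the logic mirrors the tutorial's evaluate.py
    pred = Doc(
        nlp.vocab,
        words=[t.text for t in gold],
        spaces=[t.whitespace_ for t in gold],
    )
    pred.ents = gold.ents  # provide the entities up front
    for name, proc in nlp.pipeline:
        pred = proc(pred)  # relation_extractor now sees the entities
    return pred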

Hope this helps!

Great!

So either I modify the config file, or I source the trained NER model.

Could you please be more precise about which changes I should make to which files? I've tried to follow your explanations but got a little lost.

1) In case I'm modifying the config file

I should add "ner" to the [nlp] [pipeline] attribute in the rel_tok2vec.cfg file ? (and also "tagger","parser","attribute_ruler" and "lemmatizer" if I wish to ?)

And should I also add annotating_components = ["ner"] in the [training] block?

That's it?

Do I need to edit rel_trf.cfg or not? I guess it's only relevant when you use transformers, which I'm not doing, correct? If I used transformers, I should make similar changes in that file?

2) In case I'm sourcing the trained NER model

I should add the code for sourcing the NER model:

source_nlp = spacy.load("path/to/my/ner/model")
nlp = spacy.blank("en")  # this line should already exist in the code
nlp.add_pipe("ner", source=source_nlp)

But where exactly should I add that? Should it be called during the rel component training process, or should I train both components separately and use this code in the test_project_rel file when loading my rel model?

Finally, can I delete parse_data.py if I'm using parse_data_generic.py?

Thanks

To help you help me, here is my non-working config file (which I tried to build by merging the NER model config file and the relations one):

[paths]
train = null
dev = null
vectors = null
init_tok2vec = null

[system]
gpu_allocator = null
seed = 0

[nlp]
lang = "en"
pipeline = ["tok2vec","tagger","parser","attribute_ruler","lemmatizer","ner"]
disabled = []
before_creation = null
after_creation = null
after_pipeline_creation = null
batch_size = 256
tokenizer = {"@tokenizers":"spacy.Tokenizer.v1"}

[components]

[components.attribute_ruler]
factory = "attribute_ruler"
scorer = {"@scorers":"spacy.attribute_ruler_scorer.v1"}
validate = false

[components.lemmatizer]
factory = "lemmatizer"
mode = "rule"
model = null
overwrite = false
scorer = {"@scorers":"spacy.lemmatizer_scorer.v1"}

[components.ner]
factory = "ner"
incorrect_spans_key = null
moves = null
scorer = {"@scorers":"spacy.ner_scorer.v1"}
update_with_oracle_cut_size = 100

[components.ner.model]
@architectures = "spacy.TransitionBasedParser.v2"
state_type = "ner"
extra_state_tokens = false
hidden_width = 64
maxout_pieces = 2
use_upper = true
nO = null

[components.ner.model.tok2vec]
@architectures = "spacy.Tok2Vec.v2"

[components.ner.model.tok2vec.embed]
@architectures = "spacy.MultiHashEmbed.v2"
width = 96
attrs = ["NORM","PREFIX","SUFFIX","SHAPE","SPACY"]
rows = [5000,1000,2500,2500,50]
include_static_vectors = false

[components.ner.model.tok2vec.encode]
@architectures = "spacy.MaxoutWindowEncoder.v2"
width = 96
depth = 4
window_size = 1
maxout_pieces = 3

[components.parser]
factory = "parser"
learn_tokens = false
min_action_freq = 30
moves = null
scorer = {"@scorers":"spacy.parser_scorer.v1"}
update_with_oracle_cut_size = 100

[components.parser.model]
@architectures = "spacy.TransitionBasedParser.v2"
state_type = "parser"
extra_state_tokens = false
hidden_width = 64
maxout_pieces = 2
use_upper = true
nO = null

[components.parser.model.tok2vec]
@architectures = "spacy.Tok2Vec.v2"

[components.parser.model.tok2vec.embed]
@architectures = "spacy.MultiHashEmbed.v2"
width = 96
attrs = ["NORM","PREFIX","SUFFIX","SHAPE","SPACY"]
rows = [5000,1000,2500,2500,50]
include_static_vectors = false

[components.parser.model.tok2vec.encode]
@architectures = "spacy.MaxoutWindowEncoder.v2"
width = 96
depth = 4
window_size = 1
maxout_pieces = 3

[components.tagger]
factory = "tagger"
neg_prefix = "!"
overwrite = false
scorer = {"@scorers":"spacy.tagger_scorer.v1"}

[components.tagger.model]
@architectures = "spacy.Tagger.v2"
nO = null
normalize = false

[components.tagger.model.tok2vec]
@architectures = "spacy.Tok2Vec.v2"

[components.tagger.model.tok2vec.embed]
@architectures = "spacy.MultiHashEmbed.v2"
width = 96
attrs = ["NORM","PREFIX","SUFFIX","SHAPE","SPACY"]
rows = [5000,1000,2500,2500,50]
include_static_vectors = false

[components.tagger.model.tok2vec.encode]
@architectures = "spacy.MaxoutWindowEncoder.v2"
width = 96
depth = 4
window_size = 1
maxout_pieces = 3

[components.tok2vec]
factory = "tok2vec"

[components.tok2vec.model]
@architectures = "spacy.Tok2Vec.v2"

[components.tok2vec.model.embed]
@architectures = "spacy.MultiHashEmbed.v2"
width = 96
attrs = ["NORM","PREFIX","SUFFIX","SHAPE","SPACY"]
rows = [5000,1000,2500,2500,50]
include_static_vectors = false

[components.tok2vec.model.encode]
@architectures = "spacy.MaxoutWindowEncoder.v2"
width = 96
depth = 4
window_size = 1
maxout_pieces = 3

[corpora]
@readers = "prodigy.MergedCorpus.v1"
eval_split = 0.1
sample_size = 1.0
textcat = null
textcat_multilabel = null
parser = null
tagger = null
senter = null
spancat = null

[corpora.ner]
@readers = "prodigy.NERCorpus.v1"
datasets = ["airbus_collins"]
eval_datasets = []
default_fill = "outside"
incorrect_key = "incorrect_spans"

[training]
train_corpus = "corpora.train"
dev_corpus = "corpora.dev"
seed = ${system:seed}
gpu_allocator = ${system:gpu_allocator}
dropout = 0.1
accumulate_gradient = 1
patience = 5000
max_epochs = 0
max_steps = 100000
eval_frequency = 1000
frozen_components = ["tagger","parser","attribute_ruler","lemmatizer"]
before_to_disk = null
annotating_components = []

[training.batcher]
@batchers = "spacy.batch_by_words.v1"
discard_oversize = false
tolerance = 0.2
get_length = null

[training.batcher.size]
@schedules = "compounding.v1"
start = 100
stop = 1000
compound = 1.001
t = 0.0

[training.logger]
@loggers = "prodigy.ConsoleLogger.v1"
progress_bar = false

[training.optimizer]
@optimizers = "Adam.v1"
beta1 = 0.9
beta2 = 0.999
L2_is_weight_decay = true
L2 = 0.01
grad_clip = 1.0
use_averages = true
eps = 0.00000001
learn_rate = 0.001

[training.score_weights]
tag_acc = null
dep_uas = null
dep_las = null
dep_las_per_type = null
sents_p = null
sents_r = null
sents_f = null
lemma_acc = null
ents_f = 1.0
ents_p = 0.0
ents_r = 0.0
ents_per_type = null
speed = 0.0

[pretraining]

[initialize]
vectors = ${paths.vectors}
init_tok2vec = ${paths.init_tok2vec}
vocab_data = null
lookups = null
before_init = null
after_init = null

[initialize.components]

[initialize.tokenizer]

I obtain the following error:

  File "spacy/pipeline/pipe.pyx", line 119, in spacy.pipeline.pipe.Pipe._require_labels
ValueError: [E143] Labels for component 'relation_extractor' not initialized. This can be fixed by calling add_label, or by providing a representative batch of examples to the component's `initialize` method.

I also had to remove the lines that used ConsoleLogger.

Could you please explain how to build a valid config file, and alternatively (which would be the best approach in my opinion) how to source the NER model and train the relation model? I need a detailed explanation to make the component work.

Thank you.

Also, concerning the sourcing method, I managed to obtain named entities when loading the rel model with:

nlp.add_pipe("ner", source=source_nlp)

But I still have:

Could not determine any instances in doc - returning doc as is.

And no relations, only named entities. Could it be that I don't have enough annotations, so no relation can be extracted? Or is the training process of the relation component incorrect? I imagined that if the named entities were recognized, maybe there would be no problem anymore.

hi @stella!

That's great.

It should be fine to ignore those warnings as long as you see the model is actually training. It's just a warning for docs that don't have any relation instances. That's useful when no data can be found in any of the instances in the dataset, but it's too verbose when it only occurs in some of the batches.

We've written an internal note to see if we can improve the logging for this.

Actually, I had already worked on creating a config file that combines both (the other option):

[paths]
train = null
dev = null
vectors = null
init_tok2vec = null

[system]
seed = 342
gpu_allocator = null

[nlp]
lang = "en"
pipeline = ["tok2vec","ner","relation_extractor"]
disabled = []
before_creation = null
after_creation = null
after_pipeline_creation = null
tokenizer = {"@tokenizers":"spacy.Tokenizer.v1"}
batch_size = 1000

[components]

[components.ner]
factory = "ner"
incorrect_spans_key = "incorrect_spans"
moves = null
scorer = {"@scorers":"spacy.ner_scorer.v1"}
update_with_oracle_cut_size = 100

[components.ner.model]
@architectures = "spacy.TransitionBasedParser.v2"
state_type = "ner"
extra_state_tokens = false
hidden_width = 64
maxout_pieces = 2
use_upper = true
nO = null

[components.ner.model.tok2vec]
@architectures = "spacy.Tok2VecListener.v1"
width = 96
upstream = "*"

[components.tok2vec]
factory = "tok2vec"

[components.tok2vec.model]
@architectures = "spacy.HashEmbedCNN.v1"
pretrained_vectors = null
width = 96
depth = 2
embed_size = 2000
window_size = 1
maxout_pieces = 3
subword_features = true

[components.relation_extractor]
factory = "relation_extractor"
threshold = 0.5

[components.relation_extractor.model]
@architectures = "rel_model.v1"

[components.relation_extractor.model.create_instance_tensor]
@architectures = "rel_instance_tensor.v1"

[components.relation_extractor.model.create_instance_tensor.tok2vec]
@architectures = "spacy.Tok2VecListener.v1"
width = ${components.tok2vec.model.width}

[components.relation_extractor.model.create_instance_tensor.pooling]
@layers = "reduce_mean.v1"

[components.relation_extractor.model.create_instance_tensor.get_instances]
@misc = "rel_instance_generator.v1"
max_length = 100

[components.relation_extractor.model.classification_layer]
@architectures = "rel_classification_layer.v1"
nI = null
nO = null

[corpora]

[corpora.dev]
@readers = "spacy.Corpus.v1"
path = ${paths.dev}
max_length = 0
gold_preproc = false
limit = 0
augmenter = null

[corpora.train]
@readers = "spacy.Corpus.v1"
path = ${paths.train}
max_length = 0
gold_preproc = false
limit = 0
augmenter = null

[training]
seed = ${system.seed}
gpu_allocator = ${system.gpu_allocator}
dropout = 0.1
accumulate_gradient = 1
patience = 1600000
max_epochs = 0
max_steps = 10000
eval_frequency = 500
frozen_components = []
dev_corpus = "corpora.dev"
train_corpus = "corpora.train"
before_to_disk = null
annotating_components = ["ner"]
logger = {"@loggers":"spacy.ConsoleLogger.v1"}

[training.batcher]
@batchers = "spacy.batch_by_words.v1"
discard_oversize = false
tolerance = 0.2
get_length = null

[training.batcher.size]
@schedules = "compounding.v1"
start = 100
stop = 1000
compound = 1.001

[training.optimizer]
@optimizers = "Adam.v1"
beta1 = 0.9
beta2 = 0.999
L2_is_weight_decay = true
L2 = 0.01
grad_clip = 1.0
use_averages = false
eps = 0.00000001
learn_rate = 0.001

[training.score_weights]
rel_micro_p = 0.0
rel_micro_r = 0.0
rel_micro_f = 1.0

[initialize]
vectors = ${paths.vectors}
init_tok2vec = ${paths.init_tok2vec}
vocab_data = null
lookups = null
before_init = null
after_init = null

[initialize.components]

[initialize.tokenizer]

This can be run the same way as the train_cpu command in the project.yml, i.e., replace the existing config file with this one.

This produced the same warning but trained for both ner and relations.

Thanks!

Do you think it means that the sourcing method I used works, or not?

Your config file produced an error on my end:

/.local/lib/python3.10/site-packages/thinc/layers/reduce_mean.py", line 19, in forward
    Y = model.ops.reduce_mean(cast(Floats2d, Xr.data), Xr.lengths)
  File "thinc/backends/numpy_ops.pyx", line 318, in thinc.backends.numpy_ops.NumpyOps.reduce_mean
AssertionError

By the way, I wanted to try the sourcing method on Sofie's annotations rather than mine, to check if I could extract relations with this method, since you said you obtained relations with the cfg method on this data. The problem is that there doesn't seem to be a Prodigy command for training a NER model directly from an annotations.jsonl file that already contains annotations? You need a dataset? So it seems I can't generate a separate NER model from her annotations?

It means it works. Just look at your training performance to confirm it's actually training.

What version of spaCy are you running?

You can run spacy info.

I'm running:

$ python -m spacy info

============================== Info about spaCy ==============================

spaCy version    3.5.1                         
Location         /opt/homebrew/lib/python3.9/site-packages/spacy
Platform         macOS-13.2.1-arm64-arm-64bit  
Python version   3.9.16                        
Pipelines        en_core_web_sm (3.5.0), en_core_web_md (3.5.0)

$ python -m spacy train configs/rel_tok2vec2.cfg --paths.train data/train.spacy --paths.dev data/dev.spacy  -c ./scripts/custom_functions.py

ℹ No output directory provided
ℹ Using CPU
ℹ To switch to GPU 0, use the option: --gpu-id 0

=========================== Initializing pipeline ===========================
[2023-03-30 09:34:48,197] [INFO] Set up nlp object from config
[2023-03-30 09:34:48,203] [INFO] Pipeline: ['tok2vec', 'ner', 'relation_extractor']
[2023-03-30 09:34:48,206] [INFO] Created vocabulary
[2023-03-30 09:34:48,207] [INFO] Finished initializing nlp object
[2023-03-30 09:34:48,415] [INFO] Initialized pipeline components: ['tok2vec', 'ner', 'relation_extractor']
✔ Initialized pipeline

============================= Training pipeline =============================
ℹ Pipeline: ['tok2vec', 'ner', 'relation_extractor']
ℹ Set annotations on update for: ['ner']
ℹ Initial learn rate: 0.001
E    #       LOSS TOK2VEC  LOSS NER  LOSS RELAT...  ENTS_F  ENTS_P  ENTS_R  REL_MICRO_P  REL_MICRO_R  REL_MICRO_F  SCORE 
---  ------  ------------  --------  -------------  ------  ------  ------  -----------  -----------  -----------  ------
ℹ Could not determine any instances in doc.
  0       0          0.00     37.00           0.00    0.00    0.00    0.00         0.00         0.00         0.00    0.00
ℹ Could not determine any instances in doc.
ℹ Could not determine any instances in doc.
ℹ Could not determine any instances in doc.

... [skipping these lines]

ℹ Could not determine any instances in doc.
ℹ Could not determine any instances in doc.
2176   10000      50350.76   6422.90           0.00   60.29   91.11   45.05        20.00        43.75        27.45    0.44

Notice that the final line still shows a trained NER model with an out-of-sample F-score of about 60%.

Note I didn't specify an output path as this is only for demo purposes.

Couldn't you run db-in ner_dataset annotations.jsonl to load the annotations into a Prodigy dataset, then run data-to-spacy with --ner ner_dataset to convert the .jsonl to .spacy?
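Something along these lines (a sketch; ner_dataset and the corpus output folder are placeholder names):

python -m prodigy db-in ner_dataset annotations.jsonl
python -m prodigy data-to-spacy corpus --ner ner_dataset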

It works? It really works? Great! Thank you!

I have an older version of spaCy; that could be it! It's 3.4.4.

Ah yes, I remember db-in now; there was a command to do this, for sure. Thanks! I'll test the sourcing method with Sofie's annotations tomorrow to see if the relations can be printed. I'll keep you updated. Thanks again!