Training a relation extraction component

As per the error, it sounds like you may not be passing the tutorial’s config that creates this component.

Are you running your command like the tutorial does:

python -m spacy train configs/rel_tok2vec.cfg --output training --paths.train train.spacy --paths.dev dev.spacy -c ./scripts/custom_functions.py

I have always run:

# this will run the default parse_data, train_cpu, and evaluate commands
python -m spacy project run all 

Now, in the rel_component folder, I ran:

python -m spacy project run train_cpu

================================= train_cpu =================================
ℹ Skipping 'train_cpu': nothing changed

then

python -m spacy train configs/rel_tok2vec.cfg --output training --paths.train train.spacy --paths.dev dev.spacy -c ./scripts/custom_functions.py

and obtained

FileNotFoundError: [Errno 2] No such file or directory: 'train.spacy'

But isn't this command supposed to be the same as python -m spacy project run train_cpu? I don't get this.

Hi @stella,

You’re having a simple lookup issue. I wasn’t sure whether you were still running this as a spaCy project or standalone, so I provided a standalone command that assumes your spaCy binary files are in the same folder you run it from.

I would recommend keeping this as a spaCy project built from Sofie’s tutorial. You would need to modify the project.yml to replace the default parse script with your new one. The spaCy binary files should then go into the data folder, which is where the project's training command looks for them.
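To see why the standalone command fails while the project command works, it can help to check where the binary files actually are. The following is a minimal sketch, assuming the file names train.spacy, dev.spacy and test.spacy and a data subfolder as they appear in the project's logged commands; the helper name missing_corpus_files is made up for illustration.

```python
from pathlib import Path


def missing_corpus_files(project_dir, names=("train.spacy", "dev.spacy", "test.spacy"), subdir="data"):
    """Return the expected corpus files that do not exist under project_dir/subdir."""
    data_dir = Path(project_dir) / subdir
    return [str(data_dir / name) for name in names if not (data_dir / name).is_file()]


# Run from the project root: an empty list means all three files are in place.
print(missing_corpus_files("."))
```

A relative path such as train.spacy is resolved against the folder the command is run from, so passing --paths.train train.spacy only works if the file sits right there rather than in data/.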

Hi Ryan,

I'm not sure I understand your answer. Is the lookup issue related to using your standalone command? I have no issue when using the following command to generate the model:

# this will run the default parse_data, train_cpu, and evaluate commands
python -m spacy project run all 

I think there is still an issue when trying to load the generated model. I'm trying to load the model-best folder with spacy.load.

Whether I try to load it in the rel_component project repository or in another repository, I still get this error:

ValueError: [E002] Can't find factory for 'relation_extractor' for language English (en). This usually happens when spaCy calls `nlp.create_pipe` with a custom component name that's not registered on the current language class. If you're using a Transformer, make sure to install 'spacy-transformers'. If you're using a custom component, make sure you've added the decorator `@Language.component` (for function components) or `@Language.factory` (for class components).

Available factories: attribute_ruler, tok2vec, merge_noun_chunks, merge_entities, merge_subtokens, token_splitter, doc_cleaner, parser, beam_parser, lemmatizer, trainable_lemmatizer, entity_linker, ner, beam_ner, entity_ruler, tagger, morphologizer, senter, sentencizer, textcat, spancat, spancat_singlelabel, future_entity_ruler, span_ruler, textcat_multilabel, en.lemmatizer

I need a clearer explanation of how to solve it, as at the moment I can't load my generated model. Is the issue related to the generation of the model or to the way I'm trying to load it?

Thank you and sorry if I didn't get what you meant.

Hi Stella!

rel_pipe.py defines the relation_extractor factory, and this code needs to be imported when you want to load the model back in. When you're using custom code to load the generated model, you can for instance add rel_pipe.py to your working folder and import make_relation_extractor from it at the top of your script - this will ensure that spaCy has seen the factory definition and is able to load your model.

In the tutorial, during training this bit is accomplished by importing custom_functions.py which, as you can see from the link, imports the right bits to make the factory & config work. In the spacy train command, this is accomplished by adding -c custom_functions.py to the command.
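To find out which custom names a saved pipeline refers to, and therefore which code has to be imported before spacy.load, one option is to scan the config.cfg inside the model folder. This is a minimal sketch using only Python's configparser. The config excerpt is illustrative rather than the tutorial's full file, and the helper name referenced_names is made up.

```python
import configparser

# Illustrative excerpt of a saved pipeline's config.cfg (assumption: the real
# file is longer and may reference more registered functions).
EXAMPLE_CONFIG = '''
[nlp]
lang = "en"
pipeline = ["tok2vec","relation_extractor"]

[components.tok2vec]
factory = "tok2vec"

[components.relation_extractor]
factory = "relation_extractor"

[components.relation_extractor.model]
@architectures = "rel_model.v1"
'''


def referenced_names(config_text):
    """Collect (key, name) pairs for factories and @registry references."""
    parser = configparser.ConfigParser(interpolation=None)
    parser.optionxform = str  # keep keys such as "@architectures" as written
    parser.read_string(config_text)
    found = set()
    for section in parser.sections():
        for key, value in parser.items(section):
            if key == "factory" or key.startswith("@"):
                found.add((key, value.strip('"')))
    return found


for key, name in sorted(referenced_names(EXAMPLE_CONFIG)):
    print(key, name)
```

Any name printed here that is missing from the "Available" list in the error message has to be registered by importing the module that defines it.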

Hi Sofie,

Thanks for your quick answer !

Great, so let's say that I'm trying to import the model in test_project_rel.py (in your architecture).

I've added :

from scripts.rel_pipe import make_relation_extractor

and now the error has changed, it is :

catalogue.RegistryError: [E893] Could not find function 'rel_model.v1' in function registry 'architectures'. If you're using a custom function, make sure the code is available. If the function is provided by a third-party package, e.g. spacy-transformers, make sure the package is installed in your environment.

Available names: spacy-legacy.CharacterEmbed.v1, spacy-legacy.EntityLinker.v1, spacy-legacy.HashEmbedCNN.v1, spacy-legacy.MaxoutWindowEncoder.v1, spacy-legacy.MishWindowEncoder.v1, spacy-legacy.MultiHashEmbed.v1, spacy-legacy.Tagger.v1, spacy-legacy.TextCatBOW.v1, spacy-legacy.TextCatCNN.v1, spacy-legacy.TextCatEnsemble.v1, spacy-legacy.Tok2Vec.v1, spacy-legacy.TransitionBasedParser.v1, spacy-transformers.Tok2VecTransformer.v1, spacy-transformers.Tok2VecTransformer.v2, spacy-transformers.Tok2VecTransformer.v3, spacy-transformers.TransformerListener.v1, spacy-transformers.TransformerModel.v1, spacy-transformers.TransformerModel.v2, spacy-transformers.TransformerModel.v3, spacy.CharacterEmbed.v2, spacy.EntityLinker.v2, spacy.HashEmbedCNN.v2, spacy.MaxoutWindowEncoder.v2, spacy.MishWindowEncoder.v2, spacy.MultiHashEmbed.v2, spacy.PretrainCharacters.v1, spacy.PretrainVectors.v1, spacy.SpanCategorizer.v1, spacy.Tagger.v2, spacy.TextCatBOW.v2, spacy.TextCatCNN.v2, spacy.TextCatEnsemble.v2, spacy.TextCatLowData.v1, spacy.Tok2Vec.v2, spacy.Tok2VecListener.v1, spacy.TorchBiLSTMEncoder.v1, spacy.TransitionBasedParser.v2

So I've also added :

import scripts.rel_model

And now there's no error, only a message when executing the code :

ℹ Could not determine any instances in doc - returning doc as is.

I've used the following example from the documentation and tried to do NER on it :

Apple is looking at buying U.K. startup for $1 billion

Does this mean that, even though I used:

prodigy rel.manual ner_rels en_core_web_sm ...

I can't do NER with named entities from en_core_web_sm?

Also, if relations had been extracted from the text (which is not the case, given the size of my annotations.jsonl file and the given text), I could access them with doc._.rel, is that correct? Is there any specific information stored in ._.rel, such as start, end, label, etc., as there is for .ents? I'm sure you explained this somewhere, but I could not find the information in the video.

Thanks for being so patient with me !

Sorry, but I thought you were dealing with the issue of training in your spaCy project, right?

I'm not sure why you'd need to modify this script for this purpose. Let's go back to where you were:

First, since you've done a few practice training runs in your folder, make sure to first run python -m spacy project run clean, which runs this.

Also, make sure to remove the old annotations.jsonl data and replace it with your data.

Next, have you updated your project.yml here to reference the new parse_data_generic.py instead of the parse_data.py file that is used by default?

If you don't do these steps, running spacy project run all will either not update (because you are using old files) or reference the parse_data.py in the data command as part of the all step.

With those two changes, the data command should export the spaCy binary files into the right place (the data folder), which is where the train_cpu command looks for them.

Instead of running python -m spacy project run all, I would recommend running each step individually so you can check and verify:

# fyi this is the same as running spacy project run all
python -m spacy project run data # check to see your binary files are in data folder
python -m spacy project run train_cpu
python -m spacy project run evaluate

Here's what the custom attribute doc._.rel includes:

This part of the video (25:30) explains it:
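As a rough sketch of how such an attribute could be read, assume doc._.rel is a dictionary keyed by the start-token offsets of an entity pair, with each value mapping relation labels to scores, as in the tutorial's rel_pipe.py. The offsets and scores below are made up for illustration, and relations_above is a hypothetical helper.

```python
# Made-up stand-in for doc._.rel: (head start token, tail start token) -> {label: score}
example_rel = {
    (15, 17): {"Binds": 0.91, "Pos-Reg": 0.03},
    (17, 15): {"Binds": 0.88, "Pos-Reg": 0.02},
}


def relations_above(rel, threshold=0.5):
    """Flatten the nested dict into (head, tail, label, score) tuples above a threshold."""
    hits = []
    for (head_start, tail_start), scores in rel.items():
        for label, score in scores.items():
            if score >= threshold:
                hits.append((head_start, tail_start, label, score))
    return hits


for head, tail, label, score in relations_above(example_rel):
    print(head, tail, label, score)
```

Unlike doc.ents, the entries here are plain offsets and scores rather than Span objects, so the entity text has to be looked up from the doc using the token offsets.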

Hope this helps!

Hi Ryan,

Sorry, maybe I had trouble explaining the issue.

Generating the model runs fine now. Thanks for your last answer. There was an issue when loading the model elsewhere (which is why I tried to load it in test_project_rel.py). Sofie explained to me how to load it correctly (you just need to import some of her libraries), so now everything is okay.

My last misunderstanding is the following: when you load your model, even if you built it on top of the English model (with prodigy rel.manual ner_rels en_core_web_sm ...etc), you can't do NER with the simple example in the documentation. That actually seems logical: there is no information in the annotations.jsonl file concerning the standard English model. So why is it possible to build your custom model on top of the standard one? Does it mean you only use the POS tags of the English model? Are its named entities not imported into your custom model?

Thanks.

Maybe there is still an issue, actually.

When loading any generated model (trained on Sofie's annotations or on mine) and applying it to any text (even from the training data) to do NER and relation extraction, neither named entities nor relations are extracted. The doc is always returned as is.

Is it correct to load the model-best folder with spacy.load()? Am I doing anything wrong?

I can confirm there is still an issue. Please don't leave it unresolved; I really need the relation extraction component to work on my side.

I'll give you more information about my parameters.

I'm working on Sofie's annotations.jsonl file.

I've edited the project.yml file with

    script:
      - "python ./scripts/parse_data_generic.py ${vars.annotations} ${vars.train_file} ${vars.dev_file} ${vars.test_file}"

My parse_data_generic.py contains :

SYMM_LABELS = ["Binds"]
DIRECTED_LABELS = ["Pos-Reg", "Neg-Reg", "No-rel", "Regulates"]

I ran:

python -m spacy project run clean
python -m spacy project run data
python -m spacy project run train_cpu
python -m spacy project run evaluate

This is the output of train_cpu :

================================= train_cpu =================================
Running command: /usr/bin/python -m spacy train configs/rel_tok2vec.cfg --output training --paths.train data/train.spacy --paths.dev data/dev.spacy -c ./scripts/custom_functions.py
ℹ Saving to output directory: training
ℹ Using CPU

=========================== Initializing pipeline ===========================
[2023-03-27 15:34:04,066] [INFO] Set up nlp object from config
[2023-03-27 15:34:04,072] [INFO] Pipeline: ['tok2vec', 'relation_extractor']
[2023-03-27 15:34:04,074] [INFO] Created vocabulary
[2023-03-27 15:34:04,074] [INFO] Finished initializing nlp object
[2023-03-27 15:34:04,103] [INFO] Initialized pipeline components: ['tok2vec', 'relation_extractor']
✔ Initialized pipeline

============================= Training pipeline =============================
ℹ Pipeline: ['tok2vec', 'relation_extractor']
ℹ Initial learn rate: 0.001
E    #       LOSS TOK2VEC  LOSS RELAT...  REL_MICRO_P  REL_MICRO_R  REL_MICRO_F  SCORE 
---  ------  ------------  -------------  -----------  -----------  -----------  ------
  0       0          0.06           0.71        21.88        35.00        26.92    0.27
220     500          0.21           4.58        63.64        35.00        45.16    0.45
625    1000          0.00           0.00        63.64        35.00        45.16    0.45
1125    1500          0.00           0.00        63.64        35.00        45.16    0.45
1625    2000          0.00           0.00        66.67        40.00        50.00    0.50
2125    2500          0.00           0.00        63.64        35.00        45.16    0.45
2625    3000          0.00           0.00        63.64        35.00        45.16    0.45
3125    3500          0.00           0.00        63.64        35.00        45.16    0.45
3625    4000          0.00           0.00        63.64        35.00        45.16    0.45
4125    4500          0.00           0.00        63.64        35.00        45.16    0.45
4625    5000          0.00           0.00        63.64        35.00        45.16    0.45
5125    5500          0.00           0.00        60.00        30.00        40.00    0.40
5625    6000          0.00           0.00        63.64        35.00        45.16    0.45
6125    6500          0.00           0.00        60.00        30.00        40.00    0.40
6625    7000          0.00           0.00        60.00        30.00        40.00    0.40
7125    7500          0.04           0.19        50.00        30.00        37.50    0.37
7625    8000          0.00           0.00        50.00        30.00        37.50    0.37
8125    8500          0.00           0.00        50.00        30.00        37.50    0.37
8625    9000          0.00           0.00        50.00        30.00        37.50    0.37
9125    9500          0.00           0.00        50.00        30.00        37.50    0.37
9625   10000          0.00           0.00        50.00        30.00        37.50    0.37
✔ Saved pipeline to output directory
training/model-last

This is the output of evaluate :

Running command: /usr/bin/python ./scripts/evaluate.py training/model-best data/test.spacy False

Random baseline:
threshold 0.00   {'rel_micro_p': '9.26', 'rel_micro_r': '100.00', 'rel_micro_f': '16.95'}
threshold 0.05   {'rel_micro_p': '9.52', 'rel_micro_r': '100.00', 'rel_micro_f': '17.39'}
threshold 0.10   {'rel_micro_p': '9.28', 'rel_micro_r': '90.00', 'rel_micro_f': '16.82'}
threshold 0.20   {'rel_micro_p': '8.79', 'rel_micro_r': '80.00', 'rel_micro_f': '15.84'}
threshold 0.30   {'rel_micro_p': '9.88', 'rel_micro_r': '80.00', 'rel_micro_f': '17.58'}
threshold 0.40   {'rel_micro_p': '11.94', 'rel_micro_r': '80.00', 'rel_micro_f': '20.78'}
threshold 0.50   {'rel_micro_p': '12.50', 'rel_micro_r': '70.00', 'rel_micro_f': '21.21'}
threshold 0.60   {'rel_micro_p': '6.82', 'rel_micro_r': '30.00', 'rel_micro_f': '11.11'}
threshold 0.70   {'rel_micro_p': '5.88', 'rel_micro_r': '20.00', 'rel_micro_f': '9.09'}
threshold 0.80   {'rel_micro_p': '5.26', 'rel_micro_r': '10.00', 'rel_micro_f': '6.90'}
threshold 0.90   {'rel_micro_p': '0.00', 'rel_micro_r': '0.00', 'rel_micro_f': '0.00'}
threshold 0.99   {'rel_micro_p': '0.00', 'rel_micro_r': '0.00', 'rel_micro_f': '0.00'}
threshold 1.00   {'rel_micro_p': '0.00', 'rel_micro_r': '0.00', 'rel_micro_f': '0.00'}

Results of the trained model:
threshold 0.00   {'rel_micro_p': '9.26', 'rel_micro_r': '100.00', 'rel_micro_f': '16.95'}
threshold 0.05   {'rel_micro_p': '22.86', 'rel_micro_r': '80.00', 'rel_micro_f': '35.56'}
threshold 0.10   {'rel_micro_p': '25.81', 'rel_micro_r': '80.00', 'rel_micro_f': '39.02'}
threshold 0.20   {'rel_micro_p': '34.78', 'rel_micro_r': '80.00', 'rel_micro_f': '48.48'}
threshold 0.30   {'rel_micro_p': '38.10', 'rel_micro_r': '80.00', 'rel_micro_f': '51.61'}
threshold 0.40   {'rel_micro_p': '41.18', 'rel_micro_r': '70.00', 'rel_micro_f': '51.85'}
threshold 0.50   {'rel_micro_p': '40.00', 'rel_micro_r': '60.00', 'rel_micro_f': '48.00'}
threshold 0.60   {'rel_micro_p': '54.55', 'rel_micro_r': '60.00', 'rel_micro_f': '57.14'}
threshold 0.70   {'rel_micro_p': '66.67', 'rel_micro_r': '60.00', 'rel_micro_f': '63.16'}
threshold 0.80   {'rel_micro_p': '75.00', 'rel_micro_r': '60.00', 'rel_micro_f': '66.67'}
threshold 0.90   {'rel_micro_p': '80.00', 'rel_micro_r': '40.00', 'rel_micro_f': '53.33'}
threshold 0.99   {'rel_micro_p': '100.00', 'rel_micro_r': '10.00', 'rel_micro_f': '18.18'}
threshold 1.00   {'rel_micro_p': '0.00', 'rel_micro_r': '0.00', 'rel_micro_f': '0.00'}
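The trained-model numbers above can be compared programmatically to see which threshold gives the best F-score. This is a small sketch that copies the reported precision, recall and F values and recomputes F as 2PR / (P + R). The helper names are made up, and with a test set this small the best threshold should be read as indicative only.

```python
# (threshold, precision, recall, F-score) of the trained model, copied from the
# evaluate output above.
TRAINED_RESULTS = [
    (0.00, 9.26, 100.00, 16.95),
    (0.05, 22.86, 80.00, 35.56),
    (0.10, 25.81, 80.00, 39.02),
    (0.20, 34.78, 80.00, 48.48),
    (0.30, 38.10, 80.00, 51.61),
    (0.40, 41.18, 70.00, 51.85),
    (0.50, 40.00, 60.00, 48.00),
    (0.60, 54.55, 60.00, 57.14),
    (0.70, 66.67, 60.00, 63.16),
    (0.80, 75.00, 60.00, 66.67),
    (0.90, 80.00, 40.00, 53.33),
    (0.99, 100.00, 10.00, 18.18),
    (1.00, 0.00, 0.00, 0.00),
]


def f_score(precision, recall):
    """Harmonic mean of precision and recall (both in percent)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def best_threshold(results):
    """Return the threshold whose reported F-score is highest."""
    return max(results, key=lambda row: row[3])[0]


print("best threshold:", best_threshold(TRAINED_RESULTS))
print("recomputed F at 0.80:", round(f_score(75.00, 60.00), 2))
```

On these numbers the trained model peaks at a threshold of 0.80, well above the random baseline at every threshold between 0.05 and 0.99.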

So it looks like the model was correctly generated, right ?

But now I want to use it. So that is why I'm trying to load it in the test_project_rel.py file. I've used a long line of text from the training set, to maximize the chances of recognizing something. What I'm doing is :

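import spacy
# Importing these modules registers the custom factory and architecture with spaCy
from scripts.rel_pipe import make_relation_extractor
import scripts.rel_model

model = "training/model-best"
text = "Transcriptional regulation of lysosomal acid lipase in differentiating monocytes is mediated by transcription factors Sp1 and AP-2. \nHuman lysosomal acid lipase (LAL) is a hydrolase required for the cleavage of cholesteryl esters and triglycerides derived from plasma lipoproteins."
nlp = spacy.load(model)
doc = nlp(text)
# Print any named entities found in the doc
for ent in doc.ents:
    print(ent.text, ent.start_char, ent.end_char, ent.label_)
# Print the entries of the custom relation attribute
for rel in doc._.rel:
    print(rel)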

And I obtain:

ℹ Could not determine any instances in doc - returning doc as is.

Not even named entities are recognized. It is exactly the same with my own annotations.jsonl file (and my named entities and relations labels). Still, when I'm training a regular NER model outside the relation component, everything runs fine.

My team really needs this component working. Could you please explain to me what the remaining issue is?

Thank you very much.

hi @stella,

You are getting this because there are no relations predicted in that example. The likely problem is that you're not passing any entities with the example.
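To illustrate why missing entities lead to that message, here is a minimal sketch of candidate generation, assuming the relation component builds its instances from ordered pairs of entities that lie within some maximum token distance, as the tutorial's instance generator does. Entities are reduced to their start-token offsets, and the function name and the max_length value of 100 are hypothetical.

```python
from itertools import permutations


def candidate_pairs(entity_starts, max_length=100):
    """Return ordered (head, tail) pairs of distinct entities within max_length tokens."""
    return [
        (head, tail)
        for head, tail in permutations(entity_starts, 2)
        if abs(tail - head) <= max_length
    ]


print(candidate_pairs([]))        # no entities, so nothing to classify
print(candidate_pairs([15, 17]))  # two entities give two ordered pairs
```

With an empty doc.ents there are no pairs to score, so the component has nothing to predict on and returns the doc unchanged.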

The original code Sofie provided only trains a relation component, not a ner component. As Sofie states in her video, her config file only trains the relation component, not the ner component (please watch this part: 19:12).

That is, the default config file only has the tok2vec (or transformer) and the relation_extractor components:

pipeline = ["tok2vec", "relation_extractor"]

Since it doesn't have a ner component, it won't automatically train the ner component. But I think you previously created a separate ner which is covered in this spaCy GitHub issue:

If you want to train the ner and the relation_extraction component together, you'll need to set annotating_components = ["ner"] in your config. This will make the NER predictions available to the downstream relation extraction component, so it can use them to predict relations. Alternatively, you could train this in two steps, with two configs: the first focusing only on the NER, and the second sourcing the trained NER model and then training the relation extraction.

If you want to train both together, make sure to also add the ner component itself to the config (not just annotating_components = ["ner"]).
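As a rough illustration of those two changes, the sketch below patches a config text with Python's configparser. The starting config is a cut-down stand-in, not the tutorial's full rel_tok2vec.cfg, the function name add_ner is made up, and the remaining ner settings (such as its model block) are deliberately left out and would still need to be filled in.

```python
import configparser
import io
import json

# Cut-down stand-in for the training config (assumption: only the settings
# discussed here are shown).
BASE_CONFIG = '''
[nlp]
pipeline = ["tok2vec","relation_extractor"]

[training]
'''


def add_ner(config_text):
    """Add a ner component and mark it as an annotating component."""
    cfg = configparser.ConfigParser(interpolation=None)
    cfg.optionxform = str  # keep option names exactly as written
    cfg.read_string(config_text)

    pipeline = json.loads(cfg["nlp"]["pipeline"])
    if "ner" not in pipeline:
        # ner must run before the relation extractor that reads doc.ents
        pipeline.insert(pipeline.index("relation_extractor"), "ner")
    cfg["nlp"]["pipeline"] = json.dumps(pipeline)

    if not cfg.has_section("components.ner"):
        cfg.add_section("components.ner")
    cfg["components.ner"]["factory"] = json.dumps("ner")

    cfg["training"]["annotating_components"] = json.dumps(["ner"])

    out = io.StringIO()
    cfg.write(out)
    return out.getvalue()


print(add_ner(BASE_CONFIG))
```

Editing the .cfg file by hand achieves the same thing. The script form is only meant to make the two required changes explicit: the component has to be in the pipeline, and it has to be listed under annotating_components.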

Sofie mentioned in an older post how to approach this:

But as she mentioned, a second option is a two-step training with two configs: the first for ner and the second for relations (just be sure to source the trained NER model when training the relation extraction). This can be a fallback if you can't get the single config running.

Last, have you spent some time looking at the evaluate.py script?

You'll find how to predict relations when you know the entities.

Hope this helps!

Great !

So either I modify the config files, or I source the trained NER model.

Could you please be more precise about which changes I should make to which files? I've tried to follow your explanations but got a little lost.

1) In case I'm modifying the config file

I should add "ner" to the pipeline setting in the [nlp] section of the rel_tok2vec.cfg file? (And also "tagger", "parser", "attribute_ruler" and "lemmatizer" if I wish to?)

And I should also add annotating_components = ["ner"] in the [training] section?

That's it?

Do I need to edit rel_trf.cfg or not? I guess it's only useful when you use transformers, which I'm not doing, correct? If I used transformers, should I then make similar changes in this file?

2) In case I'm sourcing the trained NER model

I should add the code for sourcing the NER model :

import spacy

source_nlp = spacy.load("path/to/my/ner/model")
nlp = spacy.blank("en")  # this line should already exist in the code
nlp.add_pipe("ner", source=source_nlp)

But where exactly should I add that? Should it be called during the rel component training process, or should I train both components separately and use this code in the test_project_rel file when I'm trying to load my rel model?

Finally, can I delete parse_data.py if I'm using parse_data_generic.py ?

Thanks