Rel training

Hello!

This is somewhat of a follow-up to my previous posts, but I made a separate topic for clarity.
I have seen similar topics, but it is still not really clear to me what I should do.

I have annotated and trained a NER model through Prodigy on Greek documents with custom entities. The whole process for NER was intuitive and easy to train. I think my results, considering the difficulty of my case, are satisfactory.

Afterwards, and with your help on my previous posts, I continued with the annotation of some custom relationships using rel.manual. I gave my previously labeled NER dataset as input and annotated the relationships between the entities.

Now I want to train a model from the annotations created by rel.manual. I have seen that Prodigy does not yet support a train -rel option, so the process is not as easy as training my NER model. I have also seen the tutorial posted on YouTube by Sofie, but I have not managed to adapt it to train on my data.

I have three main questions.
First, the annotations extracted from rel.manual also contain the named entities, but I think I have seen it mentioned elsewhere that it is still better to train the two tasks separately. So should I keep my already trained NER model and train a REL model on top of that?
Second, given the above scenario, how do I actually do that in terms of pipeline training? Is changing the config file appropriately enough, or does it require something more?
Third, is there a generic way to alter the code of projects/tutorials/rel_component at v3 · explosion/projects · GitHub to run on my extracted annotations?

I know my questions are pretty vague and it is probably unrealistic to expect specific answers, so if you need any more info from me to make them more concrete, please let me know.
As always, I am grateful for your help!

Hi Pantelis,

In general your workflow is exactly how we'd recommend it. The REL component was developed as a tutorial and is not really robust and generic enough to cover any use-case, but I hope the provided example code can help you hit the ground running.

First, the annotations extracted from rel.manual also contain the named entities, but I think I have seen it mentioned elsewhere that it is still better to train the two tasks separately. So should I keep my already trained NER model and train a REL model on top of that?

Yes, it would be better to train the two tasks separately. Imagine having a REL component that links addresses to people names. Your NER model would recognize ADDRESS and PERSON labels. If you train your NER model on the REL data, it will receive as training data only entities that are actually in a relationship, which is highly confusing to the NER. Instead, you want your NER to be trained on consistent data: all names and addresses in your text, whether they are in a relationship or not. Then leave that second part up to the REL model.

Second, given the above scenario, how do I actually do that in terms of pipeline training? Is changing the config file appropriately enough, or does it require something more?

Let's assume you have a trained NER model on disk as part of a spaCy pipeline called ner_trained. Then you'll have a pipeline defined like this (minimally):

[nlp]
pipeline = ["ner","relation_extractor"]

[components.ner]
source = "ner_trained"

[components.relation_extractor]
factory = "relation_extractor"

[training]
frozen_components = ["ner"]

Then, there are two ways in which you link the training together.

  1. you use the gold entities from your training data in the REL training. This ensures that the training process receives "clean" instances, allowing it to learn "better", but it may also result in a REL model that is less robust against mistakes by the NER. This is what the tutorial code does. It does it sneakily by setting the gold entities on the predicted doc in the reader that is used during training: projects/custom_functions.py at v3 · explosion/projects · GitHub

  2. you use the predicted entities (from your previously trained NER model) during REL training. This will make the training data slightly more noisy, but could result in more robust performance on realistic data, because the REL will have to deal with wrong entities at some point (you'll never have 100% accuracy). The example code is set up to do that, because it defines the instances from eg.predicted. You do need one more addition to your config file for this:

[training]
frozen_components = ["ner"]
annotating_components = ["ner"]

The annotating_components setting ensures that even though the NER is frozen (i.e. not being updated), it will still run (before the REL) and set its predictions on eg.predicted, so the REL model can use those.
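Since it is easy to list a component as frozen and forget to list it as annotating, here is a minimal sketch of a sanity check on those two settings. It only uses Python's standard library and treats the config as an INI file with JSON values, which is an approximation of how spaCy reads it; the function name check_training_components is made up for this example and is not part of spaCy.

```python
import json
from configparser import ConfigParser

# The minimal config from this post, with both [training] settings combined.
CONFIG_TEXT = """
[nlp]
pipeline = ["ner","relation_extractor"]

[components.ner]
source = "ner_trained"

[components.relation_extractor]
factory = "relation_extractor"

[training]
frozen_components = ["ner"]
annotating_components = ["ner"]
"""


def check_training_components(config_text):
    """Return a list of hints about frozen/annotating settings (hypothetical helper)."""
    parser = ConfigParser(interpolation=None)  # leave ${...} variables untouched
    parser.read_string(config_text)
    pipeline = json.loads(parser["nlp"]["pipeline"])
    training = parser["training"]
    frozen = json.loads(training.get("frozen_components", fallback="[]"))
    annotating = json.loads(training.get("annotating_components", fallback="[]"))

    hints = []
    for name in frozen + annotating:
        if name not in pipeline:
            hints.append(f"'{name}' is not in nlp.pipeline")
    for name in frozen:
        if name not in annotating:
            # Only a problem if a later component needs this one's predictions.
            hints.append(f"'{name}' is frozen but not annotating")
    return hints


print(check_training_components(CONFIG_TEXT))
```

If you go with option 1 (gold entities set by the reader), a "frozen but not annotating" hint is harmless; it only matters for option 2.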

Third, is there a generic way to alter the code of projects/tutorials/rel_component at v3 · explosion/projects · GitHub to run on my extracted annotations?

It is on our roadmap to work on a more robust and generic version of the REL model, but I'm afraid it is not currently available and won't be in the near future.
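In the meantime, most of the adaptation work tends to be in the data-parsing step. As a rough sketch (not the tutorial's actual code), the function below turns one rel.manual record into a mapping keyed by the start tokens of the head and child entities, which is, as far as I recall, the shape the tutorial stores on each doc. The field names (answer, relations, head_span, child_span, token_start) are my assumption about what a rel.manual export looks like, so please check them against one line of your own db-out output. The sample record and the helper name relations_from_task are made up.

```python
import json
from typing import Dict, Tuple

RelationMap = Dict[Tuple[int, int], Dict[str, float]]


def relations_from_task(task: dict) -> RelationMap:
    """Map (head_start_token, child_start_token) -> {label: 1.0} for one record."""
    rels: RelationMap = {}
    if task.get("answer") != "accept":
        return rels  # skip rejected or ignored annotations
    for rel in task.get("relations", []):
        # Use the first token of each entity span as its identifier.
        head_start = rel["head_span"]["token_start"]
        child_start = rel["child_span"]["token_start"]
        rels.setdefault((head_start, child_start), {})[rel["label"]] = 1.0
    return rels


# A made-up record in the assumed export format (one JSONL line).
line = json.dumps({
    "text": "Maria Papadopoulou lives at 12 Ermou Street",
    "answer": "accept",
    "relations": [{
        "head": 1,
        "child": 6,
        "label": "LIVES_AT",
        "head_span": {"token_start": 0, "token_end": 1, "label": "PERSON"},
        "child_span": {"token_start": 4, "token_end": 6, "label": "ADDRESS"},
    }],
})

task = json.loads(line)
print(relations_from_task(task))
```

The tutorial additionally fills in negative examples (entity pairs without a relation), which this sketch leaves out.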

Hope this helps!

Thank you Sofie, for your detailed answer!

It is reassuring to know my train of thought was on the right track.
I appreciate your suggestions and I am currently fiddling with the example to see what works best for me.
So far, I have edited the config file (according to your advice) and made some initial changes to the other .py files. My first training experiment resulted in lots of "Could not determine any instances in doc" warnings and really bad RelEx results.
I am also searching the spaCy Discussions forum, where there is a lot of useful information as well, and I saw there that the above warning occurs due to max_length and batch_size issues.
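To make the max_length part concrete, here is a small sketch of how I understand the tutorial's instance generator: it pairs up the entities in a doc and keeps only pairs whose start tokens are at most max_length apart. This is a simplified stand-in using plain tuples rather than spaCy spans, and candidate_pairs is a name I made up. A doc with fewer than two entities, or with entities that are all too far apart, produces no instances.

```python
from itertools import permutations
from typing import List, Tuple

Entity = Tuple[int, int, str]  # (start_token, end_token, label)


def candidate_pairs(ents: List[Entity], max_length: int) -> List[Tuple[Entity, Entity]]:
    """Ordered pairs of distinct entities whose start tokens are close enough."""
    return [
        (first, second)
        for first, second in permutations(ents, 2)
        if abs(second[0] - first[0]) <= max_length
    ]


ents = [(0, 2, "PERSON"), (150, 153, "ADDRESS")]

print(len(candidate_pairs(ents, max_length=100)))  # entities too far apart
print(len(candidate_pairs(ents, max_length=200)))  # both directions are kept
print(len(candidate_pairs(ents[:1], max_length=200)))  # a single entity has no pairs
```

So in long documents, or when the NER predicts few entities, it seems plausible to end up with no instances at all.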

If I have any more concrete questions I will get back to this. I am curious whether you have any other suggestions regarding Relation Extraction (or Coreference Resolution) approaches in general.
I hope both Prodigy and spaCy continue to evolve and make modern NLP research even more approachable than they already have!

With gratitude,
Pantelis

Hi, I am trying to do a similar thing. I have trained an NER component and have the model saved in the "model-best" folder. Now I am trying to train the relation extraction component on top of that pretrained model, but I am having a hard time figuring out what my config file should look like.

I changed the rel_tok2vec.cfg file as described above and I get this error:

ValueError: [E143] Labels for component 'relation_extractor' not initialized. This can be fixed by calling add_label, or by providing a representative batch of examples to the component's `initialize` method.

Even though I added:

[initialize.components.relation_extractor]

[initialize.components.relation_extractor.labels]
@readers = "spacy.read_labels.v1"
path = "data\\spancat.json"

hi @korneliaB!

Can you try to run:

python -m spacy debug config ./configs/rel_tok2vec.cfg

This will give debugging details on your config file.

Also, can you take a look at this spaCy Discussions post:

The user had the same [E143] error and found that:

During the saving on my training data in the DocBin format, I wasn't using the parameter store_user_data=True so my relations were not being saved.

If this doesn't solve your problem, can you post your question on the spaCy Discussions forum? This forum is for Prodigy and your question is more about spaCy.

The spaCy core team answers questions on that forum, so they will likely be able to help you debug your problem. Be sure to read the Discussions FAQ and best practices before posting.