Training a relation extraction component

Hi @stella!

Yes, this is exactly the setup Sofie does. She explicitly says at the beginning that she's going to assume she already has a trained NER component.

Yes! Sofie used Thinc for training. You can see the training code here; she carefully explains the Thinc model script from 8:11 to 18:30. Then, around 22:55, she gives an overview of the TrainablePipe API and how to implement the custom component. You may not need to know all of the details, and you can luckily leverage a lot of the project she developed.
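To give you a rough feel for what that overview covers, here's a bare-bones Python skeleton of a TrainablePipe subclass. The names are made up for illustration; the real implementation lives in the tutorial's scripts folder (rel_pipe.py and rel_model.py):

# Rough skeleton only -- made-up names; see the tutorial's rel_pipe.py and
# rel_model.py for the real relation extractor.
from spacy.pipeline import TrainablePipe

class MyRelationExtractor(TrainablePipe):
    def __init__(self, vocab, model, name="my_rel"):
        self.vocab = vocab
        self.model = model  # the Thinc model built in the model script
        self.name = name

    def predict(self, docs):
        # run the Thinc model and return its raw scores
        return self.model.predict(docs)

    def set_annotations(self, docs, scores):
        # write the predicted relations back onto each doc
        ...

    def update(self, examples, *, drop=0.0, sgd=None, losses=None):
        # compute the loss, backprop through self.model, and return the losses
        ...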

I've tried my best to simplify the steps you'd need to take to train your relations component:

1. Clone the rel_component tutorial

python -m spacy project clone tutorials/rel_component

This step assumes you already have spaCy installed, ideally in a fresh virtual environment.

2. Replace annotations with your data

The simplest approach would be to export your relations annotations with db-out and replace assets/annotations.jsonl with your new file.
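If you'd rather script the export from Python instead of using the CLI, something like this would produce the same file (just a sketch; "rel_annotations" is a placeholder for your own dataset name):

# Sketch: export a Prodigy dataset to JSONL from Python.
# "rel_annotations" is a placeholder -- use your own dataset name.
import srsly
from prodigy.components.db import connect

db = connect()                                 # connects to the Prodigy database
examples = db.get_dataset("rel_annotations")   # all examples in the dataset
srsly.write_jsonl("assets/annotations.jsonl", examples)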

3. Modify parse_data.py based on your unique labels

This is likely the toughest step as you'll need to modify her code a bit.

Unfortunately, there isn't a data-to-spacy command for relations, since there isn't a built-in spaCy component for training relations.

However, Sofie created the script parse_data.py to handle this conversion. You may simply need to modify that code, similar to what's described in this post.

Also, at the end of that post, there's a related post with an example of someone who modified parse_data.py too:

That user used a different annotation tool, but then modified the parse_data.py code (see here) to convert their data into the .spacy format. Hopefully that, together with the rough sketch at the end of this step, will give you enough to modify the script.

If you're still having difficulty, please provide an example of your data and your attempted script. We can then help coach you.
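In the meantime, here's a very rough sketch of the kind of conversion a parse_data-style script has to do: read the annotations JSONL, rebuild the Doc objects, attach the entities and the gold relations (the tutorial uses a custom doc._.rel attribute), and serialize everything to .spacy files. This is not Sofie's script, and the JSONL keys below ("tokens", "spans", "relations", "head_span", "child_span") are what rel.manual typically produces, so double-check them against your own data and use her parse_data.py as the real starting point:

# Rough sketch only -- Sofie's parse_data.py does more than this (e.g. it also
# records 0.0 for entity pairs that have no relation), so treat this as a map
# of the steps rather than a drop-in replacement.
import random
import srsly
import spacy
from spacy.tokens import Doc, DocBin

Doc.set_extension("rel", default={})   # the tutorial's custom attribute
nlp = spacy.blank("en")                # assuming English text

docs = []
for eg in srsly.read_jsonl("assets/annotations.jsonl"):
    if eg.get("answer") != "accept":
        continue
    # rebuild the Doc from the saved tokens so it lines up with the annotations
    doc = Doc(nlp.vocab,
              words=[t["text"] for t in eg["tokens"]],
              spaces=[t.get("ws", True) for t in eg["tokens"]])
    # entities: character offsets -> spaCy spans
    spans = [doc.char_span(s["start"], s["end"], label=s["label"]) for s in eg.get("spans", [])]
    doc.ents = [s for s in spans if s is not None]
    # relations: keyed by the first token index of each entity, label -> 1.0
    rels = {}
    for rel in eg.get("relations", []):
        key = (rel["head_span"]["token_start"], rel["child_span"]["token_start"])
        rels.setdefault(key, {})[rel["label"]] = 1.0
    doc._.rel = rels
    docs.append(doc)

# split into train/dev and write the .spacy files the train step expects
random.shuffle(docs)
split = int(0.8 * len(docs))
DocBin(docs=docs[:split], store_user_data=True).to_disk("data/train.spacy")
DocBin(docs=docs[split:], store_user_data=True).to_disk("data/dev.spacy")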

4. Train model

I recommend starting with CPU training. If your data is in .spacy format in the data folder (which is what parse_data.py produces), you can then run spacy train.

Since this is set up as a spaCy project, you can then simply run:

python -m spacy project run train_cpu

This is essentially just running spacy train (see project.yml):

python -m spacy train ${vars.tok2vec_config} --output training --paths.train ${vars.train_file} --paths.dev ${vars.dev_file} -c ./scripts/custom_functions.py

The ${...} placeholders are spaCy project variables that are specified in the project.yml; e.g., ${vars.tok2vec_config} points to the tok2vec training config file.

Recommendation: run the existing project before using your own data

Before starting on your own data, I would recommend running the sample project on Sofie's data first to make sure you have everything set up correctly (e.g., the correct spaCy version). This takes just two commands:

# assumes you have spaCy installed in an activated venv
python -m spacy project clone tutorials/rel_component
# this will run the default parse_data, train_cpu, and evaluate commands
python -m spacy project run all 

By running just these two commands, you should be able to rebuild Sofie's trained relation extraction component. She discusses this part in more detail from 32:10 to 34:00, including some background on spaCy projects.
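Once training has finished, you can load the result like any other spaCy pipeline. One caveat: because the relation extractor is a custom component, the tutorial's component code (rel_pipe / rel_model) has to be imported first so the custom factory and architectures are registered; otherwise spacy.load will complain about an unknown factory (the project's own evaluate script shows one way to do this). A rough sketch, assuming the default output path:

# Sketch: load the trained pipeline and inspect its predictions.
# Make sure the tutorial's custom component code (rel_pipe / rel_model) has been
# imported first, so the "relation_extractor" factory is registered.
import spacy

nlp = spacy.load("training/model-best")    # output folder of the train step
doc = nlp("Some example text containing the entities you trained on.")
print(doc.ents)
print(doc._.rel)   # the tutorial stores the relation scores in this custom attribute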

More advanced: adding in a transformer

Once you have the CPU version running, I would recommend following Sofie's instructions on training with a transformer. The big differences are that you'll need spacy-transformers installed and you'll run the transformer config instead. I highly recommend watching the video from 34:39 to 37:15, where Sofie goes into more detail about using a transformer for training.

Hope this helps!
