Hi @stella!
Yes, this is exactly the setup Sofie does. She explicitly says from the beginning she's going to assume she already has a trained `ner` component.
Yes! Sofie used Thinc for training. You can see the training code here; she carefully explains the Thinc model script from 8:11 to 18:30. Then, around 22:55, she gives an overview of the TrainablePipe API and how to implement the custom component. You may not need to know all of the details, and you can luckily leverage a lot of the project she developed.
I've tried my best to simplify the steps you'd need to do to train your relations component:
1. Clone the `rel_component` tutorial

```
python -m spacy project clone tutorials/rel_component
```
This step assumes you already have spaCy installed, ideally in a fresh virtual environment.
2. Replace annotations with your data
The simplest approach would be to `db-out` your relations annotations and replace `assets/annotations.jsonl` with your new file.
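Before swapping in your file, it's worth sanity-checking one record of your export. This is a minimal sketch (not from Sofie's repo): the field names (`spans`, `relations`, `head`, `child`, `label`) follow Prodigy's relations interface, but double-check them against your own export.

```python
import json

# Inspect one line of an exported relations annotation (sample record
# is illustrative; your real data comes from `db-out`).
line = (
    '{"text": "Acme hired Jo.", '
    '"spans": [{"start": 0, "end": 4, "label": "ORG"}, '
    '{"start": 11, "end": 13, "label": "PERSON"}], '
    '"relations": [{"head": 0, "child": 2, "label": "EMPLOYS"}]}'
)

record = json.loads(line)
# Collect the unique relation labels you'll need to handle in step 3.
rel_labels = {rel["label"] for rel in record.get("relations", [])}
print(rel_labels)
```

Iterating this over every line of your `.jsonl` file gives you the full label set to plug into the script in the next step.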
3. Modify `parse_data.py` based on your unique labels
This is likely the toughest step as you'll need to modify her code a bit.
Unfortunately, there isn't a `data-to-spacy` command for relations as described, since there isn't a built-in spaCy component for training relations.
However, Sofie created the script `parse_data.py`. You may simply need to modify the code, similar to this post:
Also, at the end of that post, there's a related post with an example of someone who modified `parse_data.py` too:
This user used a different tool for annotation, but then modified the `parse_data.py` code (see here) to convert the data into the `.spacy` format. Hopefully that'll give you enough to modify the script.
If you're still having difficulty, please provide an example of your data and your attempted script. We can then help coach you.
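For orientation, the core of the conversion generally looks like the sketch below. This is hypothetical code, not Sofie's exact script: the `rel` extension mirrors what the `rel_component` tutorial registers, but the dictionary layout is an assumption you should adapt to your own training config.

```python
import spacy
from spacy.tokens import Doc, DocBin

# Register a custom extension to carry the gold relations, as the
# rel_component tutorial does.
if not Doc.has_extension("rel"):
    Doc.set_extension("rel", default={})

nlp = spacy.blank("en")
doc = nlp("Acme hired Jo.")
# Assumed layout: (head token index, child token index) -> {label: score}
doc._.rel = {(0, 2): {"EMPLOYS": 1.0}}

# store_user_data=True keeps the custom extension in the serialized data.
db = DocBin(store_user_data=True)
db.add(doc)
db.to_disk("train.spacy")
```

In your version you'd loop over the annotation records from step 2, build one `Doc` per record, and split the results into train/dev/test files.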
4. Train the model
I recommend starting with CPU. If your data is in `.spacy` format in the `data` folder (which is what `parse_data.py` produces), you can then run `spacy train`.
Since this is set up as a spaCy project, once everything is in place you can run:
```
python -m spacy project run train_cpu
```
This is essentially just running `spacy train` (see `project.yml`):
```
python -m spacy train ${vars.tok2vec_config} --output training --paths.train ${vars.train_file} --paths.dev ${vars.dev_file} -c ./scripts/custom_functions.py
```
The `${xxx}` entries are spaCy project variables that are specified in `project.yml`, e.g., `${vars.tok2vec_config}` points to this file:
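For reference, the `vars` block in `project.yml` looks roughly like the sketch below. The variable names come from the train command above, but the paths are illustrative assumptions; check your cloned copy for the real values.

```yaml
vars:
  tok2vec_config: "configs/rel_tok2vec.cfg"   # illustrative path
  train_file: "data/train.spacy"              # illustrative path
  dev_file: "data/dev.spacy"                  # illustrative path
```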
Run the existing project before trying your own data
Before beginning on your own data, I would recommend running the sample project first on Sofie's data to make sure you have everything set up correctly (e.g., the correct spaCy version). This takes just two commands:
```
# assumes you have spaCy in an activated venv
python -m spacy project clone tutorials/rel_component
# this will run the default parse_data, train_cpu, and evaluate commands
python -m spacy project run all
```
By running just these two commands, you should be able to rebuild Sofie's trained relation extraction component. She discusses this part in more detail from 32:10 to 34:00, including some background on spaCy projects.
More advanced: adding in a transformer
Once you get the CPU version running, I would recommend following Sofie's instructions on training with a transformer. The big difference is that you'll need `spacy-transformers` installed and to run the transformer config. I highly recommend watching the video from 34:39 to 37:15, where Sofie discusses using the transformer for training.
Hope this helps!