I am using
python -m prodigy rel.manual ner_rel blank:en transcripts/xxx.txt --label LABEL1,LABEL2 --span-label LABEL1,LABEL2
recipe to annotate the dataset for relationship extraction
I am trying to use this dataset to train the rel_component using the instructions here:
but I am stuck: how do I convert the .jsonl dataset that I get into training.spacy and dev.spacy?
You should be able to use the data-to-spacy recipe for this purpose. You can use the
--parser parameter to achieve that. Something like this:
prodigy data-to-spacy ./corpus --parser <my-dataset>
It worked, but now I get the following error:
ValueError: [E143] Labels for component 'relation_extractor' not initialized. This can be fixed by calling add_label, or by providing a representative batch of examples to the component's `initialize` method.
I ran the same code on data I prepared last year in a different annotation tool (UBIAI) and it works just fine, so I am fairly certain there is something wrong with this dataset.
OK, let us step back for a bit. I realized that since you already have the labeled documents in Prodigy, you can export them to .jsonl and then reuse or modify this parse_data.py script to convert the JSONL files into the spaCy format.
It errored out because the component expects some labels to be available before it is initialized. You can see this being done in the
main function. So you have to do something like:
python scripts/parse_data.py path/to/json path/to/train.spacy path/to/dev.spacy path/to/test.spacy
If you're using your own dataset, you might need to adjust the parsing process. But a good first step would be to try this script out on your own exported JSONL files.
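Before converting, it can help to check that the exported JSONL actually contains relation labels, since E143 is raised when the component has no labels to work with. Below is a minimal sketch using only the standard library. It assumes each exported task has an "answer" field and a "relations" list whose items carry a "label" key. The function name count_relation_labels and the sample records are made up for illustration, so adjust the keys if your export looks different.

```python
import json
from collections import Counter


def count_relation_labels(lines):
    """Count relation labels across accepted tasks.

    `lines` is any iterable of JSONL strings, e.g. an open file handle.
    """
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        task = json.loads(line)
        # Assumption: only tasks marked "accept" should be used for training.
        if task.get("answer") != "accept":
            continue
        # Assumption: relations live under "relations", each with a "label".
        for relation in task.get("relations", []):
            counts[relation["label"]] += 1
    return counts


# Tiny in-memory stand-in for an exported file (made-up data).
sample = [
    json.dumps({"text": "A binds B", "answer": "accept",
                "relations": [{"head": 0, "child": 2, "label": "LABEL1"}]}),
    json.dumps({"text": "C and D", "answer": "accept", "relations": []}),
    json.dumps({"text": "E binds F", "answer": "reject",
                "relations": [{"head": 0, "child": 2, "label": "LABEL2"}]}),
]
label_counts = count_relation_labels(sample)
print(label_counts)
```

To run it on a real export, pass an open file handle instead of the sample list, for example count_relation_labels(open("path/to/json", encoding="utf8")). If the result is empty, or the labels are not the ones you expect, the converted .spacy files probably will not give the component anything to initialize its labels from.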
August 21, 2022, 12:46pm
This is a good article on that. Maybe the linked resources will help you.