data-to-spacy for rel_component training

Hi,

I am using the rel.manual recipe to annotate a dataset for relation extraction:

python -m prodigy rel.manual ner_rel blank:en transcripts/xxx.txt --label LABEL1,LABEL2 --span-label LABEL1,LABEL2

I am trying to use this dataset to train the rel_component using the instructions here:

but I am stuck: how do I convert the .jsonl dataset that I get into training.spacy and dev.spacy?

Hi @korneliaB !

You should be able to use data-to-spacy for this purpose. You can use the --parser parameter to achieve that. Something like this:

prodigy data-to-spacy ./corpus --parser <my-dataset> 

Thank you!

It worked, but now I get the following error:

ValueError: [E143] Labels for component 'relation_extractor' not initialized. This can be fixed by calling add_label, or by providing a representative batch of examples to the component's `initialize` method.

I ran the same code on data I prepared last year in a different annotation tool (UBIAI) and it works just fine, so I am certain there is something wrong with the dataset here.

Hi @korneliaB ,

Ok, let us step back for a bit. Since you already have the labeled documents in Prodigy, you can export them to .jsonl using the db-out command, then reuse or modify this parse_data.py script to convert the JSONL files into the spaCy format.
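
Before converting, it may be worth checking that the export actually contains relation labels, since E143 complains that the component saw no labels at initialization. Here is a minimal sketch, assuming each line of the exported JSONL is one task with an "answer" field and a "relations" list whose entries carry a "label" (the shape I would expect from rel.manual). The function name and the sample lines are hypothetical:

```python
import json
from collections import Counter


def count_relation_labels(jsonl_lines):
    """Count relation labels across accepted tasks in a JSONL export."""
    counts = Counter()
    for line in jsonl_lines:
        line = line.strip()
        if not line:
            continue
        task = json.loads(line)
        # Assumption: only accepted tasks should contribute labels.
        if task.get("answer", "accept") != "accept":
            continue
        for rel in task.get("relations", []):
            counts[rel["label"]] += 1
    return counts


# Hypothetical two-line export, for illustration only.
sample = [
    '{"text": "A binds B", "answer": "accept", '
    '"relations": [{"head": 0, "child": 2, "label": "LABEL1"}]}',
    '{"text": "C and D", "answer": "reject", '
    '"relations": [{"head": 0, "child": 2, "label": "LABEL2"}]}',
]
print(count_relation_labels(sample))
```

With a real export you would pass the open file object instead of `sample`. If the counter comes back empty, the converted corpus probably has no relations for the component to learn labels from.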

The reason it errored out is that the component expects some labels before it is initialized. You can see this being done in the main function. So you have to do something like:

python scripts/parse_data.py path/to/json path/to/train.spacy path/to/dev.spacy path/to/test.spacy

If you're using your own dataset, you might need to adjust the parsing process. But a good first step would be to try this script out on your own exported JSONL files.
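
To give an idea of what "adjusting the parsing process" could involve, here is a rough sketch of the relation-mapping step, not the script itself. It assumes that each task has "spans" with "token_start" / "token_end", and that each relation's "head" and "child" point at the last token of a span (which is how I understand the export). It keys relations by the first token of each span instead. The function name and the optional `label_map` argument are hypothetical. If I remember correctly, the tutorial script also has a hardcoded label mapping for its own dataset, so your LABEL1 / LABEL2 would need to be handled there.

```python
from collections import defaultdict


def relations_by_start_token(task, label_map=None):
    """Map (head_start, child_start) token pairs to {label: 1.0}."""
    # Assumption: relation head/child refer to a span's end token.
    span_end_to_start = {
        span["token_end"]: span["token_start"]
        for span in task.get("spans", [])
    }
    rels = defaultdict(dict)
    for rel in task.get("relations", []):
        # Fall back to the raw index if no span ends at that token.
        head = span_end_to_start.get(rel["head"], rel["head"])
        child = span_end_to_start.get(rel["child"], rel["child"])
        label = rel["label"]
        if label_map is not None:
            # Unknown labels pass through unchanged in this sketch.
            label = label_map.get(label, label)
        rels[(head, child)][label] = 1.0
    return dict(rels)


# Hypothetical task, for illustration only.
task = {
    "spans": [
        {"token_start": 0, "token_end": 1, "label": "LABEL1"},
        {"token_start": 3, "token_end": 3, "label": "LABEL2"},
    ],
    "relations": [{"head": 1, "child": 3, "label": "LABEL1"}],
}
print(relations_by_start_token(task))
```

The resulting dictionary is the kind of structure you would then attach to each Doc before writing the .spacy files.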

This is a good article on that. Maybe the linked resources will help you.
