I would like to train a custom dependency parser to extract relations between tokens.
I have a collection of 29k annotations in the form (all character offsets are relative to the beginning of the document):
{"text": "Xanax 0.25 mg",
"tokens": [{"text": "0.25 mg", "start": 14661, "end": 14668, "id": 104},
{"text": "Xanax", "start": 14655, "end": 14660, "id": 103}],
"arcs": [{"head": 104, "child": 103, "label": "Strength-Drug"}]}
{"text": "Xanax 0.25 mg 1-2 tabs prn\nalbuterol MDI\nIbuprofen prn\n",
"tokens": [{"text": "1-2", "start": 14669, "end": 14672, "id": 105},
{"text": "Xanax", "start": 14655, "end": 14660, "id": 103}],
"arcs": [{"head": 105, "child": 103, "label": "Dosage-Drug"}]}
My aim is to train a custom model for relation extraction. I uploaded the JSONL file to the Prodigy database.
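Before training, it may be worth ruling out malformed records. Below is a minimal sketch in plain Python (no Prodigy or spaCy calls) that checks one record in the format shown above. The function name `check_record` and the particular checks are my own assumptions about what "well-formed" means here, not anything Prodigy requires.

```python
import json


def check_record(record):
    """Return a list of problems found in one annotation record (hypothetical helper)."""
    problems = []
    token_ids = {tok["id"] for tok in record["tokens"]}

    for tok in record["tokens"]:
        # Offsets are document-relative, so only the span length can be compared.
        if tok["end"] - tok["start"] != len(tok["text"]):
            problems.append(f"token {tok['id']}: span length does not match its text")
        # The token text should at least occur somewhere in the snippet.
        if tok["text"] not in record["text"]:
            problems.append(f"token {tok['id']}: text not found in the snippet")

    for arc in record["arcs"]:
        for role in ("head", "child"):
            # Every arc endpoint should refer to a token listed in this record.
            if arc[role] not in token_ids:
                problems.append(f"arc {arc['label']}: unknown {role} id {arc[role]}")

    return problems


# First record from the sample above, as one JSONL line.
line = (
    '{"text": "Xanax 0.25 mg", '
    '"tokens": [{"text": "0.25 mg", "start": 14661, "end": 14668, "id": 104}, '
    '{"text": "Xanax", "start": 14655, "end": 14660, "id": 103}], '
    '"arcs": [{"head": 104, "child": 103, "label": "Strength-Drug"}]}'
)
record = json.loads(line)
problems = check_record(record)
print(problems)
```

Running the same function over every line of the JSONL file would show whether any record has inconsistent offsets or arcs that point to missing token ids.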
When I pass it to dep.batch-train, I receive this error. Any help, please?
Should I first fit a spaCy model and then fine-tune it with Prodigy? What are the best practices for training a relation extraction model in general?
Also, I’m a bit puzzled by the figures:
While I have 29032 accepted examples, the 20% portion contains 2004 examples and the remaining 100% contains 8019. How are these numbers computed?
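As far as I can tell, the two figures are at least consistent with each other. Here is a small arithmetic sketch, assuming the 20% portion is taken by rounding down from some total (that rounding rule is my assumption, not something I found documented):

```python
accepted = 29032   # accepted examples reported for the dataset
held_out = 2004    # size of the 20% portion
remaining = 8019   # size of the "100% remaining" portion

# The two portions together give the number of examples actually split.
total = held_out + remaining

# 20% of that total, rounded down (assumed rounding rule).
expected_held_out = total // 5

# Examples that were accepted but do not appear in either portion.
unaccounted = accepted - total

print(total, expected_held_out, unaccounted)
```

So the split looks like 20% / 80% of 10023 examples rather than of 29032. What I cannot work out is why only 10023 of the accepted examples are used.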