Custom dependency training (IndexError: list assignment index out of range)

I would like to train a custom dependency parser to extract relations between tokens.

I have a collection of 29k annotations in the form (all character offsets are relative to the beginning of the document):

{"text": "Xanax 0.25 mg", 
 "tokens": [{"text": "0.25 mg", "start": 14661, "end": 14668, "id": 104}, 
            {"text": "Xanax", "start": 14655, "end": 14660, "id": 103}], 
 "arcs": [{"head": 104, "child": 103, "label": "Strength-Drug"}]}

{"text": "Xanax 0.25 mg 1-2 tabs prn\nalbuterol MDI\nIbuprofen prn\n", 
 "tokens": [{"text": "1-2", "start": 14669, "end": 14672, "id": 105}, 
            {"text": "Xanax", "start": 14655, "end": 14660, "id": 103}], 
 "arcs": [{"head": 105, "child": 103, "label": "Dosage-Drug"}]}

My aim is to train a custom model for relation extraction. I uploaded the JSONL file to the Prodigy database:

[screenshot: Capture]

When I pass it to dep.batch-train, I receive the error in the title:

[screenshot: Capture_1]

Any help please?

Should I first fit a spaCy model and then fine-tune it with Prodigy? What are the best practices for training a relation extraction model in general?

Also, I’m a bit puzzled by the figures:

I have 29032 accepted examples, yet the 20% portion is 2004 examples and the "100% of remaining" portion is 8019. How are these computed?

I think there's a fairly straight-forward problem here, and also a couple of deeper issues.

I'm pretty sure that your indices are off. You say:

I have a collection of 29k annotations in the form (all character offsets are relative to the beginning of the document):

That's not the correct format for Prodigy: Prodigy expects each line of the jsonl file to have character offsets relative to that line's text. So the first token of a text will always start at character 0. I think that's why you're seeing the index error, and it could also explain the unexpected data sizing.
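As a rough illustration of the fix, here is a minimal sketch that shifts document-relative offsets so they become relative to each record's own text. The function names (infer_base, rebase_offsets) are hypothetical, and the sketch assumes that each token's "text" appears verbatim in the record's "text". It only corrects the offsets; it does not check anything else about the format that dep.batch-train expects.

```python
def _aligned(text, token, base):
    """True if the token's shifted offsets select exactly its text."""
    start, end = token["start"] - base, token["end"] - base
    return 0 <= start <= end <= len(text) and text[start:end] == token["text"]


def infer_base(record):
    """Guess the document offset at which record["text"] begins."""
    text, tokens = record["text"], record["tokens"]
    first = tokens[0]
    # Try every occurrence of the first token's text as its true position.
    pos = text.find(first["text"])
    while pos != -1:
        base = first["start"] - pos
        if all(_aligned(text, tok, base) for tok in tokens):
            return base
        pos = text.find(first["text"], pos + 1)
    raise ValueError("no base offset is consistent with all tokens")


def rebase_offsets(record):
    """Return a copy of the record with offsets relative to its own text."""
    base = infer_base(record)
    fixed = dict(record)
    fixed["tokens"] = [
        dict(tok, start=tok["start"] - base, end=tok["end"] - base)
        for tok in record["tokens"]
    ]
    return fixed


record = {
    "text": "Xanax 0.25 mg",
    "tokens": [
        {"text": "0.25 mg", "start": 14661, "end": 14668, "id": 104},
        {"text": "Xanax", "start": 14655, "end": 14660, "id": 103},
    ],
    "arcs": [{"head": 104, "child": 103, "label": "Strength-Drug"}],
}
fixed = rebase_offsets(record)
```

After this, "Xanax" spans characters 0 to 5 and "0.25 mg" spans 6 to 13 of the record's text. If you still have the original documents and know where each snippet starts, subtracting that known start is simpler and safer than inferring it.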

The deeper issue is that I'm not sure how well the parser will learn your relations. In theory it's possible, but the parser is trying to produce a fully connected tree over the inputs, so there will be a lot of under-specified arcs.

Are you sure this is the easiest way to do what you need? For instance, I think regular expressions or matcher rules might be a better fit for something simple like dosages and strengths. You could at least use rules to detect the easy cases.
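To make the rule-based suggestion concrete, here is a small sketch using only Python's re module. The pattern and the list of units (mg, mcg, g, ml) are illustrative assumptions, not a complete grammar for medication strengths, and find_strengths is a hypothetical helper name.

```python
import re

# Assumed pattern: a number, an optional space, then one of a few units.
STRENGTH = re.compile(r"\b\d+(?:\.\d+)?\s?(?:mg|mcg|g|ml)\b", re.IGNORECASE)


def find_strengths(text):
    """Return (start, end, matched_text) for each strength-like span."""
    return [(m.start(), m.end(), m.group()) for m in STRENGTH.finditer(text)]


text = "Xanax 0.25 mg 1-2 tabs prn\nalbuterol MDI\nIbuprofen prn\n"
strengths = find_strengths(text)  # [(6, 13, '0.25 mg')]
```

The offsets returned here are already relative to the text, so easy cases found this way could be paired with the drug entities from an NER model, leaving only the harder cases for a learned component.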

Hi @honnibal, thanks, I can see my problem now. I mistakenly thought that it was enough to create token ids and then link them with arcs. I’ll rewrite the annotations.

I already have a nice working pipeline, similar to what you suggested. However, I have 7 entities and 6 ‘arcs’. At the moment, I have trained a good NER model (spaCy + Prodigy) and I can parse the tree to find children with specific labels, but given a relatively large annotated relations dataset, I hoped to learn the dependencies (similar to the custom intent parse tree in spaCy’s example).

Is it possible to augment the existing parser with custom relations (as a kind of extension, for example)?

Thanks anyway!

You can’t really augment the existing parser, no. Every word has to have exactly one head and one label. The syntax already has answers for all of those, so you can’t really update with new labels.

You can try the approach in the custom intent parse tree example, and see how you go. Otherwise you could also try looping over the word pairs, and running a classifier on them.
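The pair-classification idea could be sketched roughly as follows. Everything here is an assumption for illustration: the entity labels (Drug, Strength, Dosage) are inferred from the arc labels in the question, the entities are assumed to come from an NER model with offsets relative to the text, and toy_classifier is a distance heuristic standing in for a trained model.

```python
from itertools import product


def candidate_pairs(entities, head_labels=("Strength", "Dosage"), child_label="Drug"):
    """Enumerate every (attribute, drug) pair as a relation candidate."""
    heads = [e for e in entities if e["label"] in head_labels]
    children = [e for e in entities if e["label"] == child_label]
    return list(product(heads, children))


def toy_classifier(head, child, max_gap=20):
    """Placeholder for a trained model: accept attributes that closely follow the drug."""
    gap = head["start"] - child["end"]
    if 0 <= gap <= max_gap:
        return f'{head["label"]}-{child["label"]}'
    return None


def extract_relations(entities, classify=toy_classifier):
    """Run the classifier over all candidate pairs and keep the accepted ones."""
    relations = []
    for head, child in candidate_pairs(entities):
        label = classify(head, child)
        if label is not None:
            relations.append((head["text"], child["text"], label))
    return relations


entities = [
    {"text": "Xanax", "start": 0, "end": 5, "label": "Drug"},
    {"text": "0.25 mg", "start": 6, "end": 13, "label": "Strength"},
    {"text": "1-2", "start": 14, "end": 17, "label": "Dosage"},
    {"text": "albuterol", "start": 27, "end": 36, "label": "Drug"},
]
relations = extract_relations(entities)
```

In practice the classify argument would be replaced by a model trained on the annotated arcs, with negative examples drawn from the candidate pairs that have no annotated relation.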