How to extract dependencies in spaCy after using prodigy rel.manual?

:wave: I'm trying to develop a model using NER and relation extraction with prodigy and had a usage question.

To start off, I generated a JSONL dataset containing a few thousand sentences, pre-labelled with the 3 spans I want to relate together (I retrieved their offsets using PhraseMatcher). I then ran:

$ prodigy rel.manual ner_exp_restr_dep en_core_web_lg ./output.jsonl \
   --label HAS_COSTS,IN_YEAR \
   --span-label EXPENSE,MONEY,DATE \
   --add-ents \
   --wrap

and spent about 30 minutes annotating 100 or so examples with relation data. The basic idea is that an EXPENSE span relates to a MONEY span, which relates to a DATE span. After saving this to the DB, I ran:

$ prodigy train rel en ner_exp_restr_dep

which exported to a local directory. I then imported the model with:

import spacy

nlp = spacy.load("./rel/model-last")

doc = nlp("In 2020 we recorded $20 million in impairment charges")

for ent in doc.ents:
    print(ent.text, ent.label_)

# 2020 DATE
# $20 million MONEY
# impairment charges EXPENSE

The entities come back, but there doesn't seem to be a way to map from EXPENSE -> MONEY -> DATE.

How do I map from one entity to another using the relations extracted in prodigy? I didn't see anything in the docs about the next steps required.

Hi! I'm kinda surprised train rel worked and didn't raise an error because there's no component to train here :thinking: Actually, I just remembered that you were the same person who disabled use_plac as a workaround on this thread – this is likely the problem because it skips CLI argument validation. So I'd recommend turning that back on and using the other workaround I provided in the thread.

There's currently no built-in component for relation extraction in spaCy, so you will have to use your own implementation, depending on how you want your relation extraction to work. Here are some related threads:

For spaCy v3, @SofieVL recorded this in-depth tutorial on how to implement an entity relation extraction component from scratch. The code for this is available as a spaCy project so you can experiment with it. Even if you want to do something more custom, the video has a lot of helpful pointers on how to model the problem:

Thanks @ines, those links were super helpful. I tried cloning that tutorials/rel_component project, and then ran:

spacy project run all

to process the data and train an initial model. But when I ran this script from the same directory (using the first sentence in assets/annotations.jsonl), it didn't find any of the doc._.rel fields:

import spacy
from scripts.rel_pipe import *
from scripts.rel_model import *

nlp = spacy.load("training/model-best")

doc = nlp("Furthermore, Smad-phosphorylation was followed by upregulation of Id1 mRNA and Id1 protein, whereas Id2 and Id3 expression was not affected.")

print("spans", [(e.start, e.text, e.label_) for e in doc.ents])

for value, rel_dict in doc._.rel.items():
    print(f"{value}: {rel_dict}")

# ℹ Could not determine any instances in doc - returning doc as is.
# spans []

I tried instantiating a Doc object directly (like in evaluate.py), but it doesn't help either:

from spacy.tokens import Doc  # explicit import instead of relying on the star imports above

words  = ['Luciferase', 'assays', 'revealed', 'a', 'approximately20-fold', 'increased', 'transcriptional', 'activity', 'of', 'the', '1025', 'bp', 'sequence', 'as', 'compared', 'to', 'the', 'empty', 'vector', ',', 'indicating', 'that', 'we', 'had', 'identified', 'an', 'active', 'A3', 'G', 'promoter', 'sequence', '(', 'Figure', '3B', ')', '.']
spaces = [' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', '', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', '', ' ', ' ', ' ', '', ' ', '', '', '']
doc    = Doc(nlp.vocab, words, spaces)

# spans []

I'm wondering if I'm missing something basic here :thinking: The model seems trained, I'm importing it with spacy.load but it's not finding any of the labels or entities.

Does your model have an entity recognizer? The relation extraction component requires named entities and in the tutorial, Sofie uses gold-standard entities as the input for simplicity. But if your doc doesn't have any doc.ents, the relation extraction won't have any entities to choose from and predict over.
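
To illustrate why an empty doc.ents leaves nothing to predict, here is a minimal pure-Python sketch of how candidate pairs could be built from entities. It is not the tutorial's actual code: the function name candidate_pairs, the (start_token, text, label) tuple layout and the max_length default are all assumptions made for this example.

```python
from itertools import permutations

def candidate_pairs(entities, max_length=100):
    """Return every ordered pair of distinct entities whose start tokens
    are at most `max_length` tokens apart.

    `entities` is a list of (start_token, text, label) tuples, a hypothetical
    stand-in for what you would read off doc.ents.
    """
    return [
        (head, child)
        for head, child in permutations(entities, 2)
        if abs(child[0] - head[0]) <= max_length
    ]

# Entities from "In 2020 we recorded $20 million in impairment charges"
ents = [
    (1, "2020", "DATE"),
    (4, "$20 million", "MONEY"),
    (8, "impairment charges", "EXPENSE"),
]

print(len(candidate_pairs(ents)))  # 3 entities -> 6 ordered pairs
print(candidate_pairs([]))         # no entities -> no candidates at all
```

With no entities the candidate list is empty, which matches the "Could not determine any instances in doc" message: the relation component has no pairs to score, however well it was trained.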

Ah, I see. I had assumed that both the entities and the relations were defined in assets/annotations.jsonl and that running spacy project run all would handle both the entity and relation extraction learning. Does it only do the latter?

If so, is there a way to plug in my own entity recognizer model here? I'm hoping that if the relation extractor is generic enough, I could just bring my own NER model and JSONL file, and stuff will appear in doc._.rel.

Hi Jamie,

You could in principle train a blank NER model and REL model from scratch within the same pipeline, by adding an ner component to your training config.

A word of warning though. If your annotation has been focusing on getting the relations right, this might not be the best data to train the NER model on. Ideally, you'd want the NER model to be generic enough to pick up all mentions of the entities you're interested in - not just those that are also expressed as being in a relation.

Because you annotated the dataset with rel.manual, you'll only be presented with sentences that have at least 2 entities in them, because a relation isn't possible otherwise. This might bias your NER model if you're only training on the entities in these sentences. This is why it might often make sense to train your NER separately from your REL model.

@SofieVL That's great context, thank you! I'll give it a go.

One last question: is it possible for the relation extraction process to understand and parse relations for different types of entity? Or is it best for the model to be trained against a single named entity? I'll give an example:

I paid $2 (MONEY) for an apple (FRUIT) yesterday (DATE)

I paid $100 (MONEY) for Lego (TOY) last week (DATE)

Although FRUIT and TOY are different entities (and will have their own NER models), their relationship to the other gold-standard entities (in this case MONEY and DATE) is the same.

I'm thinking I could use prodigy to generate a JSONL file that references both types of entities and use spacy project run all against that, but didn't know if adding more label types would reduce the accuracy.

It depends on the exact implementation of your relation extraction component, but in general I think it would be beneficial to learn both these examples at the same time, even if they pertain to different entity types.

The experimental REL code that Ines linked earlier in this thread doesn't currently take the entity types as features for relation extraction, so that shouldn't be a problem. The semantics of the relationship is the same, which is the most important bit for machine learning.
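
As a rough illustration, here is a pure-Python sketch of following relations from one entity to the next. It assumes the predictions are laid out as a dict keyed by (head start token, child start token) pairs that maps each relation label to a score, which is the shape the loop over doc._.rel.items() earlier in the thread suggests. The helper names follow and purchase_chains, the 0.5 threshold and all scores are made up for the example; the HAS_COSTS and IN_YEAR labels are borrowed from the rel.manual command at the top.

```python
def follow(rel, entities, head, label, threshold=0.5):
    """Return the entities that `head` points to via `label`
    with a score of at least `threshold`."""
    by_start = {ent[0]: ent for ent in entities}
    return [
        by_start[child_start]
        for (head_start, child_start), scores in rel.items()
        if head_start == head[0] and scores.get(label, 0.0) >= threshold
    ]

def purchase_chains(rel, entities):
    """Collect (item, money, date) text triples: item -HAS_COSTS-> money -IN_YEAR-> date.
    The entity label (FRUIT, TOY, ...) is never consulted."""
    chains = []
    for ent in entities:
        for money in follow(rel, entities, ent, "HAS_COSTS"):
            for date in follow(rel, entities, money, "IN_YEAR"):
                chains.append((ent[1], money[1], date[1]))
    return chains

# "I paid $2 for an apple yesterday" as (start_token, text, label) tuples
apple_ents = [(2, "$2", "MONEY"), (6, "apple", "FRUIT"), (7, "yesterday", "DATE")]
apple_rel = {  # hypothetical scores
    (6, 2): {"HAS_COSTS": 0.91, "IN_YEAR": 0.03},
    (2, 7): {"HAS_COSTS": 0.02, "IN_YEAR": 0.88},
    (6, 7): {"HAS_COSTS": 0.01, "IN_YEAR": 0.10},
}

# "I paid $100 for Lego last week"
lego_ents = [(2, "$100", "MONEY"), (5, "Lego", "TOY"), (6, "last week", "DATE")]
lego_rel = {  # hypothetical scores
    (5, 2): {"HAS_COSTS": 0.87, "IN_YEAR": 0.05},
    (2, 6): {"HAS_COSTS": 0.04, "IN_YEAR": 0.79},
}

print(purchase_chains(apple_rel, apple_ents))
print(purchase_chains(lego_rel, lego_ents))
```

The same lookup handles both sentences because it only uses token offsets and relation labels, so adding more entity types changes nothing on the reading side.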