Text to Knowledge Graph - Prodigy+Spacy

We are trying to figure out an efficient way to convert text to a knowledge graph. Are there any existing pipelines or best practices for doing this? Is there a better alternative to the approach proposed below?

This is our proposed approach:
Going through the Prodigy forum, we found that combining multiple annotation tasks in one interface is not recommended. So, we are setting up three Prodigy instances.

Prodigy for NER (to train and identify our custom entities)
|
V
Prodigy for Relationship extraction
|
V
Prodigy for entity linking (with our custom ontology)

The above pipeline will generate RDF (entity > relation > entity) triples, which will then be loaded into GraphDB.
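For the last step, here is a minimal sketch of turning the extracted triples into RDF with rdflib (the namespace and the example triple are made up; they would be replaced by IRIs from our ontology):

```python
from rdflib import Graph, Namespace

# Hypothetical namespace for our custom ontology; replace with real IRIs.
EX = Namespace("http://example.org/ontology/")

def triples_to_rdf(triples):
    """Build an RDF graph from (subject, relation, object) string triples."""
    g = Graph()
    g.bind("ex", EX)
    for subj, rel, obj in triples:
        g.add((EX[subj], EX[rel], EX[obj]))
    return g

# Made-up example output of the NER -> REL -> EL pipeline.
triples = [("Aspirin", "treats", "Headache")]
print(triples_to_rdf(triples).serialize(format="turtle"))
```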


Hi,

Yes, that approach sounds very reasonable! As these are three very different (and often complex) tasks, it definitely makes sense to run them in sequence, to allow annotators to focus on one specific task at a time.

Just as a little tip: when you start off with your project, I would begin by annotating a small portion of your corpus with all three tasks first. This will help you see whether your annotation guidelines for the first steps support the annotations you want to make in the later steps. You don't want to annotate your whole corpus with NER, only to find out that for REL or EL you should have done the NER slightly differently.


@himeshph There are existing packages for this. In general the task is called Open Information Extraction (OpenIE). The best-known Python tool for this problem is Stanford's OpenIE project. It works moderately well, but not perfectly: it tends to be high recall but low precision. OpenIE is a very, very hard task.
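For reference, here's a rough sketch of calling Stanford's OpenIE through the stanza CoreNLP client (this assumes a local CoreNLP download and the stanza package; the exact setup may differ for your environment):

```python
# Requires the stanza package plus a local Stanford CoreNLP download
# (CORENLP_HOME must point at it).
from stanza.server import CoreNLPClient

text = "Barack Obama was born in Hawaii."

with CoreNLPClient(annotators=["openie"], be_quiet=True) as client:
    ann = client.annotate(text)
    for sentence in ann.sentence:
        for triple in sentence.openieTriple:
            # Each extracted triple exposes subject, relation and object strings.
            print(triple.subject, triple.relation, triple.object)
```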

That said, the existing methods aren't entirely satisfactory. So I've also been wondering about how to use Prodigy for labeling the Subject-Verb-Object (SVO) triples that are used in Knowledge Graphs and other Information Extraction tasks.

I would love to see a Prodigy annotation mode designed for this task. The resulting labels for a text would be a set of SVO triples. I think this annotation mode would get a lot of use, because extracting information in the form of SVO triples is a very common task, particularly in biomedical NLP.

You could probably achieve something like this using rel.manual and the relations UI, plus some patterns and a model to identify candidates or even automate some of the annotations.

If you have a decent part-of-speech tagger, identifying the verbs should be pretty easy. You can also merge noun phrases and use disable patterns to disable all other tokens you know are never going to be part of your SVO triples (punctuation etc.). Then all that's left to do is attaching the subjects and objects to the verbs.
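For example, merging the noun phrases could look something like this with spaCy's retokenizer (a minimal sketch, assuming en_core_web_sm and a made-up sentence):

```python
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("The new drug reduces chronic inflammation in elderly patients.")

# Merge each noun chunk into a single token, so subjects and objects
# show up as single units when you attach them to the verbs.
with doc.retokenize() as retokenizer:
    for chunk in doc.noun_chunks:
        retokenizer.merge(chunk)

print([(token.text, token.pos_) for token in doc])
```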

You can even take this one step further and use the dependency parse (see Linguistic Features · spaCy Usage Documentation). This should already give you the subject/verb/object relationships you're interested in, e.g. by checking for nsubj or dobj dependency labels. So you could also use the dep.correct workflow with a trained dependency parser and only those labels.
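Here's a rough sketch of what reading SVO candidates off the dependency parse could look like (again assuming en_core_web_sm; real text will need extra handling for passives, conjunctions, clausal objects and so on):

```python
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("The company acquired a small startup in 2020.")

# Naive SVO extraction: for every verb, pair its nominal subject(s)
# with its direct object(s).
for token in doc:
    if token.pos_ != "VERB":
        continue
    subjects = [c for c in token.children if c.dep_ == "nsubj"]
    objects = [c for c in token.children if c.dep_ in ("dobj", "obj")]
    for subj in subjects:
        for obj in objects:
            print((subj.text, token.lemma_, obj.text))
```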