Relationship between named entities

I'm having a lot of fun exploring prodigy for a text mining task. Training a NER model using annotations is extremely rewarding, and works surprisingly well with only a small amount of manual annotations.

So far, I've been using prodigy train ner ... for assigning entities. Let's say I have two entity types, FIRM and TECH. Some data to illustrate my task:

[IBM][FIRM] has identified [hybrid cloud][TECH] as the growth area it will focus on. 

In early March 2017, [Snapchat][FIRM] officially went public. The company is one of the key players in [augmented reality][TECH] apps, and continues to invest in [computer vision][TECH] research.

[Amazon][FIRM] offers the most diverse offerings in [cloud computing][TECH]. [Microsoft][FIRM] offers similar solutions, and has ...

Ideally, I would like to extract the following relationships (across sentence boundaries if possible, see example #2, and with multiple associations, see examples #2 and #3):

Simple relation:
[IBM] -> [hybrid cloud]

One-to-many FIRM:
[Snapchat] -> [augmented reality] and [Snapchat] -> [computer vision]

One-to-many TECH:
[Amazon] -> [cloud computing] and [Microsoft] -> [cloud computing]

I've tried dependency parsing, but because many sentences have a complex structure, I fail to capture a significant share of the relations. I've started annotating with prodigy rel.manual. Relations are always in the direction FIRM->TECH (with one-to-many both ways).

My questions:

  1. Is this approach theoretically feasible to extract the relations between FIRM and TECH entities, with annotations and training yielding better results than dependency parsing?

  2. Is coreference better suited for this specific task (directionality of relation is not necessarily needed, since the hierarchy from FIRM->TECH is clear)?

  3. Since there is currently no prodigy train ner equivalent for training a model with custom relations, is there currently a way to try out if this works?

Thank you very much in advance for any pointers, and I'm looking forward to exploring prodigy further!

Hi, happy to hear you're having fun! :wink:

It sounds like you've got a good workflow going for the Named Entities, so it's probably a good idea to keep that as-is. Basically you'd then be predicting the relations on top of a text in which the entities are assumed to be known already. For the relation annotation, you can make the same assumption: mark the entities in the text, and disable all other content in the sentence to get a clean annotation interface.

Your input file would look something like this:

{"article_id":1,"text":"IBM has identified hybrid cloud as the growth area it will focus on.","spans":[{"start":0,"end":3,"label":"FIRM"},{"start":19,"end":31,"label":"TECH"}]}
{"article_id":2,"text":"In early March 2017, Snapchat officially went public. The company is one of the key players in augmented reality apps, and continues to invest in computer vision research.","spans":[{"start":21,"end":29,"label":"FIRM"},{"start":95,"end":112,"label":"TECH"},{"start":146,"end":161,"label":"TECH"}]}
{"article_id":3,"text":"Amazon offers the most diverse offerings in cloud computing. Microsoft offers similar solutions.","spans":[{"start":0,"end":6,"label":"FIRM"},{"start":44,"end":59,"label":"TECH"},{"start":61,"end":70,"label":"FIRM"}]}

Note that in each line, the "text" field may contain multiple sentences. If you're expecting a lot of relations across sentence boundaries, it probably makes sense to break up your original texts into paragraphs.
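If you'd rather generate these lines than write them by hand, here's a minimal sketch of how the character offsets could be computed. The helper make_task is hypothetical (not part of Prodigy), and it locates each entity with a naive left-to-right string search. In practice you'd more likely take the offsets from your trained NER model (ent.start_char and ent.end_char in spaCy).

```python
import json


def make_task(article_id, text, entities):
    """Build one task dict with character-offset spans.

    `entities` is a list of (surface_string, label) pairs in reading order.
    Each one is located with a plain string search starting after the
    previous match (an assumption made for this sketch).
    """
    spans = []
    cursor = 0
    for surface, label in entities:
        start = text.find(surface, cursor)
        if start == -1:
            raise ValueError(f"{surface!r} not found in text")
        end = start + len(surface)
        spans.append({"start": start, "end": end, "label": label})
        cursor = end
    return {"article_id": article_id, "text": text, "spans": spans}


task = make_task(
    1,
    "IBM has identified hybrid cloud as the growth area it will focus on.",
    [("IBM", "FIRM"), ("hybrid cloud", "TECH")],
)
# One JSON object per line is what goes into the .jsonl input file.
print(json.dumps(task))
```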

Then you could write disable patterns like this:

{"label":"disable","pattern":[{"_": {"label": {"NOT_IN": ["TECH", "FIRM"]}}}]}

Then you can run

python -m prodigy rel.manual my_db en_core_web_lg text.jsonl -l "REL" -dpt patterns.jsonl

I do think that in practice, you really only have simple binary relations from FIRM to TECH. In this screenshot, there are two such binary relations. For your third sentence, you'd also have 2 single relations, just involving the same entities.

In the Prodigy UI, you can hide the arrows if you think that makes it more intuitive for the annotator. The relations will then look like they are symmetrical, but note that in the underlying exported data, you will still have an explicit direction encoded in the JSON. It depends on how you'll use the data downstream whether that directionality is used or not, or whether it's discarded in parsing.
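To illustrate the point about direction, here's a rough sketch of how you could read the relations back out of an exported task. It assumes each entry in "relations" carries a "head_span" and a "child_span" with character offsets and a label; that's my understanding of the export, so please check the field names against your own prodigy db-out output. The function extract_pairs and its firm_first flag are made up for this example.

```python
def extract_pairs(task, firm_first=False):
    """Return (head_text, child_text, relation_label) triples for one task.

    With firm_first=True the annotated direction is discarded and the FIRM
    entity is always put first, which is enough if FIRM->TECH is implied.
    """
    text = task["text"]
    pairs = []
    for rel in task.get("relations", []):
        head, child = rel["head_span"], rel["child_span"]
        if firm_first and head["label"] == "TECH" and child["label"] == "FIRM":
            # The arrow was drawn TECH->FIRM; normalise it to FIRM->TECH.
            head, child = child, head
        pairs.append(
            (
                text[head["start"]:head["end"]],
                text[child["start"]:child["end"]],
                rel["label"],
            )
        )
    return pairs


example = {
    "text": "IBM has identified hybrid cloud as the growth area it will focus on.",
    "relations": [
        {
            # Deliberately annotated "backwards" to show the normalisation.
            "head_span": {"start": 19, "end": 31, "label": "TECH"},
            "child_span": {"start": 0, "end": 3, "label": "FIRM"},
            "label": "REL",
        }
    ],
}
print(extract_pairs(example))                   # direction as annotated
print(extract_pairs(example, firm_first=True))  # direction normalised
```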

You're also asking how this challenge relates to dependency parsing & coreference resolution. I would say that dependency parsing is a very specific type of relation extraction, predicting trees over sentences. In theory you could cast your TECH/FIRM challenge as a challenge for the dependency parser, but I think it'll require quite some fiddling to get that to work properly. Coreference resolution is another very specific relation-extraction task, where the entities are often pronouns and noun phrases, and multiple entities all refer to the same thing. It's really a different task - you can't just use a coref model for the FIRM-TECH challenge.

However, speaking about coref, in your example sentences coreference resolution could find that "IBM" and "it" are the same entity, and perhaps even that "similar solutions" refers to "diverse offerings in cloud computing" in the third sentence. So in theory you could use the output of a coreference model as (additional) input for your relation extraction model. But getting an accurate coref model is a challenge.

We don't currently have a general-purpose relation-extraction model available in Prodigy or in spaCy, but I would imagine that several solutions exist for your challenge of simple relations between two entities.

Thank you very much for your detailed and informative reply.

Using disable patterns is certainly a great idea, and makes annotating very simple.

Let's say there are a couple hundred / a few thousand annotations for this challenge – (un)directed relationships between FIRM and TECH. As for my last question...

Since there is currently no prodigy train ner equivalent for training a model with custom relations, is there currently a way to try out if this works?

How would I currently go about training a spaCy model for relations, similar to NER (i.e., prodigy train ner)? As far as I know, there is no analogous prodigy train * for relations. What are my options to give this a go and see whether the accuracy (precision and recall) of the trained model is promising?

Thanks again for your pointers, and looking forward to hearing your thoughts! :slight_smile:

Hello,

thanks for raising the question, @dc17. I am currently facing the same question.

@SofieVL, would you have any recommendations on how to build on the annotations obtained with the rel.manual recipe? At this point, I find it hard to capitalise on the recipe/annotations.

What's the best option in the spirit of spaCy/prodigy? What kind of relation prediction model should we choose? What kind of development framework (pytorch, thinc)? And then, should it be added to a spaCy model through a custom pipeline?

I think that many people who are currently experimenting with the relationship recipe would be very interested in the recommended workflow to make the best of the resulting annotations.

Last but not least, is there any plan for a "universal relationship prediction" extension as part of spaCy/prodigy in the future? If yes, is there any timeline you could communicate? That would help a lot in planning our workflow.

Thanks in advance,

Cyril

Hi @dc17,

Yes, you can exchange the component in the train command: prodigy train parser

This is a difficult question to answer, because it's not really trivial to recommend a generic approach for relation extraction that is bound to work in all domains.

However, we did recently create an example implementation for a REL component as a spaCy 3 project. You can find it here:

It contains an ML model implemented from scratch in Thinc, as well as a new trainable pipeline component rel that you can add to your pipeline. It requires the annotations to be encoded in the doc._.rel custom attribute. More background information can be found in the docs here: https://nightly.spacy.io/usage/layers-architectures#component-rel (still a bit under construction right now)
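To give an idea of what "encoded in doc._.rel" could look like, here's a plain-Python sketch of the kind of mapping involved. The layout is an assumption based on my reading of the example project: every ordered pair of entities, keyed by the token index where each entity starts, maps to a dict of label scores (1.0 for an annotated relation, 0.0 otherwise). The helper build_rel_mapping is hypothetical and this sketch leaves out pairing an entity with itself; check the project's data-parsing script for the exact format before relying on it.

```python
def build_rel_mapping(entity_starts, gold_relations, labels):
    """Map each ordered entity pair to a {label: 0.0 or 1.0} dict.

    entity_starts: token indices at which the entities begin.
    gold_relations: (head_start, child_start, label) triples that were annotated.
    labels: all relation labels the model should score.
    """
    gold = set(gold_relations)
    mapping = {}
    for head in entity_starts:
        for child in entity_starts:
            if head == child:
                continue  # an entity is not paired with itself in this sketch
            mapping[(head, child)] = {
                label: 1.0 if (head, child, label) in gold else 0.0
                for label in labels
            }
    return mapping


# Tokens: IBM(0) has(1) identified(2) hybrid(3) cloud(4) ...
rel_mapping = build_rel_mapping(
    entity_starts=[0, 3],
    gold_relations=[(0, 3, "REL")],
    labels=["REL"],
)
print(rel_mapping)
```

A mapping like this is what you would then store on the doc (after registering the custom attribute with Doc.set_extension) when converting your Prodigy annotations into training data.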

Note in particular that the project example contains a config file for training the new model on a CPU with a Tok2Vec layer, as well as an alternative to use a Transformer (on a GPU).

This approach should really be regarded as a first baseline approach, and it's currently run on a very small toy dataset with biomolecular interactions. You can check it out and play with it though, replace the data with your own annotations, perhaps add more features and/or tune the model architecture & parameters. Have fun experimenting :wink:
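To check whether the results look promising in terms of precision and recall, you can compare the predicted relation pairs against your held-out annotations. Here's a minimal sketch of that comparison, independent of any particular model. The function name and the "predicted" pairs below are made up purely for illustration; only the gold pairs come from the examples earlier in this thread.

```python
def precision_recall_f1(gold, predicted):
    """Compute precision, recall and F1 over sets of relation tuples."""
    gold, predicted = set(gold), set(predicted)
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


gold_pairs = {
    ("Snapchat", "augmented reality"),
    ("Snapchat", "computer vision"),
    ("Amazon", "cloud computing"),
    ("Microsoft", "cloud computing"),
}
# Hypothetical model output: two correct pairs and one wrong one.
predicted_pairs = {
    ("Snapchat", "augmented reality"),
    ("Amazon", "cloud computing"),
    ("Amazon", "augmented reality"),
}
p, r, f = precision_recall_f1(gold_pairs, predicted_pairs)
print(f"precision={p:.2f} recall={r:.2f} f1={f:.2f}")
```

If you decided to ignore direction, normalise both sets the same way (for example, always FIRM first) before comparing them.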
