Are there any recipes to train a relation-extraction model?

From this topic, it seems that the relation/dependency between two entities can be annotated as a label:

{
    "text": "entity A, some text in between, entity B"
    "spans": [{"start": 0, "end": 8, "label": "A"}, {"start": 32, "end": 40, "label": "B"}],
    "label": "RELATED"
}

Suppose we had some annotated samples like this – how can I build a model to predict relations, and which model (or recipe) should I adopt?
It's kind of an emergency for us, so I hope for a reply as soon as possible.
@honnibal

Hi, sorry for the late reply! There’s currently no built-in recipe to do this – but we’re working on it. You can also easily write your own! Here are some resources that should be helpful:

  • Workflow: Custom Recipes: everything you need to know about custom recipes, including examples. For more details, you can also check out the respective section in your PRODIGY_README.html.
  • Code example: Training an intent parser with spaCy: This example shows how you can train spaCy’s parser to predict any type of tree structure over your text (not just syntactic relations) – in this case, finding local businesses and their attributes.

To annotate and correct a model’s predictions with Prodigy, all you need is a spaCy model that predicts something, even if it’s not that good yet. If you want to train with a model in the loop, you can write two functions: one that takes a stream of examples, predicts the relationship, assigns a score and yields (score, example) tuples, and one function that takes a list of examples and updates the model.

You can also collect annotations in a more static way without a model in the loop. This is even easier – all you need to do is create an iterable stream of examples and return it from your recipe. So let’s say you’ve pre-trained a spaCy model using the intent parser example above. You could then write a custom recipe that looks something like this:

import prodigy
from prodigy.components.loaders import JSONL  # or any other loader
import spacy  # use spaCy to load the model and get relations

@prodigy.recipe('relation-extraction')
def relation_extraction(dataset, model, source):
    stream = JSONL(source)  # assuming your text source is a JSONL file
    nlp = spacy.load(model)  # load the model from a path or package

    def get_stream(nlp, stream):
        for task in stream:  # iterate over the examples
            doc = nlp(task['text'])   # process the text with your model
            # let's get all dependencies for PLACE and their heads
            relations = [(t.text, t.dep_, t.head.text) for t in doc 
                         if t.dep_ == 'PLACE']
            for child, relation, head in relations:  # create one task for each relation
                # this will show an annotation task with a headline "PLACE"
                # and text like "find → hotel"
                # you could also use spaCy to extract token boundaries and create "spans"
                yield {'text': "{} → {}".format(head, child), 'label': relation}

    return {
        'dataset': dataset,  # the dataset to save annotations to
        'stream': get_stream(nlp, stream)  # the stream of examples
    }

You can then use your recipe like this:

prodigy relation-extraction my_dataset /path/to/model my_text.jsonl -F recipe.py

This will start the server and present you examples for annotation, which you can then accept and reject. You can also be more creative with the way you present the task – for example, you could use spaCy to extract the start and end characters of the individual tokens, and highlight them in the text by adding a "spans" property to the task – for example, [{"start": 0, "end": 7}, {"start": 20, "end": 28}].
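
For example, a minimal sketch of what that could look like, reusing the nlp model and stream from the recipe above (the function name and the PLACE label are just illustrative, matching the earlier example):

def get_stream_with_spans(nlp, stream):
    for task in stream:
        doc = nlp(task['text'])
        for t in doc:
            if t.dep_ == 'PLACE':  # same example relation as above
                head = t.head
                # character offsets of the child and head tokens
                spans = [{'start': t.idx, 'end': t.idx + len(t.text)},
                         {'start': head.idx, 'end': head.idx + len(head.text)}]
                yield {'text': task['text'], 'spans': spans, 'label': t.dep_}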

@honnibal is currently travelling, but I’m sure he’ll also have some additional tips for you once he’s landed :wink:

Thank you, I will give it a try following these instructions. Anyway, Prodigy is an excellent product.


Another question:
I am working on implementing a custom multi-class text classification recipe, and it is still not clear to me how to write the predict() function.
As I understand it, predict(samples) receives text samples, predicts labels for each sample, fills in the relevant fields (e.g. label and score), and finally returns some formatted values. For a multi-class classifier, what should the formatted return value look like (e.g. label, score or other required fields)?
The same question applies to update().
@ines

Yes, exactly! For example, let’s assume your input data looks like this:

texts = ["This is a text about cats", "This is about dogs"]

And let’s assume you also have a model that can predict labels and their scores based on input documents. You can then write a predict function that yields (score, example) tuples. Each example is a dictionary in Prodigy’s format, containing a text and a label, and optional meta information (which is displayed in the bottom right corner in the web app):

def predict(texts):
    for text in texts:
        # predict labels and scores for the text, for example:
        # [('CAT', 0.91), ('DOG', 0.53)]
        labels = YOUR_MODEL.predict(text)
        for label, score in labels:  # create one task for each label!
            example = {'text': text, 'label': label, 'meta': {'score': score}}
            yield (score, example)  # tuples of score and annotation task 

The (score, example) tuples can be consumed by Prodigy’s sorters – for example, prefer_uncertain(), which will resort the stream to prefer examples with a score closest to 0.5. To make annotation more efficient, the web application focuses on one decision at a time – so you can simply create one annotation task for each label.

So in the example above, Prodigy might skip the very confident CAT prediction and only ask you whether the label DOG with a score of 0.53 applies to “This is a text about cats”. You can then click reject, and the annotation is saved to your dataset.

Your update() function takes a list of annotated examples, i.e. the same examples as your input, just with an added "answer" key that’s either "accept", "reject" or "ignore".

def update(examples):
    right = [eg for eg in examples if eg['answer'] == 'accept']
    wrong = [eg for eg in examples if eg['answer'] == 'reject']
    # update your model with the right and wrong examples
    YOUR_MODEL.update(right, wrong)

All annotations will include the text, a label, the score and whether the label was correct or incorrect (according to the human annotator). So you can use this information to update your model.
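
So an annotated example coming back from the web app could look something like this (simplified):

{
    "text": "This is a text about cats",
    "label": "DOG",
    "meta": {"score": 0.53},
    "answer": "reject"
}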

Putting this all together, a recipe could then look like this:

import prodigy
from prodigy.components.sorters import prefer_uncertain

@prodigy.recipe('custom-textcat')
def custom_textcat(dataset):
    texts = [...] # your data here
    return {
        'dataset': dataset,
        'stream': prefer_uncertain(predict(texts)),
        'update': update,
        'view_id': 'classification'  # use classification interface
    }

This will stream in your examples and resort them to prioritise the ones with an uncertain score. In the web app, you will see a label and a text. Every time the app sends back a batch of annotated tasks to the Prodigy server, your update() function will be called and your model will be updated. All annotations you collect will be stored in the dataset, so you can export it and train from the annotations later on. You can also find more info on the exact data formats and the expected API of the recipe components in your PRODIGY_README.html.
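
For example, to export a dataset to a JSONL file later on, you can use the db-out command (the dataset name here is just the one from the example above):

prodigy db-out my_dataset > my_annotations.jsonl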

A basic question about Prodigy:
will predict(data_stream) execute after each call of update(examples)?
I just debugged the whole process of a multi-class text classification task, and it seems that predict() is only called once, which means the labels of the samples are never updated by the new model. It really puzzles me.
@ines

Sorry, maybe this wasn't clear enough in my answer: predict() returns a generator, i.e. a stream that yields examples. This means that when update() changes the model's state, the updated model will be used for the new examples yielded by the stream.
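
A rough way to picture this (islice is only used here to pull batches out of the generator manually, and annotated_batch is a stand-in for whatever the annotator sends back):

from itertools import islice

stream = predict(texts)                  # nothing is scored yet
first_batch = list(islice(stream, 10))   # scored with the model's current weights
# ... the batch gets annotated in the web app ...
update(annotated_batch)                  # changes YOUR_MODEL's state in place
next_batch = list(islice(stream, 10))    # scored with the updated weights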

I am so sorry, I just found my mistake.

It seems that a total of 64 samples are predicted by predict() during the first round of get_questions(). Can I modify this number?

The number 64 isn’t defined anywhere – it’s how many examples the model needed to predict in order to get one full batch (by default, 10 examples) of questions with a relevant score. If you’re using the prefer_uncertain sorter, this means scores closest to 0.5. Examples with very high or low predictions will be skipped.

Btw, you can specify the size of one batch that’s sent out to the app via the "batch_size" setting in your prodigy.json, or in the 'config' setting returned by your recipe.
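
For example, in your prodigy.json:

{
    "batch_size": 5
}

Or directly in the recipe's return value, e.g. 'config': {'batch_size': 5}.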

Very detailed, thank you very much.

Thanks @ines for your guidance! As this post is one year old, I just wanted to make sure that the approach described here is still the “newest” one, or whether there have already been some improvements introduced to the framework in the meantime?

@nadworny The general idea is still valid – but Prodigy now also has a dep interface which gives you a better way to present the relations visually. See here for the demo.

The data format looks like this and uses the token indices to define the head and child:

{
    "text": "Hello Apple",
    "tokens": [
        {"text": "Hello", "start": 0, "end": 5, "id": 0},
        {"text": "Apple", "start": 6, "end": 11, "id": 1}
    ],
    "arcs": [
        {"head": 1, "child": 0, "label": "compound"}
    ]
}

Note that you can currently only render one dependency at a time, so just like in the example above, you can create one task per relation.
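
If you want to generate tasks in this format from a spaCy doc, a minimal sketch could look like this (the 'compound' filter is just an example – use whatever relations you care about):

def make_dep_tasks(doc):
    # token entries with character offsets and token indices
    tokens = [{'text': t.text, 'start': t.idx, 'end': t.idx + len(t.text), 'id': t.i}
              for t in doc]
    for t in doc:
        if t.dep_ == 'compound':  # example filter
            # one task per arc, since only one dependency is rendered at a time
            yield {
                'text': doc.text,
                'tokens': tokens,
                'arcs': [{'head': t.head.i, 'child': t.i, 'label': t.dep_}]
            }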
