Customizing Prodigy for NER and relationship extraction

Hello,
My group and I are interested in doing NER and relationship extraction with custom entity and relation types.
We looked into your annotation system and we have a few questions:

  1. Is it possible to integrate our NER model into your active learning process (it is implemented in Python)?
  2. In the demo, it seems that a user can only approve or reject a NER tag. Is it possible for the user to select the correct NER tag if they reject the system's suggestion?
  3. Is it possible to use your system also for gathering ground truth for relationship extraction? A relationship can be between two entities that appear in different sentences. Is it possible to display more than one sentence when validating a relationship with a user?

Thanks in advance,
Boaz

Thanks for your questions!

Yes, absolutely. Prodigy is centered around “recipes”, which are simple Python functions that return a dictionary of components – e.g. the stream of examples and optional functions to update the model, and customise other behaviours. You can find more details and examples of this in the custom recipes usage workflow. The source of the built-in recipes is also shipped with Prodigy, so you can look at the code and take inspiration from it.

To make use of the active learning, all you need is a function that assigns scores to an incoming stream of examples and yields (score, example) tuples, and a function that takes a list of annotated examples and updates the model. A custom recipe to integrate a custom model could look something like this:

import prodigy
from prodigy.components.loaders import JSONL  # file format loader
from prodigy.components.sorters import prefer_uncertain  # or other sorter

@prodigy.recipe('custom-ner')
def custom_ner(dataset, source):
    model = load_your_custom_model()  # load your model however you want
    stream = JSONL(source)  #  assuming your data source is a JSONL file
    stream = model(stream)  #  assign scores to examples via model
    stream = prefer_uncertain(stream)  # sort to prefer uncertain scores

    return {
        'dataset': dataset,  # ID of the dataset to store annotations
        'stream': stream,  # stream of examples
        'update': model.update,  # update model with annotations
        'view_id': 'ner'  # use NER annotation interface
    }

You could then use the recipe as follows:

prodigy custom-ner my_dataset my_data.jsonl -F recipe.py
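
To make the two required functions more concrete, here is a minimal sketch of what the `model` object in the recipe above might look like. The class name `DummyModel`, the random scoring and the `accepted` counter are placeholders I made up for illustration, not part of Prodigy; your own model would compute a real confidence score and do a real training step.

```python
import random


class DummyModel:
    """Hypothetical stand-in for a custom NER model (not a Prodigy class)."""

    def __init__(self, seed=0):
        self.rng = random.Random(seed)
        self.accepted = 0  # number of accepted annotations seen so far

    def __call__(self, stream):
        # Score each incoming example and yield (score, example) tuples,
        # which is the shape a sorter such as prefer_uncertain consumes.
        for example in stream:
            score = self.rng.random()  # placeholder for model confidence
            yield (score, example)

    def update(self, answers):
        # Called with a list of annotated examples. This sketch assumes each
        # example carries an "answer" key such as "accept" or "reject".
        for example in answers:
            if example.get("answer") == "accept":
                self.accepted += 1  # placeholder for a real training step
```

Calling `DummyModel()(stream)` returns a generator, so examples are still scored lazily as the annotator works through them.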

Sure! To make the most out of Prodigy’s intuitive interface, you could for example extract only the rejected examples from your dataset and reannotate them. This keeps the annotator focused on one task at a time. If you have a clearly defined label set of say, 5-10 labels, you could use the choice interface and only stream in the examples that were previously rejected. The annotator would then see the text with the highlighted entity, and would be able to select one of the available labels (to correct it) or “no label” if the span is not an entity.
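
As a rough sketch of that second pass, the function below filters previously collected annotations down to the rejected ones and attaches a list of options for the choice interface. The function name, the `NONE` option ID and the decision to drop the rejected label from the spans are my own assumptions; the input is assumed to be a list of dicts with `text`, `spans` and `answer` keys.

```python
def make_choice_tasks(examples, labels):
    """Yield choice tasks for examples that were previously rejected."""
    options = [{"id": label, "text": label} for label in labels]
    options.append({"id": "NONE", "text": "no label"})
    for example in examples:
        if example.get("answer") != "reject":
            continue  # only re-annotate the rejected suggestions
        # Keep the span boundaries but drop the rejected label, so the
        # annotator sees the highlighted text without the wrong tag.
        spans = [{"start": s["start"], "end": s["end"]}
                 for s in example.get("spans", [])]
        yield {"text": example["text"], "spans": spans,
               "options": list(options)}
```

In a recipe, you would pass the resulting generator as the `'stream'` and set `'view_id'` to `'choice'`. Loading the earlier annotations is left out here, since it depends on how you store them.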

There’s also a boundaries interface that lets you create entity (or other span) annotations by selecting the individual tokens. We’re also working on more “traditional” interfaces for use cases that require manual labeling.

Yes – I’ve actually outlined a few solutions and ideas for relationship annotation in this thread.

In general, you can freely mix and match the different annotation UI components and build your own interface. For example, if your annotation task contains spans, those will be rendered as entities within the text. If a span contains a label, it will be rendered next to the entity. If the task itself contains a label, it will be displayed as the headline above the task. So let’s say you need to annotate whether two entities that are part of a longer text are related. Your annotation task could look like this:

{
    "text": "entity A, some text in between, entity B",
    "spans": [{"start": 0, "end": 8, "label": "A"}, {"start": 32, "end": 40, "label": "B"}],
    "label": "RELATED"
}

This would show an annotation card with the headline “RELATED” and the two entities highlighted within the text. The task will be immediately intuitive to the annotator, and you’ll be able to collect one-click annotations on whether the two entities are “related” (of course, in a real-world example, you’d probably want to use a more descriptive relationship here).
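
If you generate these tasks programmatically, the character offsets are the part that is easiest to get wrong. Here is a small sketch of a helper that builds such a task from a text and two entity strings. The function name is made up, and it assumes that the first entity occurs before the second and that a simple string search is enough to locate them; in practice you would probably take the offsets from your NER model's output.

```python
def make_relation_task(text, entity_a, entity_b, relation):
    """Build a task dict with two highlighted spans and a relation label."""
    # str.index raises ValueError if an entity is not found in the text.
    start_a = text.index(entity_a)
    end_a = start_a + len(entity_a)
    # Search for the second entity only after the end of the first one.
    start_b = text.index(entity_b, end_a)
    end_b = start_b + len(entity_b)
    return {
        "text": text,
        "spans": [
            {"start": start_a, "end": end_a, "label": "A"},
            {"start": start_b, "end": end_b, "label": "B"},
        ],
        "label": relation,
    }
```

Because the text is just a string, it can span several sentences, which covers the case where the two entities are not in the same sentence.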

If the built-in options are not enough, you can always use custom HTML templates. You can either specify the HTML directly in the task, e.g.:

{"html": "This is text. <strong>This is bold.</strong>"}

… or provide a 'html_template' string within the config returned by your recipe. The template will have access to all properties specified in the task as Mustache variables (which also allow nested objects!). For example:

{"relation": "SIBLINGS", "data": {"text1": "foo", "text2": "bar"}}

<strong style="color: red">{{relation}}</strong><br />
{{data.text1}} → {{data.text2}}
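
To show where the template string would go, here is a sketch of the dictionary a recipe could return for this case. The function name and the idea of passing in a ready-made stream are assumptions for illustration, and I've left out the `@prodigy.recipe` decorator so the snippet runs on its own; the `'html'` view ID is what I'd expect for an HTML-based task, so check it against the docs for your version.

```python
# Mustache template with access to the task's properties, including
# nested ones like data.text1.
TEMPLATE = (
    '<strong style="color: red">{{relation}}</strong><br />'
    "{{data.text1}} → {{data.text2}}"
)


def relation_components(dataset, stream):
    """Return recipe components that render tasks with a custom template."""
    return {
        "dataset": dataset,  # ID of the dataset to store annotations
        "stream": stream,  # iterable of task dicts
        "view_id": "html",  # assumed ID of the HTML interface
        "config": {"html_template": TEMPLATE},
    }
```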

Hi Ines,
First, thanks a lot for your detailed answers.
Another question:
Your system seems to work with a pre-trained model and improves it with the active learning process.
Do you also have a UI for the initial state when we need users to annotate documents before we have trained the model? (for example a UI similar to BRAT annotation tool)
If you do not have such a UI, how does your active learning process handle a new model before training? (Will it ask the user about every word/randomly/…?)

Thanks again,
Boaz

There is a nice video where Ines goes through the whole process of training a model from scratch and those issues are discussed:

I recommend taking a look and coming back with questions if you still have them.


Prodigy does come with an ner.mark recipe that uses the boundaries interface, which lets you highlight spans of text. You can see an example of this in the recipes overview. However, since marking entities manually is often unnecessarily tedious, you should only have to use this for edge cases or if your goal is to create gold-standard annotations.

To get over the “cold start problem” when training a new entity label, Prodigy lets you pass in a list of match patterns describing examples of the entities you’re looking for. Match patterns can include all properties available for spaCy’s rule-based matcher – so you can define single or multi-word tokens or use other linguistic annotations like part-of-speech tags. You can also use the terms.teach and terms.to-patterns recipes to create a terminology list from a number of seed terms using word vectors, and convert the list to match patterns.
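
To give an idea of what such patterns look like, here is a small sketch that turns a list of seed terms into token-based patterns and serializes them as JSONL. The function names are made up, the example label is arbitrary, and splitting on whitespace is a simplification of spaCy's tokenization; the exact attribute names accepted in patterns may differ between versions, so check the docs before relying on this.

```python
import json


def terms_to_patterns(terms, label):
    """Convert seed terms into case-insensitive token patterns."""
    patterns = []
    for term in terms:
        # One dict per token; "lower" matches on the lowercase token text.
        tokens = [{"lower": word.lower()} for word in term.split()]
        patterns.append({"label": label, "pattern": tokens})
    return patterns


def to_jsonl(patterns):
    """Serialize patterns as one JSON object per line."""
    return "\n".join(json.dumps(pattern) for pattern in patterns)
```

For example, `terms_to_patterns(["New York"], "GPE")` produces a single two-token pattern, so multi-word terms only match when the tokens appear next to each other.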

When you start training, Prodigy uses the patterns to start suggesting entities and will collect the first set of examples to update the model in the loop. As the model improves, it will also start suggesting entities based on what it’s learned so far from the pattern matches.

We actually just recorded another video tutorial that shows an end-to-end example of training a new entity type from scratch starting off with only 3 seed terms:

You can find more details in the docs and this thread. I’ve also posted a quick TL;DR version of the workflow in this comment.
