Custom recipe for Annotating Overlapping Spans

Is there a way to write a custom recipe to annotate overlapping spans in text possibly with different labels?

Hi!

The manual interface for labelling spans was primarily designed for sequence tagging tasks where the spans are represented as a sequence of token-based tags with one tag per token, which is how most NER implementations are designed. Allowing overlapping spans in the annotation interface could easily become confusing and misleading, because you wouldn't be able to use the data collected this way for the most common use cases. Aside from this, it would also make the UI much more complex, and there's not really a satisfying answer for visualizing multiple nested, overlapping spans while still keeping the interface efficient and intuitive to use.

If you need overlapping spans, you could make multiple passes over the data – this works especially well if you have a hierarchical label scheme, because you can start with the top-level categories and then stream in the examples again with the more fine-grained labels.

Alternatively, you could write a stream generator that keeps sending out the same example so you can label it multiple times until you reject it or send it back empty (which means all spans are added). If you set "instant_submit": true, the answer will be sent back immediately and your recipe's update function will be called before the new task is sent out. So you could use that to check if the example needs more spans or if you can move on to the next one. Examples created from the same task will all have the same _input_hash, so it should be pretty easy to merge the spans in the data afterwards.
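
For the merging step afterwards, a rough post-processing sketch could look something like this – assuming a dataset called "my_dataset" and that you only want to keep accepted answers (both are just placeholders for your own setup):

from collections import defaultdict
from prodigy.components.db import connect

db = connect()  # connects using the settings from your prodigy.json
examples = db.get_dataset("my_dataset")  # "my_dataset" is a placeholder name

# Group all annotations made on the same text via their _input_hash and
# collect the spans from every pass into a single record
merged = defaultdict(lambda: {"text": None, "spans": []})
for eg in examples:
    if eg.get("answer") != "accept":
        continue
    record = merged[eg["_input_hash"]]
    record["text"] = eg["text"]
    record["spans"].extend(eg.get("spans", []))

# merged.values() now holds one record per text, whose "spans" list may
# contain overlapping spans collected across the separate passes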

Hi Ines,

From your answer I understand that the spaCy model can be trained using overlapping entities. Is that correct? If so, would you recommend training one model on multiple overlapping entities, or training a separate model for each individual entity?

And another question: does the spaCy model compare individual token (or span) accuracy and pick the highest value? If so, is it possible to output all entity predictions per token?

Many thanks in advance.

Kind regards,

Bart

No, the model predicts one label per token, so it can't be trained on overlapping entities. This is true for most named entity recognition model implementations, since classic "named entities" can't overlap by definition.

This is a bit more complex – see here for details and an explanation.

Hi, I would like to ask a question about the idea you suggested in the last paragraph of the original response.

I'm struggling with how to write an update function (and a corresponding generator) that, upon receiving an answer (a list of length 1), would tell the generator to either send the same example again or send a new one, based on some condition.

From what I've understood and experienced, the generator would yield batch_size tasks at a time, and once they are annotated, the annotation tool shows "No tasks available" and ends the annotation round.

The thing is, even with a low batch size, you're typically at least one batch "behind", because Prodigy will request new questions in the background so you never run out of tasks to annotate. And it always keeps the most recent annotation on the client so you can press undo without having to reconcile conflicts at the database level.

If you want your stream generator to be able to respond to the latest annotation that was just made, you can set "instant_submit": true in your prodigy.json. Each annotation will then be sent back immediately as it's made in the UI (you just won't be able to undo, so that's the main trade-off here).

Hi @ines

I am thinking about trying to implement your suggestion. From my understanding of the Prodigy API, I am struggling with how to get from the update callback back to the stream generator.

So my callback gets an example from the UI that needs to be returned to the UI for more annotating, but the stream generator has already been primed.

So my code will look something like this:

def my_recipe(source, ...):

    def update(answer):
        if needs_more_annotations(answer):
            return True
        return False

    stream = get_stream(source)

    ...
    return {
        'stream': stream,
        'update': update,
        'config': {'instant_submit': True},  # I think this is how I would do it on the fly rather than prodigy.json?
        ...
    }

I am not sure what the update function should do if needs_more_annotations comes back True, such that the stream doesn't proceed to the next example but re-sends the last example.

Feel like there is some ninja level python required here!

Help appreciated.

I tried to make this work in practice and am struggling.

What I am trying to achieve is: an example is sent to the UI. If the annotations that are returned (to update) pass a test, then the same example is sent back to the UI for additional annotations. I use a status object that update can change, and that my get_stream can read.

Here's my code:

import prodigy
from prodigy.components.loaders import JSONL
from prodigy.components.preprocess import add_tokens

# nlp, CLASS_CONFIG and SLOT_CONFIG are defined elsewhere in my script

@prodigy.recipe('ssl')
def ssl(dataset='examples', source='results.jsonl'):

    data = JSONL(source)
    status = {'annotate_slots': False, 'config': CLASS_CONFIG}

    def update(answers):

        answer = answers[0]  # the UI sends a list of 1 since instant_submit: true
        spans = answer['spans']

        if any(span['label'] in CLASS_CONFIG['labels'] for span in spans):
            status['config'] = SLOT_CONFIG
            status['annotate_slots'] = True
            status['current_example'] = answer
        else:
            status['config'] = CLASS_CONFIG
            status['annotate_slots'] = False

    def get_stream(data):
        for example in data:
            if status['annotate_slots']:
                task = {'text': status['current_example']['text']}
            else:
                task = {'text': example['text']}
            yield task

    stream = get_stream(data)
    stream = add_tokens(nlp, stream)

    return {
        'dataset': dataset,
        'view_id': 'ner_manual',
        'config': status['config'],
        'stream': stream,
        'update': update
    }

and my prodigy.json:

{
    "theme": "basic",
    "custom_theme": {},
    "buttons": ["accept", "reject", "ignore", "undo"],
    "batch_size": 10,
    "history_size": 10,
    "port": 8080,
    "host": "localhost",
    "cors": true,
    "db": "sqlite",
    "db_settings": {"sqlite": {"name": "prodigy.db", "path":"./"}},
    "api_keys": {},
    "validate": true,
    "auto_exclude_current": true,
    "instant_submit": true,
    "feed_overlap": true,
    "ui_lang": "en",
    "project_info": ["dataset", "session", "lang", "recipe_name", "view_id", "label"],
    "show_stats": false,
    "hide_meta": false,
    "show_flag": false,
    "instructions": false,
    "swipe": false,
    "split_sents_threshold": false,
    "html_template": false,
    "global_css": null,
    "javascript": null,
    "writing_dir": "ltr",
    "show_whitespace": false
  }

I am not sure if batch_size is relevant, but I tried this with batch_size = 1 as well.

What happens in both cases is that the UI never re-serves the example for additional spans, even when the test in update() passes. And when the first batch is consumed, the UI reports that there are no tasks remaining.

Any advice would be much appreciated.

@Superscope Yes, the status object works as ninja Python :sweat_smile: Alternatively, you could also declare the state as separate variables and use nonlocal.

I think one potential problem in your recipe logic is that your stream iterates over data, so it will run for each example in data and either re-send from status or send whatever the example is. And if it sends from status, it will not send from data. I'm not sure that's what you want? It sounds like you want it to keep repeating the same example until it's done, and then move on to the next?
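
For example (needs_slot_pass here is just a stand-in for your own condition), the flags can be plain variables in the enclosing recipe function, and the update callback can rebind them with nonlocal:

def ssl(dataset='examples', source='results.jsonl'):

    annotate_slots = False      # plain variables instead of a status dict
    current_example = None

    def update(answers):
        # rebind the enclosing function's variables
        nonlocal annotate_slots, current_example
        answer = answers[0]
        annotate_slots = needs_slot_pass(answer)   # stand-in for your own condition
        current_example = answer

    ...

get_stream can then read annotate_slots and current_example directly – reading doesn't need nonlocal, only assigning does.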

Hi @ines, thanks for your reply.

nonlocal, seen it, wondered, got it!

I am not sure how to make a generator continue to repeat the same example. Once an example has been produced by the generator, the next time it iterates it automatically goes on to the next example.

There are various utilities for this in itertools etc., but the most straightforward one would be while True. It's a bit stressful to look at (haha) but it does the job – just make sure to break the loop once you want to move on so you don't get stuck.

In your logic, you can also use a task's _task_hash to identify it, so you know whether you're still waiting for feedback from task 1 and don't accidentally update task 2 with info from task 1, and so on.
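
So the stream ends up with roughly this shape, where annotate_slots is the flag your update callback flips after each answer (with "instant_submit": true and a batch size of 1, update has already run by the time the next task is requested):

def get_stream(data):
    for example in data:
        while True:
            yield {'text': example['text']}
            if not annotate_slots:
                break    # done with this text, move on to the next example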

Hi @ines, apologies, but we are really struggling with this. Before I dive in: is it possible to create a custom interface where we label spans and relations at the same time? Or am I going to run into an overlapping span issue in the UI? Just thinking of how to work around this.

...on the approach in this thread: every time we try to have the stream continue to loop on an example until some condition is met, no more tasks are retrieved once the first batch is completed (even with a batch size of 1).

thanks in advance.

The built-in relations UI lets you annotate relations and spans at the same time – but not spans that overlap. Managing this type of complexity would be really difficult because every token could be part of any number of spans and any number of relations, including relations attached to both spans and tokens. I honestly can't think of an effective way to present all those layers in a single prompt without making things messy and inefficient.

What are you trying to train with the data btw?

The data is all very proprietary, otherwise I would share it. In general we are trying to extract spans that indicate a desire to buy or sell, and the components of that desire. So take the following; I'll use mobile phones as an example:

"Hi Jane, Apple 10s? Can I buy 3 of those for $200. Is that ok?"

So that would be labelled as:

Hi Jane, <BUY><PHONE>Apple 10s</PHONE>? Can I buy <AMOUNT>3</AMOUNT> of those for <PRICE>$200</PRICE></BUY>. Is that ok?

Hence the overlapping spans (BUY overlaps with PHONE, AMOUNT and PRICE). So what I am trying to achieve is: when an example is presented, first tag the text that indicates the purchase, then return the text to annotate the components.
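
So after both passes, the merged record I'm ultimately after would look roughly like this (character offsets refer to the example sentence above):

{
    'text': 'Hi Jane, Apple 10s? Can I buy 3 of those for $200. Is that ok?',
    'spans': [
        {'start': 9,  'end': 49, 'label': 'BUY'},     # first pass: the intent
        {'start': 9,  'end': 18, 'label': 'PHONE'},   # second pass: the components
        {'start': 30, 'end': 31, 'label': 'AMOUNT'},
        {'start': 45, 'end': 49, 'label': 'PRICE'},
    ]
}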

OK, just to bring some closure to this: it looks like it's not possible to send an example back to the UI.
Note the following experiment I conducted:

@prodigy.recipe('test')
def test(dataset='examples', model=nlp):

    data = [({'text': ' task 1'}, {'text': 'task 2'})]

    def get_stream():
        for eg in data:
            yield eg[0]
            yield eg[1]

    stream = get_stream()
    stream = add_tokens(model, stream)

    return {
        'dataset': dataset,
        'view_id': 'ner_manual',
        'config': CONFIG,
        'stream': stream
    }

With the code above, as expected, I get 'task 1' followed by 'task 2' sent to the UI. However, with the following:

@prodigy.recipe('test')
def test(dataset='examples', model=nlp):

    data = [({'text': ' task 1'}, {'text': 'task 2'})]

    def get_stream():
        for eg in data:
            yield eg[0]
            yield eg[0]

    stream = get_stream()
    stream = add_tokens(model, stream)

    return {
        'dataset': dataset,
        'view_id': 'ner_manual',
        'config': CONFIG,
        'stream': stream
    }

I get 'task 1' only once instead of twice. Is this maybe because, to avoid duplicates in the database, Prodigy skips an example if its hash already exists?

The reason I am asking is because @ines you seem to imply that we should be able to resend examples to the UI. Please confirm if this is possible, ideally with a code example.

Thanks much.

How are you modelling this? I'm just asking because predicting the BUY intent as a sequence task seems pretty tricky – it's probably very easy to end up with ambiguous boundaries?

Sending an example more than once is definitely possible – it's what a lot of workflows are based on. It's not fully clear from your recipe, but I'm assuming you're using instant_submit and a batch size of 1? In that case, the annotation will already be in the database and the second example will be filtered out if it has the exact same hashes.

An easy way to solve this would be to use the same input hashes but different task hashes for each duplicate. This lets you tell the questions apart but also makes it easy to detect which questions are related to the same text. Another thing to watch out for when duplicating examples and modifying them: don't forget to deep-copy the dictionaries (e.g. using copy.deepcopy) – you don't want to be writing to the same object.
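
For example, a small helper along these lines could produce each extra copy – pass_id is just an arbitrary field added so the recomputed task hash differs between the copies:

import copy
from prodigy import set_hashes

def duplicate_task(eg, pass_id):
    # deep-copy so the two questions don't share the same dict object
    task = copy.deepcopy(eg)
    task['pass_id'] = pass_id
    # same _input_hash (based on the text), but a different _task_hash,
    # so the copy isn't filtered out as an already answered duplicate
    return set_hashes(task, input_keys=('text',),
                      task_keys=('text', 'pass_id'), overwrite=True)

In your test recipe above, yielding duplicate_task(eg[0], 1) and then duplicate_task(eg[0], 2) should give you 'task 1' twice.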