Dynamic choices for binary long-range coreference

Thank you for the work on Prodigy. I've used the standard NER annotation options and have been very happy with their speed and ease of use. Now I'm facing a more complex scenario.

I want to annotate coreference relations between entity mentions across long ranges of text -- short stories. So a mention of (let's say) Gatsby on page 1 should be in a cluster with a mention of Gatsby on page 50. We are not marking pronouns. I have gold mentions to annotate.

Obviously we can't load up the entire novel in the rel annotation UI. One option I came up with was to first segment the story into blocks of a few paragraphs, annotate those with the rel UI, and then somehow jump up to merging those sub-clusters at the level of the whole story. This could work but has drawbacks, not the least being the cognitive load of annotating long spans of text.

Another option is to stream in binary coref decisions, e.g. "is mention A coreferent with mention B?" In that case, I would need to stop the stream for that mention when a positive answer is given, since a mention shouldn't belong to more than one cluster. Is it possible to manipulate the stream in this way?

Ideally, I would want to stream in multiple-choice annotation decisions. So the annotator would see a mention in a text block, and the multiple choice options would be each of the previously identified clusters. The choices/clusters would need to be continuously and dynamically updated. Is this possible in any way?

Hi! This is an interesting use case and your idea of clustering the mentions together sounds like a good approach. It's also something we've found very promising for other tasks like NER – a nice side-effect is that it also lets you take advantage of word distributions (see Zipf's law etc.). Mentions of the main character for instance are likely going to be very common, so you'll only need very few grouped decisions to cover the majority of all coref clusters.

I really like the idea of the binary "is this coreferent with X?" workflow. You could probably implement this by having the stream loop over your data and collect the mentions, and by updating a global/nonlocal variable in the recipe's update callback, which receives batches of answers. If the answer is "accept", you'll know that you've assigned a cluster. Because the stream is a generator and only consumed in batches, it can respond to and change based on outside state (which is also how annotation with a model in the loop works).
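Here's a minimal sketch of that binary flow (Prodigy imports omitted; the mention format and the `mention_id`/`candidate_id` fields are illustrative, not a fixed API). Because the stream is a generator, it reacts to state changed by `update`:

```python
# Mentions that already received a positive answer and thus have a cluster
assigned = set()

def make_stream(mentions):
    # mentions: list of dicts like {"id": 0, "text": "Gatsby"}
    for i, mention in enumerate(mentions):
        for candidate in mentions[:i]:  # earlier mentions as candidates
            if mention["id"] in assigned:
                break  # stop asking once this mention has a cluster
            yield {
                "text": f"Is '{mention['text']}' coreferent with '{candidate['text']}'?",
                "mention_id": mention["id"],
                "candidate_id": candidate["id"],
            }

def update(answers):
    # Prodigy calls this with batches of answered examples
    for eg in answers:
        if eg["answer"] == "accept":
            assigned.add(eg["mention_id"])
```

Since the generator checks `assigned` lazily, an "accept" answer processed by `update` stops further questions for that mention as soon as the check runs (modulo the batching delay discussed below).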

For the multiple choice annotation flow, you could probably solve this with a custom recipe and the choice UI, or maybe even a custom interface with two blocks, relations and choice. In the recipe, you'd then load in your pre-segmented text and pre-populate the "options" with some of the most common options (maybe you want to pre-annotate a small sample here). In the update callback of the recipe, you can then access the created annotations, and use them to update a global/nonlocal variable holding the options, which is then used in the stream.

# inside the recipe function, so "options" is shared by both callbacks
options = [...]  # e.g. [{"id": 0, "text": "Gatsby"}, ...]

def get_stream(stream):
    for eg in stream:
        eg["options"] = options  # read the current options lazily
        yield eg

def update(answers):
    nonlocal options
    # update options from answers, e.g. append newly created clusters
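Spelled out as a self-contained closure that runs without Prodigy (the `new_cluster` field below is just an illustrative stand-in for however you derive new options from the answers):

```python
def make_recipe(examples):
    # "options" lives in the enclosing scope, shared by both callbacks
    options = [{"id": 0, "text": "Gatsby"}]  # illustrative seed cluster

    def get_stream():
        for eg in examples:
            eg["options"] = list(options)  # attach a copy of the current options
            yield eg

    def update(answers):
        nonlocal options
        for eg in answers:
            label = eg.get("new_cluster")  # illustrative field
            if label:
                options = options + [{"id": len(options), "text": label}]

    return get_stream(), update
```

Because the generator only attaches the options when the next example is actually requested, any clusters added by `update` in the meantime show up automatically.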

One thing to keep in mind when updating the stream dynamically is that you'll typically have 1-2 examples "travelling" while the next example is already queued up, so even with a batch size of 1, it may take another example for the options to be updated. I've explained this in more detail in this thread on using a matcher in the loop:
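If that lag matters for your workflow, you can shrink the number of queued examples by lowering the batch size in the recipe's config. A fragment of the components dictionary a custom recipe returns (variable names are placeholders):

```python
return {
    "dataset": dataset,
    "stream": get_stream(stream),
    "update": update,
    "view_id": "choice",
    "config": {"batch_size": 1},  # fewer queued examples, faster option updates
}
```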

Hi Ines,

Thanks for the detailed reply! I'm happy to hear that it could be possible to implement my ideas. I will dive into it in January. If I end up with something satisfying, I will share the results with the community.
