Dynamic choices for binary long-range coreference

SandstoneGolem · December 14, 2021, 11:12am

Thank you for the work on Prodigy. I've used the standard NER annotation options and have been very happy with its speed and ease of use. Now I stand before a more complex scenario.

I want to annotate coreference relations between entity mentions across long ranges of text -- short stories. So a mention of (let's say) Gatsby on page 1 should be in a cluster with a mention of Gatsby on page 50. We are not marking pronouns. I have gold mentions to annotate.

Obviously we can't load up the entire novel in the rel annotation UI. One option I came up with was to first segment the story into blocks of a few paragraphs, annotate those with the rel UI, and then somehow jump up to merging those sub-clusters on the level of the whole story. This could work but has drawbacks, not the least being the cognitive load of annotating long spans of text.

Another option is to stream in binary coref decisions, e.g. "is coreferent with ?" In that case, I would need to stop the stream when a positive answer is given, since the mention doesn't need to belong to more than one cluster. Is it possible to manipulate the stream in this way?

Ideally, I would want to stream in multiple-choice annotation decisions. So the annotator would see a mention in a text block, and the multiple choice options would be each of the previously identified clusters. The choices/clusters would need to be updated dynamically and continuously. Is this possible in any way?

ines · December 20, 2021, 10:48am

Hi! This is an interesting use case and your idea of clustering the mentions together sounds like a good approach. It's also something we've found very promising for other tasks like NER – a nice side-effect is that it also lets you take advantage of word distributions (see Zipf's law etc.). Mentions of the main character for instance are likely going to be very common, so you'll only need very few grouped decisions to cover the majority of all coref clusters.

I really like the idea of the binary "is this coreferent with X?" workflow. You could probably implement this by having the stream loop over your data and collect the mentions, and then updating a global/nonlocal variable in the recipe in the update callback that receives batches of answers. If the answer is "accept", you'll know that you've assigned a cluster. Because the stream is a generator and only consumed in batches, it can respond to and change based on outside state (which is also how annotation with a model in the loop works).
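
A minimal sketch of that binary workflow in plain Python, without any Prodigy imports so it can be run on its own. The helper name `make_binary_coref`, the `mention_id` / `cluster_id` keys under `meta`, and the rule that an unmatched mention opens a new cluster are all assumptions for illustration; only the `"answer": "accept"` convention comes from Prodigy's task format.

```python
def make_binary_coref(mentions):
    """Hypothetical helper: mentions is a list of {"id": ..., "text": ...} dicts.

    Returns (stream, update, clusters), where clusters maps a cluster id
    to the list of mention ids assigned to it so far.
    """
    clusters = {}     # cluster id -> list of mention ids
    resolved = set()  # mention ids that already belong to a cluster

    def stream():
        for mention in mentions:
            # Snapshot the ids so the loop is not affected by later changes
            for cluster_id in list(clusters):
                if mention["id"] in resolved:
                    break  # a positive answer arrived: skip the remaining questions
                yield {
                    "text": mention["text"],
                    "meta": {"mention_id": mention["id"], "cluster_id": cluster_id},
                }
            if mention["id"] not in resolved:
                # Assumption: no accepted cluster means the mention starts a new one
                clusters[len(clusters)] = [mention["id"]]
                resolved.add(mention["id"])

    def update(answers):
        for eg in answers:
            if eg.get("answer") != "accept":
                continue
            meta = eg["meta"]
            if meta["mention_id"] not in resolved:
                clusters[meta["cluster_id"]].append(meta["mention_id"])
                resolved.add(meta["mention_id"])

    return stream(), update, clusters
```

The sketch assumes each answer reaches `update` before the next question is generated. With batched answers, a few redundant questions would still be asked after an accept, and the new-cluster step would need more care (for example, waiting until all answers for a mention are in).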

For the multiple choice annotation flow, you could probably solve this with a custom recipe and the choice UI, or maybe even a custom interface with two blocks, relations and choice. In the recipe, you'd then load in your pre-segmented text and pre-populate the "options" with some of the most common options (maybe you want to pre-annotate a small sample here). In the update callback of the recipe, you can then access the created annotations and use them to update a global/nonlocal variable holding the options, which is then used in the stream.

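def custom_recipe():
    # Stand-in for your recipe function: nonlocal needs an enclosing
    # function scope and is a syntax error at module level
    options = [...]

    def get_stream(stream):
        for eg in stream:
            # options is looked up each time an example is generated,
            # so changes made by update() show up in later examples
            eg["options"] = options
            yield eg

    def update(answers):
        nonlocal options
        # update options from answers

    # pass get_stream(...) and update on as the recipe's stream and update callback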
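
To make the "update options from answers" step more concrete, here is one possible sketch in plain Python. It assumes each incoming task carries a hypothetical `"mention"` key, and it adds a hypothetical `"NEW"` option for "none of the existing clusters"; neither is part of Prodigy. What does follow the choice interface is the shape of the data: options are dicts with `"id"` and `"text"`, and the selected option ids come back under `"accept"`.

```python
NEW_CLUSTER = "NEW"  # hypothetical option id: "none of the existing clusters"


def make_choice_coref(mention_tasks):
    """Hypothetical helper: mention_tasks is a list of dicts with "text" and "mention"."""
    cluster_names = []  # display name per cluster; the list index is the cluster id

    def build_options():
        opts = [{"id": i, "text": name} for i, name in enumerate(cluster_names)]
        opts.append({"id": NEW_CLUSTER, "text": "New cluster"})
        return opts

    def get_stream():
        for eg in mention_tasks:
            task = dict(eg)  # copy so the input data is left untouched
            # Built when the example is generated, so it reflects earlier answers
            task["options"] = build_options()
            yield task

    def update(answers):
        for eg in answers:
            if eg.get("answer") != "accept":
                continue
            if NEW_CLUSTER in eg.get("accept", []):
                # The annotator opened a new cluster: offer it from now on
                cluster_names.append(eg["mention"])

    return get_stream(), update
```

Because `cluster_names` is mutated in place rather than rebound, this version does not need `nonlocal`. Choosing an existing cluster leaves the options unchanged; only a `"NEW"` answer adds a choice.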

One thing to keep in mind when updating the stream dynamically is that you'll typically have 1-2 examples "travelling" while the next example is already queued up, so even with a batch size of 12, it may take another example for the options to be updated. I've explained this in more detail in this thread on using a matcher in the loop:

Prodigy Custom Model; Model in the Loop (matcher)

One thing to keep in mind is that Prodigy will typically queue up examples in the background as you annotate, so you don't have to wait between annotations. So even with a batch size 1, example 2 will already be requested from the stream while you annotate example 1, and so on. Prodigy will also keep one batch in the app so you can easily go back and undo, without ending up with multiple conflicting versions on the back-end if you made a mistake during annotation. So there'll always be a small delay of 2 * batch_size before the examples you've annotated hit your update callback.

IMO, this is an okay trade-off for the solution you want to implement, because you'll likely have an uneven distribution of entities anyway and the same span won't be present in every example. So it may happen that you annotate "car" in example 1, see example 2 with "car" that wasn't pre-labeled, keep annotating and then see "car" pre-labeled in example 4 and any future examples.
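
The effect of that delay can be illustrated with a toy model in plain Python. This is not Prodigy's actual queueing code: the `lag` parameter is an assumption that simply encodes "answers for a batch reach the update callback two batches later", and `simulate_delay` is a made-up name.

```python
from itertools import islice


def simulate_delay(n_examples, batch_size=1, lag=2):
    """Return (example number, answers update() had seen when it was generated)."""
    answers_received = 0  # stands in for the state that update() would change

    def stream():
        for i in range(1, n_examples + 1):
            # Looked up lazily, at the moment the example is generated
            yield (i, answers_received)

    gen = stream()
    batch_sizes = []
    generated = []
    while True:
        batch = list(islice(gen, batch_size))  # the app requests the next batch
        if not batch:
            break
        generated.extend(batch)
        batch_sizes.append(len(batch))
        if len(batch_sizes) > lag:
            # Answers for the batch pulled `lag` batches ago arrive only now
            answers_received += batch_sizes[-lag - 1]
    return generated


print(simulate_delay(5))
# -> [(1, 0), (2, 0), (3, 0), (4, 1), (5, 2)]
```

Under these assumptions, example 4 is the first one generated after the answer to example 1 has arrived, which matches the "car" scenario described above.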

SandstoneGolem · December 22, 2021, 8:35am

Hi Ines,

Thanks for the detailed reply! I'm happy to hear that it could be possible to implement my ideas. I will dive into it in January. If I end up with satisfying results, I can share them with the community.
