Hi! This is an interesting use case and your idea of clustering the mentions together sounds like a good approach. It's also something we've found very promising for other tasks like NER – a nice side-effect is that it also lets you take advantage of word distributions (see Zipf's law etc.). Mentions of the main character for instance are likely going to be very common, so you'll only need very few grouped decisions to cover the majority of all coref clusters.
I really like the idea of the binary "is this coreferent with X?" workflow. You could probably implement this by having the stream loop over your data and collect the mentions, and then updating a global/nonlocal variable in the recipe in the `update` callback that receives batches of answers. If the answer is `"accept"`, you'll know that you've assigned a cluster. Because the stream is a generator and only consumed in batches, it can respond to and change based on outside state (which is also how annotation with a model in the loop works).
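To make the shared-state idea concrete, here's a minimal standalone sketch in plain Python (not Prodigy's actual recipe API – the mention texts, the `"COREFERENT_WITH_X"` label and the exact-match grouping are all made up for illustration). The stream generator reads shared state that the `update` callback mutates between batches:

```python
clusters = {}  # mention text -> cluster id (hypothetical exact-match grouping)

def get_stream(mentions):
    for mention in mentions:
        # skip mentions that an earlier "accept" already assigned
        if mention in clusters:
            continue
        yield {"text": mention, "label": "COREFERENT_WITH_X"}

def update(answers):
    # receives batches of answers; "accept" means the mention belongs to X
    for eg in answers:
        if eg["answer"] == "accept":
            clusters[eg["text"]] = "X"
```

Because the generator checks `clusters` lazily on each iteration, any mention accepted in an earlier batch is skipped the next time it comes up in the stream.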
For the multiple choice annotation flow, you could probably solve this with a custom recipe and the `choice` UI, or maybe even a custom interface with two blocks, `relations` and `choice`. In the recipe, you'd then load in your pre-segmented text and pre-populate the `"options"` with some of the most common options (maybe you want to pre-annotate a small sample here). In the `update` callback of the recipe, you can then access the created annotations and use them to update a global/nonlocal variable of the options, which is then used in the stream.
```python
options = [...]

def get_stream(stream):
    for eg in stream:
        eg["options"] = options
        yield eg

def update(answers):
    nonlocal options
    # update options from answers
```
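Since `nonlocal` only works from inside an enclosing function, in a real recipe both callbacks would be defined inside the recipe function body. Here's a hypothetical standalone version of that pattern – the option dict format and the "start a new cluster when nothing was selected" policy are just illustrative assumptions, not what your recipe has to do:

```python
def make_callbacks(initial_options):
    options = list(initial_options)

    def get_stream(stream):
        for eg in stream:
            # attach a copy of the current options to each example
            eg["options"] = list(options)
            yield eg

    def update(answers):
        nonlocal options
        for eg in answers:
            # illustrative policy: if no existing cluster was selected,
            # start a new cluster named after the mention itself
            if not eg.get("accept"):
                options = options + [{"id": eg["text"], "text": eg["text"]}]

    return get_stream, update
```

The closure over `options` is exactly what the global/nonlocal variable gives you: `update` rebinds it, and every example the generator produces afterwards picks up the new list.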
One thing to keep in mind when updating the stream dynamically is that you'll typically have 1-2 examples "travelling" while the next example is already queued up, so even with a batch size of 1, it may take another example or two before the options are updated. I've explained this in more detail in this thread on using a matcher in the loop:
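You can see the lag in a tiny simulation (the pre-fetch depth of one example here is just an assumption for illustration): an example that was already created before `update` runs still carries the old options.

```python
options = ["CLUSTER_A"]

def get_stream(texts):
    for text in texts:
        # options are baked in at the moment the example is created
        yield {"text": text, "options": list(options)}

stream = get_stream(["ex1", "ex2", "ex3"])
queued = next(stream)        # ex1 is sent out to the annotator
prefetched = next(stream)    # ex2 is already created with the old options
options.append("CLUSTER_B")  # the update callback adds a new option now
fresh = next(stream)         # only ex3 picks up the change
```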