Hi. I'm trying to create a custom co-reference model and I'd like to use neuralcoref to give me a headstart manually annotating my data with co-reference relationships. Is there an easy way to format neuralcoref inferences for prodigy rel recipes? Thanks!
Thanks for your message and welcome to the Prodigy community!
Unfortunately, we don't have an off-the-shelf "correct" recipe for neuralcoref. As the ticket below mentions, the key is to understand the required format (which I think is part of your question):
Also, you may want to look at the dep.correct recipe to see how a "correct" recipe works with the "relations" user interface (note that the link also shows the format for the "relations" UI, which may be helpful too). To view this recipe, run:
python -m prodigy stats
where you should then see the Location: field showing where Prodigy has been installed. In that folder, look for the file recipes/dep.py, where you'll see the dep.correct recipe's implementation.
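If you'd rather locate that folder programmatically, here's a small sketch using only the standard library (the prodigy path in the comment is just an illustration of how you'd apply it):

```python
import importlib.util
from pathlib import Path

def package_dir(package: str) -> Path:
    """Return the directory where an installed package's files live."""
    spec = importlib.util.find_spec(package)
    if spec is None or spec.origin is None:
        raise ModuleNotFoundError(package)
    return Path(spec.origin).parent

# With Prodigy installed, the built-in recipes would then be at e.g.:
# package_dir("prodigy") / "recipes" / "dep.py"
```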
Similarly, you may find the coref.manual recipe to be helpful too, which you can find in recipes/coref.py. Essentially, you'd want to combine the idea of "correcting" the neuralcoref model's predictions with the coref.manual recipe.
If you're able to get a recipe working, feel free to post back and/or share it as a GitHub gist! We would greatly appreciate it. Hope this helps!
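To give a rough idea of the direction, here's a minimal sketch of turning coreference clusters into Prodigy "relations"-style dicts. It assumes you've already extracted each cluster as a list of (token_start, token_end) mention tuples (e.g. from neuralcoref's doc._.coref_clusters); the exact task keys should be double-checked against the "relations" UI docs:

```python
def clusters_to_relations(clusters, label="COREF"):
    """Turn coreference clusters into Prodigy "relations"-style dicts.

    Each cluster is a list of (token_start, token_end) mention tuples
    (token_end inclusive); every mention is linked back to the cluster's
    first mention, since direction doesn't matter for coref scoring.
    """
    relations = []
    for cluster in clusters:
        first = cluster[0]
        for mention in cluster[1:]:
            relations.append({
                "head": first[1],
                "child": mention[1],
                "head_span": {"token_start": first[0], "token_end": first[1], "label": label},
                "child_span": {"token_start": mention[0], "token_end": mention[1], "label": label},
                "label": label,
            })
    return relations

# One cluster: "the doctor" (tokens 0-1) coreferent with "they" (token 6)
rels = clusters_to_relations([[(0, 1), (6, 6)]])
```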
Hey @ryanwesslen, thanks for getting back to me and answering my question. If I do end up creating a custom recipe, I will 100% share it here.
I have a few additional follow-up questions. When using either the coref or rel recipe to annotate coreferences, I'm a little confused about which direction the relationship should go. I read somewhere in the Prodigy documentation that it's not so important with coref because they are simply pairs of references, but wouldn't this have an impact when it comes to training a model? Also, how should you deal with pronouns that link to multiple entities, e.g. "The doctor and nurse saw the patient. They did a great job."? Should you assign a reference from 'doctor' to 'they' and another from 'nurse' to 'they'? In this case, isn't the head/child of the relationship important? I'm not sure which should be the head and which should be the child.
It's a subtle point, but the scoring (including the loss calculation) only cares about whether two mentions are in the same cluster; it doesn't have a notion of direction in relationships. You can think of this as an undirected graph.
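One way to convince yourself of this: if you normalize each (head, child) pair into an undirected edge, annotations drawn in either direction come out identical. A quick sketch:

```python
def as_undirected_edges(relations):
    """Normalize (head, child) pairs so direction is ignored."""
    return {frozenset((r["head"], r["child"])) for r in relations}

# The same coref link annotated in both directions:
forward = [{"head": 0, "child": 5}]
backward = [{"head": 5, "child": 0}]
assert as_undirected_edges(forward) == as_undirected_edges(backward)
```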
For more details, I would highly recommend our recent deep dive post on Neural Coreference Resolution in spaCy:
You would link "the doctor and the nurse" as one mention to "they". But in this case, this means that "the nurse" cannot be linked separately (it can in the data, but the model can't find it). This is the "split heads" issue referenced in the blog post.
More precisely, the problem is treated as a clustering problem over non-overlapping spans in a document. The non-overlapping constraint renders the system incapable of handling the "split antecedent" problem. For example, in "Alice and Bob said they like cheese, but he prefers sushi.", the pronoun "they" refers to "Alice and Bob" and "he" refers to "Bob". However, the span "Bob" is inside "Alice and Bob", so we have to choose to either resolve "they" to "Alice and Bob" or "he" to "Bob". The lack of split antecedent handling is a limitation of many coreference resolution systems, including ours.
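To make the constraint concrete, here's a tiny sketch with character offsets for that sentence, showing why "Bob" and "Alice and Bob" can't both be mentions under a non-overlapping scheme:

```python
def spans_overlap(a, b):
    """True if two (start, end) character spans overlap (end exclusive)."""
    return a[0] < b[1] and b[0] < a[1]

text = "Alice and Bob said they like cheese, but he prefers sushi."
alice_and_bob = (0, 13)   # text[0:13] == "Alice and Bob"
bob = (10, 13)            # text[10:13] == "Bob"

# The spans overlap, so a non-overlapping system must pick one of them.
assert spans_overlap(alice_and_bob, bob)
```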
As the post outlines, the spaCy core team's work on the coref component is still experimental, but it could likely be very helpful for you. Since it's experimental, we haven't fully integrated it into Prodigy yet, but there's a lot of opportunity with a custom recipe. If you have more questions, I would suggest posting on the spaCy discussions forum, as that's where the spaCy core team answers spaCy-specific questions (this forum is for Prodigy-specific questions).
Thanks for this information, it's really helpful!
Also, be on the lookout very soon for an accompanying spaCy coref video tutorial (with code too!).
I just saw a sneak preview and it's an excellent summary of the post!
Just released the new experimental coref video by Edward and team:
Here's the GitHub code too.