Is Prodigy suitable for cross-document coreference resolution with diverse types of entities and references?

I am looking for a tool where I can annotate coreferences across documents. Additionally, the tool should allow for defining several types of (co)reference. Furthermore, it should be possible to define several types of entities according to a hierarchical tree, e.g. a mention "Joe Biden" would be marked as being of the entity type "Person" and could then be connected to a super-entity "USA" (regardless of whether this super-entity is also mentioned in the document or not).

Just from its live demo version, I cannot tell whether Prodigy as a tool can do all of that or not. And I don't want to buy a license first only to find out that it can't. So from your experience, would Prodigy be the right annotation tool for me? Thanks a lot for any kind of advice!! :slight_smile:

Hi @jakob.vogel,

In a complex project like the one you're describing, it's always good to break it down into individual NLP tasks. Typically, each such NLP task would require a designated annotation project. Even though this means several passes over the data are required, it's usually better from a data-quality point of view, because annotators can focus on one task at a time and it's easier to perform quality checks such as computing inter-annotator agreement.

Let me start by suggesting such a breakdown, with some definitions of the NLP tasks involved, just to make sure we are on the same page:

Step 1. Named Entity Recognition (NER) would take care of recognizing entity mentions in the text. You mention the requirement for hierarchical categories. Hierarchical NER annotations are very easy to set up with a bit of custom scripting on top of Prodigy's ner.manual recipe. Please check this example for one idea of how that can be done; there's also a minimal sketch below.
As a result of this first step, you'd end up with a NER-annotated dataset you could use as input to step two.
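To make this a bit more concrete, here is a minimal sketch of such a custom recipe, assuming a Prodigy v1.11-style recipe API and one simple way of handling the hierarchy: flattening it into path-style labels like "PERSON/POLITICIAN". The hierarchy itself and the file paths are placeholder assumptions you'd replace with your own entity type tree:

```python
import prodigy
import spacy
from prodigy.components.loaders import JSONL
from prodigy.components.preprocess import add_tokens

# Hypothetical two-level entity type tree, flattened into "PARENT/CHILD"
# labels so the hierarchy is visible in the ner_manual label set.
HIERARCHY = {
    "PERSON": ["POLITICIAN", "SCIENTIST"],
    "ORG": ["GOVERNMENT", "COMPANY"],
}
LABELS = [
    f"{parent}/{child}"
    for parent, children in HIERARCHY.items()
    for child in children
]

@prodigy.recipe(
    "ner.manual.hierarchical",
    dataset=("Dataset to save annotations to", "positional", None, str),
    source=("Path to a JSONL file with {'text': ...} records", "positional", None, str),
)
def ner_manual_hierarchical(dataset: str, source: str):
    nlp = spacy.blank("en")  # tokenizer only; ner_manual needs tokenized input
    stream = JSONL(source)
    stream = add_tokens(nlp, stream)
    return {
        "dataset": dataset,
        "stream": stream,
        "view_id": "ner_manual",
        "config": {"labels": LABELS},
    }
```

You'd then run it like any custom recipe, e.g. `prodigy ner.manual.hierarchical my_ner_dataset ./docs.jsonl -F recipe.py`. For deeper trees, a second annotation pass per level (as in the linked example) usually scales better than one very long label list.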

Step 2. Cross-Document Coreference (CDCR) would take care of mapping the entities annotated in Step 1 to some real-world entities. From the annotation perspective, Cross-Document Coreference is actually more similar to Entity Linking (EL) than to "typical" single-document Coreference Resolution (for which you'd use Prodigy's relations interface). Let me point out the difference here: while both CDCR and EL involve linking mentions of entities to some form of representation, they serve slightly different purposes. CDCR is about understanding and linking mentions within and across texts, while Entity Linking is about connecting mentions to a knowledge base, allowing for a deeper understanding of the entities mentioned in the text by leveraging external structured information. It's unclear to me what you're trying to achieve with the Cross-Document Coreference annotations, but judging by the information provided, the annotation interface could be similar to that of Entity Linking, in that the annotator would select a real-world entity label for each NER label from Step 1. And that's definitely achievable in Prodigy.

For a better idea of what the recipe for Step 2 might look like, you can check this custom Entity Linking recipe in Prodigy (around minute 12). Like I said, Entity Linking and Cross-Document Coreference are clearly not the same, but the annotation interface could be similar; see the sketch below.
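As an illustration, here's a minimal sketch of such an EL-style recipe using Prodigy's choice interface. The candidate entities are hardcoded placeholders, and the get_candidates helper is hypothetical: in a real recipe you'd generate options per mention, e.g. from a spaCy KnowledgeBase or your own entity registry, as shown in the video linked above:

```python
import prodigy
from prodigy.components.loaders import JSONL

def get_candidates(task):
    # Hypothetical candidate generator: in a real recipe you would look up
    # candidate entities for the highlighted mention (task["spans"]) in a
    # knowledge base and return one option per candidate.
    return [
        {"id": "Q6279", "text": "Joe Biden - 46th president of the USA"},
        {"id": "Q326985", "text": "Joseph R. Biden Sr. - father of Joe Biden"},
        {"id": "NIL", "text": "None of the above"},
    ]

@prodigy.recipe(
    "entity_linker.manual",
    dataset=("Dataset to save annotations to", "positional", None, str),
    source=("JSONL file with pre-annotated NER spans from Step 1", "positional", None, str),
)
def entity_linker_manual(dataset: str, source: str):
    def add_options(stream):
        for task in stream:
            task["options"] = get_candidates(task)
            yield task

    stream = add_options(JSONL(source))
    return {
        "dataset": dataset,
        "stream": stream,
        "view_id": "choice",
        # Accept the task automatically once a single option is selected.
        "config": {"choice_style": "single", "choice_auto_accept": True},
    }
```

The accepted option's id then records which real-world entity each mention refers to, which is exactly the cross-document grouping you're after: all mentions that resolve to the same id corefer, whether or not that entity is mentioned in the same document.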

In summary, Prodigy does not provide an out-of-the-box interface for end-to-end annotation as described. But it does support solving this problem step by step, by annotating the data for each step/NLP task separately, which, as mentioned above, is the recommended way to approach complex annotation projects like this one. It will require writing custom recipes in Python, but we provide excellent documentation and, of course, this support forum in case help is required.

Thanks a lot, that's very helpful!! I'll look into the two steps as you described them and see if that works for me. :slight_smile: