I'm annotating semantic relations on the NER result created by spans.manual. But when I run the rel.manual recipe, the nested entities will be overlapped by the maximum span. Below is an example.
As you can see, “ranking blog posts” is labeled as “Task” while “blog posts” is labeled as “Material”. But in the relations UI, only “ranking blog posts” is shown as an entity, which prevents me from annotating the relation between “blog posts” and “relevance”.
I have read some discussion about nested spans and learned about solutions such as running multiple passes on the data or encoding entity types with relation labels. These solutions are good, but there are some difficulties for me to implement them. I want to know if there is an easier method to solve this problem.
The relations UI currently only visualises one level of spans. You're already able to connect every token to every other token or span, so including overlapping and nested spans in the same interface can easily get very messy, and you'd lose a lot of the advantage of visualising relationships as arrows like this.
If your data includes this level of complexity, one option could be to focus on connecting one concept at a time, for example, relationships of Task spans to each other, or relationships between Material spans. This also lets you narrow the scope and only allow annotations that are actually valid/useful: Is it possible for tasks to be hyponyms of materials? If not, you probably don't want to even include this option. If tasks can only be hyponyms of tasks, you also know that this is not an option to consider for examples that only contain a single Task. All of these are decisions that can make the annotation process a lot faster and more efficient.
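As a rough illustration of the "one concept at a time" idea, here is a minimal sketch that filters pre-annotated spans down to a chosen set of labels before the examples are loaded for relation annotation. It assumes each example is a dict with "text" and "spans" keys, where every span carries "start", "end" and "label"; the helper name keep_labels and the "Metric" label for "relevance" are made up for this example, and real exported spans would typically also carry token offsets.

```python
from typing import Iterable


def keep_labels(task: dict, labels: Iterable[str]) -> dict:
    """Return a copy of the task that only keeps spans with the given labels."""
    wanted = set(labels)
    filtered = dict(task)  # shallow copy so the original task is left untouched
    filtered["spans"] = [s for s in task.get("spans", []) if s["label"] in wanted]
    return filtered


# Hypothetical example mirroring the nested Task/Material case from the question.
example = {
    "text": "ranking blog posts by relevance",
    "spans": [
        {"start": 0, "end": 18, "label": "Task"},       # "ranking blog posts"
        {"start": 8, "end": 18, "label": "Material"},   # "blog posts" (nested)
        {"start": 22, "end": 31, "label": "Metric"},    # "relevance" (assumed label)
    ],
}

# One pass without Task spans, so the nested Material span is no longer hidden.
material_pass = keep_labels(example, ["Material", "Metric"])
print([example["text"][s["start"]:s["end"]] for s in material_pass["spans"]])
```

Each filtered pass could then be written out to its own file and annotated separately, with the label set restricted to the relations that are valid for that pass.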
In this specific case, another thing to think about is whether the relations view is actually the right conceptual model for this. For example, if your goal is to detect hyponyms based on pre-annotated phrases (that you already know), directional relationships aren't necessarily the best way to express this idea. What you're annotating here is "one or more instances of the existing spans relate to the same concept". So you could frame the same thing as a binary/multiple choice question about task A and B.
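To make the multiple-choice framing a bit more concrete, here is a small sketch that turns every pair of pre-annotated spans with the same label into one question. The "options" list with "id" and "text" follows the usual shape of a choice task; the function name pair_questions, the "span_a"/"span_b" keys and the option wording are assumptions made up for this example, not part of any recipe.

```python
from itertools import combinations


def pair_questions(task, label, options=("same concept", "different concepts")):
    """Build one multiple-choice question per pair of spans with the given label."""
    text = task["text"]
    spans = [s for s in task.get("spans", []) if s["label"] == label]
    questions = []
    for a, b in combinations(spans, 2):
        questions.append({
            "text": text,
            # Hypothetical keys holding the two phrases being compared.
            "span_a": text[a["start"]:a["end"]],
            "span_b": text[b["start"]:b["end"]],
            "options": [{"id": o, "text": o} for o in options],
        })
    return questions


def span(text, phrase, label):
    """Helper: locate the first occurrence of a phrase and return it as a span dict."""
    start = text.index(phrase)
    return {"start": start, "end": start + len(phrase), "label": label}


sentence = "We compare task A with task B on the same data"
example = {
    "text": sentence,
    "spans": [span(sentence, "task A", "Task"), span(sentence, "task B", "Task")],
}
questions = pair_questions(example, "Task")
print(questions[0]["span_a"], "|", questions[0]["span_b"])
```

Because each question only ever shows two phrases, nested or overlapping spans are no longer a display problem, and examples with a single span of that label produce no questions at all.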
I’m sorry this example may be misleading. I just showed necessary labels instead of the complete label set for convenience. The complete annotation task is more complex.
Your suggestion is inspiring. I hope the annotation system is able to handle various situations, but just as you say, many options can be excluded from the start.
But I still have another problem: how to handle nested entities of the same type. Below is an example (again, only the necessary labels are shown).
Both “Hidden Conditional Ordinal Random Field ( H-CORF ) framework” and “H-CORF” are labeled as Method. Following our rules, they have the relation “SYNONYM-OF”, which means that the two spans represent the same entity.
I have considered changing our rules. For example, if we just label “Hidden Conditional Ordinal Random Field” and “H-CORF”, the problem won’t occur. I want to know if there are other, better solutions.
Thanks for your help!
What are you doing with the data later on and is this something your model will be trying to predict? I think this is an example where framing the problem as a relation extraction task is making it a lot harder than it should be. First, the model needs to extract the spans correctly, next it needs to predict the relationship correctly, using the features that are pretty obvious (first letters of the previous expression) but that might not be factored in because it's not the case for all synonyms. So you're essentially framing an abbreviation expansion task as a much more complex relation extraction problem with a lot more potential for annotation errors and predicted false positives and negatives.
On a related note, you might find scispaCy's approach to abbreviation detection interesting for covering synonyms that are abbreviations: GitHub - allenai/scispacy: A full spaCy pipeline and models for scientific/biomedical documents.
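To show how far the "first letters of the previous expression" cue alone can get, here is a deliberately naive rule-based sketch. It is not scispaCy's algorithm and not a replacement for it; the function names and the matching rule (the parenthesised short form must equal the initials of the words directly before it) are assumptions chosen for illustration.

```python
import re


def initials_match(long_form: str, short_form: str) -> bool:
    """True if the initials of the long form spell the short form (ignoring case and punctuation)."""
    initials = "".join(w[0] for w in re.findall(r"[A-Za-z0-9]+", long_form))
    letters = re.sub(r"[^A-Za-z0-9]", "", short_form)
    return bool(letters) and initials.lower() == letters.lower()


def find_abbreviations(text: str):
    """Return (long form, short form) pairs for parenthesised abbreviations in the text."""
    pairs = []
    for match in re.finditer(r"\(\s*([^()]+?)\s*\)", text):
        short = match.group(1)
        # One preceding word is expected per letter or digit in the short form.
        n = len(re.sub(r"[^A-Za-z0-9]", "", short))
        words = text[:match.start()].split()
        if n == 0 or len(words) < n:
            continue
        candidate = " ".join(words[-n:])
        if initials_match(candidate, short):
            pairs.append((candidate, short))
    return pairs


sentence = "the Hidden Conditional Ordinal Random Field ( H-CORF ) framework"
print(find_abbreviations(sentence))
```

A rule like this misses abbreviations that skip words or use inner letters, which is where a dedicated abbreviation detector is more robust. The point is only that this kind of synonym can be resolved after span extraction instead of being annotated as a relation.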
In general, I think moving towards more consistent units and avoiding different overlapping spans for different relation types is a better approach. You'll be able to collect much more consistent data if your annotation scheme is stricter about the spans and syntactic constituents you want to annotate. You don't want it to be too easy to introduce mistakes and inconsistencies during annotation.
Also, it obviously depends on your downstream application and what your goal is, but if you're looking to train a model to make these predictions and want to use the result in some other process, it's usually very important to make the spans consistent syntactic units. Your model becomes a lot less useful if it outputs completely arbitrary spans. I've explained this a bit more in this section of the "Applied NLP Thinking" post: Applied NLP Thinking: How to Translate Problems into Solutions · Explosion
Thanks for your help! I have solved this problem now!