Mapping relationships between named entities and unlabeled spans


I apologize if this topic has been covered elsewhere, but I'd like to validate an assumption (or hear a different approach you'd suggest). To create relationships between problems, tests, and treatments (the current labels in one of our medical NER models) and body locations (for which we do not currently have a label), I believe we would first need to decouple body parts from the existing entities, so that 'knee injection' (a treatment) becomes 'knee' (a body part) and 'injection' (a treatment). Then, once body parts are a separate label, we could map relationships between them and problems/tests/treatments.
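To make the proposed relabeling concrete, here is a minimal sketch (plain Python, not a Prodigy recipe) of splitting a compound treatment span into a body-part span plus a treatment span using a small body-part lexicon. The lexicon and label names are assumptions for illustration:

```python
# Hypothetical body-part lexicon; in practice this would come from a
# terminology resource rather than a hard-coded set.
BODY_PARTS = {"knee", "shoulder", "buttocks", "anus"}

def split_compound(span_text, label):
    """Split e.g. ('knee injection', 'TREATMENT') into
    [('knee', 'BODY_PART'), ('injection', 'TREATMENT')]."""
    out, rest = [], []
    for tok in span_text.split():
        if tok.lower() in BODY_PARTS:
            out.append((tok, "BODY_PART"))
        else:
            rest.append(tok)
    if rest:
        out.append((" ".join(rest), label))
    return out

print(split_compound("knee injection", "TREATMENT"))
# [('knee', 'BODY_PART'), ('injection', 'TREATMENT')]
```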

If you agree that we need a separate label for body parts, would it make sense to relabel with the span classification recipe, so that body parts could be identified via patterns in the many cases where they overlap with something already labeled as an entity? (Body parts also appear at impractical distances in the text, e.g. 'Application to her left buttocks area 11/30/17 did alleviate some of the pain' (buttocks → pain), and in one-to-many relationships within sections (anus: no masses, mild bleeding).)
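The overlap situation described above can be sketched in plain Python: a terminology match can highlight a body-part span even when it falls inside a span that is already labeled as an entity. The term list, offsets, and labels here are illustrative assumptions, standing in for a real pattern matcher:

```python
# Hypothetical body-part term list for pattern-style pre-highlighting.
BODY_PART_TERMS = {"knee", "buttocks", "anus"}

def find_body_parts(tokens):
    """Return (start, end, label) token spans for body-part terms,
    regardless of whether they fall inside an existing entity span."""
    return [(i, i + 1, "BODY_PART")
            for i, tok in enumerate(tokens)
            if tok.lower() in BODY_PART_TERMS]

tokens = "Application to her left buttocks area did alleviate the pain".split()
existing = [(3, 6, "TREATMENT")]  # say 'left buttocks area' is already labeled
print(find_body_parts(tokens))
# [(4, 5, 'BODY_PART')] — overlaps the existing TREATMENT span
```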

We've had great success using Prodigy to fine-tune and build NER models that are impressively performant with relatively little data, and we've been helped along the way by the thorough support you provide in these forums. Thank you for everything you do.



If you're looking at a challenge with overlapping spans, the spancat is definitely the way forward. How you want to predict the entities here depends a bit on the data though. While "knee injection" can be seen as two entities, "dental treatment" would be more awkward to split up. Then again, for sentences like the one around "buttocks" where the treatment and body part are not mentioned in a continuous span, you probably have to split them up into two entities anyway - there's no good solution otherwise.
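The core difference can be shown with a tiny sketch: token-offset spans like these overlap, which NER-style annotation (one non-overlapping sequence of entities) cannot represent, while a span-group representation can hold both. Offsets and labels are illustrative assumptions:

```python
def overlaps(a, b):
    """True if two (start, end) token spans overlap."""
    return a[0] < b[1] and b[0] < a[1]

spans = [((0, 2), "TREATMENT"),   # 'knee injection'
         ((0, 1), "BODY_PART")]   # 'knee'

(a, _), (b, _) = spans
print(overlaps(a, b))
# True — NER would reject this pair, spancat can predict both
```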

Ultimately, what it always boils down to is: what is the "easiest" way for the model to learn the information? If the entities are typically mentioned together and used as one phrase within the sentence, the model might find it easier to recognize them as one. A good proxy for determining what is "easiest" is to do some of the annotation yourself and try out both schemes. Which of the two feels more natural and is easier to do? Which feels more intuitive to a human, interpreting language as we do? Chances are high that this will correlate with what is easier for the model too (and thus, eventually, with higher accuracy).

I might have strayed a bit from the original question - let me know if this helps or not!
