Span vs NER, compatibility with transformers models

Hi, what a great community! We are trying to extract structured information in the medical domain, and we have some questions that may benefit more people.

Our aim, starting from this clinical note: "Multifocal unspecified and lobular breast carcinoma, g1, re100%, rpg100%, negative HER2, ki 10%, pt1b, pn0 cm0 stage IA TNM 8th edition, treated with chemotherapy."

We want to recognize small entities such as cancer type or cancer location, and also to know whether these entities are related to the treatment.

We propose two different approaches:

  1. Train one NER model to extract the small entities, then another to extract big entities such as disease, stage, testing, and treatment, and finally a relation model that links the big entities only, because with spaCy you can check whether a small entity falls inside a big entity.
  2. Train one span categorization (spancat) model that annotates both big and small entities, and then the relation model.

My questions:

  1. Is the span model compatible with relation extraction?
  2. Can these models be trained with transformer models using spaCy?

If someone could contribute, I think it would be a great discussion, not only for our project but also for others.

Hi @Alvaro8gb!

Thanks for your message. That sounds like a fascinating project.

Check out this post from spaCy GitHub Discussions:

The example REL component doesn't work out of the box with spancat, but it should be possible to make it work. You'd need to modify the code to use the spangroups assigned by your spancat instead of entities on the Doc.

It looks like the only place you'd need to modify to get it working is the instance generator. That's designed so that you can register your own alternative generator instead, too, so you can copy it, give it a new name, and modify your config accordingly. The evaluation script would also need modification, and technically the component could be changed to not rely on doc.ents, but that's more of a bookkeeping detail, and shouldn't affect functionality.
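To make the idea concrete, here's a minimal sketch of what a spangroup-based instance generator could look like. The registry name `rel_instance_generator.v2` and the span key `"sc"` are assumptions for illustration; match them to your own config and spancat settings.

```python
# Sketch: an instance generator that pairs candidate spans from a span
# group instead of doc.ents, so a REL component can consume spancat output.
from typing import Callable, List, Tuple

import spacy
from spacy.tokens import Doc, Span


@spacy.registry.misc("rel_instance_generator.v2")
def create_span_instances(max_length: int, span_key: str = "sc") -> Callable:
    """Build a function that yields candidate (span, span) pairs."""

    def get_instances(doc: Doc) -> List[Tuple[Span, Span]]:
        spans = doc.spans[span_key] if span_key in doc.spans else []
        instances = []
        for s1 in spans:
            for s2 in spans:
                # Skip self-pairs and spans that start too far apart
                if s1 != s2 and abs(s2.start - s1.start) <= max_length:
                    instances.append((s1, s2))
        return instances

    return get_instances
```

You'd then point the REL component's `get_instances` setting at this registered function in your config.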

For ner and spancat, there are a lot of relevant posts about transformers on spaCy's GitHub.
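On the transformers question: swapping in a transformer is mostly a config change. A minimal, illustrative fragment (requires spacy-transformers; the model name is just an example choice):

```ini
[components.transformer]
factory = "transformer"

[components.transformer.model]
@architectures = "spacy-transformers.TransformerModel.v3"
name = "roberta-base"
```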

If you're training spancat, be aware that memory can be an issue if you're not careful. This is especially the case when you have long spans. Typically, modifying the suggester function or batch size can help (see this post).
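For example, limiting the n-gram suggester to short candidate spans keeps the number of candidates (and memory use) down. The sizes below are illustrative; tune them to the longest spans in your annotations:

```ini
[components.spancat.suggester]
@misc = "spacy.ngram_suggester.v1"
sizes = [1, 2, 3]
```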

If you're interested in using transformers for the rel_component, Sofie recently released an accompanying blog (see transformer section):

If you run into issues, I'd suggest posting on the spaCy GitHub Discussions forum. The spaCy core team supports that forum (this forum is mainly for Prodigy-specific questions), and they can help more if you have a config.cfg file you're debugging.

Not specific to transformers, but since you mentioned considering NER vs. spancat, have you seen the spaCy team's ner_spancat_compare template project?

It provides an interesting experiment comparing ner and spancat performance on biomedical literature. They also do an excellent job of exploring span-characteristic metrics to provide intuition as to how well spancat will identify correct spans.

If you have spaCy installed, you can clone this project by running `spacy project clone experimental/ner_spancat_compare`. You can then fetch the assets with `spacy project assets` and run the full workflow with `spacy project run all`.

Also, have you considered a rule-based approach? Are you aware of the spanruler, which supports overlapping spans like spancat?

My colleagues @victorialslocum and @ljvmiranda921 have created a template project and an accompanying blog post for using spanruler with ner, which may add another option with rules.
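As a quick taste of the spanruler, here's a hedged sketch on a fragment of your clinical note. The labels and patterns are illustrative assumptions, not a real annotation scheme:

```python
# Sketch: rule-based overlapping spans with spaCy's span_ruler component.
import spacy

nlp = spacy.blank("en")
ruler = nlp.add_pipe("span_ruler")
ruler.add_patterns([
    {"label": "HER2_STATUS", "pattern": [{"LOWER": "negative"}, {"LOWER": "her2"}]},
    {"label": "BIOMARKER", "pattern": [{"LOWER": "her2"}]},
])

doc = nlp("negative HER2, ki 10%")
# Unlike the entity_ruler, overlapping matches are all kept in doc.spans["ruler"]
for span in doc.spans["ruler"]:
    print(span.text, span.label_)
```

Note that both the big span ("negative HER2") and the small span nested inside it ("HER2") survive, which matches your use case of big and small entities.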

Hope this helps!

Thank you very much, Ryan, for the complete answer; it has helped me a lot.
