Spancat: use of embeddings, compatibility with transformers, upstream to relationship extraction

I first posted a question in "Mapping relationships between named entities and unlabeled spans", which @SofieVL was kind enough to answer, and her reply cleared up a lot for me. I'm starting a more focused topic here because I think it might be useful for the community.

First, as with NER, can we leverage pretrained word vectors for spancat? And can spancat be adapted to work with transformer models? (Adapting data labeled in Prodigy to a compatible format for NER was just a simple shift to IOB tags...)

Assuming spancat is the best option for us, given that we have overlapping entities, we have two types of relationship use cases we'd like to replicate. In the first type, one span is overlapping with another.

In the second, the spans are separated.

It seems like the relationship extraction recipe would let us label the relationships between body locations and other labels where the spans are separated. Overlapping spans wouldn't show up there, but relating spans that overlap shouldn't require a model anyway. Having not modeled relationships before, I'm just trying to validate my assumptions here; any feedback would be appreciated.
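To make the "overlapping spans shouldn't require a model" assumption concrete, here is a minimal rule-based sketch in plain Python. The `Span` tuple, the `overlapping_pairs` helper, the label names `BODY_PART` and `FINDING`, and the example sentence are all hypothetical and only there for illustration; token offsets are assumed to be end-exclusive.

```python
from typing import List, NamedTuple, Tuple


class Span(NamedTuple):
    start: int  # index of the first token (inclusive)
    end: int    # index one past the last token (exclusive)
    label: str


def overlapping_pairs(
    spans: List[Span], head_label: str = "BODY_PART"
) -> List[Tuple[Span, Span]]:
    """Pair each span labeled `head_label` with every differently
    labeled span that shares at least one token with it."""
    pairs = []
    for head in spans:
        if head.label != head_label:
            continue
        for other in spans:
            if other.label == head_label:
                continue
            # Two half-open intervals overlap when each starts before the other ends.
            if head.start < other.end and other.start < head.end:
                pairs.append((head, other))
    return pairs


# Hypothetical example; labels and text are illustrative only.
tokens = ["Patient", "has", "tonsillar", "enlargement", "and", "pain", "in", "the", "knee"]
spans = [
    Span(2, 4, "FINDING"),    # "tonsillar enlargement"
    Span(2, 3, "BODY_PART"),  # "tonsillar" (overlaps the finding)
    Span(5, 6, "FINDING"),    # "pain"
    Span(8, 9, "BODY_PART"),  # "knee" (separated from "pain")
]
pairs = overlapping_pairs(spans)
for head, other in pairs:
    print(" ".join(tokens[head.start:head.end]), "->", " ".join(tokens[other.start:other.end]))
```

In this sketch only the overlapping pair is linked by the rule; the separated pair ("pain" and "knee") is exactly the case that would still need manual relation annotation or a trained model.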

Thank you!

Hi!

There are quite a few questions here, so let me try and address them one by one:

Whether pretrained word vectors are used is actually a setting of the tok2vec component or sublayer, just as it is for the NER model. For instance, HashEmbedCNN has an option called pretrained_vectors, and MultiHashEmbed has include_static_vectors, so you could do something like this:

[components.spancat.model.tok2vec]
@architectures = "spacy.Tok2Vec.v1"

[components.spancat.model.tok2vec.embed]
@architectures = "spacy.MultiHashEmbed.v1"
include_static_vectors = true
...

And yes, spancat should work with transformers just like NER does: the tok2vec sublayer would just point to a TransformerListener instead of a Tok2Vec model or Tok2VecListener. If you're unsure about these terms, it probably makes sense to dive a bit more into spaCy's documentation here: Embeddings, Transformers and Transfer Learning · spaCy Usage Documentation

I do want to add that we haven't tested the spancat extensively yet with transformers (this is ongoing). If you run into issues with that, you can report them on spaCy's discussion forum: explosion/spaCy · Discussions · GitHub

Personally, as I wrote on the other thread, I wouldn't extract "tonsillar" as a separate entity as it feels like "tonsillar enlargement" is the actual entity that you'd annotate as a whole. It's difficult to say up-front what the best annotation scheme will be, and the proof will be in the pudding...

Thanks, Sofie. Tonsillar enlargement is definitely a span of interest. But do you see any issue with using spancat to label tonsillar as a separate, overlapping span? NER has worked extremely well for capturing e.g. tonsillar enlargement, but we want to be able to add body parts and potentially other labels (and then relate them). I read your original response (Mapping relationships between named entities and unlabeled spans - #2 by SofieVL) as a warning about how splitting up the entities might hinder the model from learning, but whereas NER would require us to split the entities, spancat wouldn't. So for that reason, spancat seems like a better fit for our use case. Would you push back on that?
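To illustrate why I think NER forces the split, here is a small plain-Python sketch of an IOB conversion. It is not spaCy's or Prodigy's own converter; the `spans_to_iob` helper and the `FINDING` / `BODY_PART` labels are hypothetical. Since IOB assigns exactly one tag per token, a second span covering an already tagged token cannot be encoded.

```python
from typing import List, Tuple


def spans_to_iob(n_tokens: int, spans: List[Tuple[int, int, str]]) -> List[str]:
    """Convert (start, end, label) token spans (end exclusive) to IOB tags.
    Raises ValueError when two spans share a token, because IOB allows
    only one tag per token."""
    tags = ["O"] * n_tokens
    for start, end, label in spans:
        if any(tag != "O" for tag in tags[start:end]):
            raise ValueError(f"span ({start}, {end}, {label}) overlaps an existing span")
        tags[start] = f"B-{label}"
        for i in range(start + 1, end):
            tags[i] = f"I-{label}"
    return tags


tokens = ["tonsillar", "enlargement"]

# A single span over both tokens converts without trouble.
flat = spans_to_iob(len(tokens), [(0, 2, "FINDING")])

# Adding "tonsillar" as its own overlapping span cannot be expressed in IOB.
try:
    spans_to_iob(len(tokens), [(0, 2, "FINDING"), (0, 1, "BODY_PART")])
    overlap_ok = True
except ValueError:
    overlap_ok = False

print(flat, overlap_ok)
```

With spancat the spans are stored as a group on the doc (doc.spans) rather than as one tag per token, which is why both annotations can coexist there.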

No - if you're keen on trying this out, I'd say go for it and let us know how you go! This type of flexibility is definitely why we created the spancat in the first place.
