Spancat + Textcat


I've had a look around regarding this usecase but it doesn't seem to be too well-documented. I was wondering if there's a possible way to pipe the outputs of a SpanCat model (i.e. Docs with identified Spans) into Prodigy, and then complete annotations for a TextCat model, e.g. whether the appearance of a particular entity is from Context A or Context B.

Steps Taken So Far

  1. I have labelled some datasets with the spans.manual recipe
  2. I have trained a SpanCat model on these datasets
  3. I have corrected the model using spans.correct

What I Would Like to Happen Now

  1. I use the outputs of the SpanCat model to go through the annotated datasets again, this time providing annotations on the contexts of a particular entity
    a) Ideally this would be in the form of, "What is the context of this extracted entity? A, B, or C?"
  2. Or, I revise the annotation method - and provide annotations for both the SpanCat and TextCat model in the first pass.

Classifying the whole origin text/sentence of the entities won't work as there might be multiple entities in each text, each with different contexts.

For a more concrete example, if I'm extracting Equipment and the context the equipment is being used, an example text might be "Mr X needs a wheelchair but is currently using a zimmer frame", I would like to indicate for the extracted entities of wheelchair and zimmer frame that the respective classifications are EQUIPMENT_NEEDED and EQUIPMENT_USED. Ideally there would exist a workflow within Prodigy that would facilitate labelling the latter as needed.

Many thanks

I think I've managed to create a custom recipe to surface the SpanCat entities within a choice interface.

My next question is, is there any way of attaching the output of the choice to the specific entity being highlighted? I already have the recipe set up so that it only shows me one highlighted entity at a time - so I would imagine it might be possible to append to the properties of the highlighted Span.
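In case it helps anyone with the same question, one way to do this (a minimal sketch, assuming each task has been pre-split to contain exactly one span, and a single-choice setup so `"accept"` holds at most one option ID; the function name and the `context_label` key are my own invention, not a Prodigy convention):

```python
def attach_choice_to_span(example):
    """Copy the accepted choice label onto the task's single highlighted span.

    Assumes the choice interface was used with choice_style "single", so
    example["accept"] holds at most one option ID, and that the task was
    pre-split to contain exactly one span.
    """
    accepted = example.get("accept", [])
    spans = example.get("spans", [])
    if accepted and spans:
        # Store the textcat-style context label on the span itself
        spans[0]["context_label"] = accepted[0]
    return example

task = {
    "text": "Mr X needs a wheelchair",
    "spans": [{"start": 13, "end": 23, "label": "EQUIPMENT"}],
    "accept": ["EQUIPMENT_NEEDED"],
}
attach_choice_to_span(task)
```

This could run in a `before_db` callback or as a post-processing step over the exported dataset.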

Hi @monsoon ,

Apologies for the late reply - I've only just got to it, but it looks like you've managed very well :+1:
As a general note, you can combine "views" using the blocks interface, and as long as your task contains all the information expected by the different components of the blocks, it should be rendered correctly.
As for assigning the textcat label to a particular spancat label, I was going to suggest splitting the stream so that there's one span per task. Which, again, is what you did. That's the usual way to do it - it also helps the annotator focus on one span at a time. We have a helper for that, split_spans - hope you found it in time.
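For readers landing here later, the splitting itself can be sketched in plain Python. This is my own minimal stand-in, not the signature of Prodigy's `split_spans` helper:

```python
import copy

def one_span_per_task(stream):
    """Yield one copy of each incoming task per span, so the annotator
    is shown a single highlighted span at a time."""
    for task in stream:
        for span in task.get("spans", []):
            new_task = copy.deepcopy(task)
            new_task["spans"] = [span]
            yield new_task

tasks = [{
    "text": "Mr X needs a wheelchair but is currently using a zimmer frame",
    "spans": [
        {"start": 13, "end": 23, "label": "EQUIPMENT"},
        {"start": 49, "end": 61, "label": "EQUIPMENT"},
    ],
}]
split = list(one_span_per_task(tasks))
```

Each resulting task keeps the full text, so each span is still shown in its original context.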

Apologies for my delay in response - have been off over the weekend!

Thank you for the pointer on using the blocks interface, will give that a whirl.

And arghh, dang - wish I'd seen split_spans earlier - it might have saved some headache. As it stands, I've had to implement some custom logic which effectively selects a span from a piece of text, plus a window of tokens surrounding that span. I think this will work well enough for a proof-of-concept.
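In case it's useful to anyone following along, that token-window idea can be sketched roughly like this - a simplified whitespace-tokenized version, not the exact custom logic described above:

```python
def span_window(text, span, n_tokens=3):
    """Return a substring containing the span plus roughly n_tokens
    whitespace-delimited tokens of context on each side, with the span's
    offsets rebased to the new substring."""
    # Compute character offsets of whitespace-delimited tokens
    tokens, pos = [], 0
    for tok in text.split():
        start = text.index(tok, pos)
        tokens.append((start, start + len(tok)))
        pos = start + len(tok)
    # Find the tokens overlapping the span
    inside = [i for i, (s, e) in enumerate(tokens)
              if s < span["end"] and e > span["start"]]
    lo = max(0, inside[0] - n_tokens)
    hi = min(len(tokens) - 1, inside[-1] + n_tokens)
    start, end = tokens[lo][0], tokens[hi][1]
    new_span = {**span, "start": span["start"] - start,
                "end": span["end"] - start}
    return text[start:end], new_span

text = "Mr X needs a wheelchair but is currently using a zimmer frame"
window, sp = span_window(
    text, {"start": 13, "end": 23, "label": "EQUIPMENT"}, n_tokens=2)
```

A real implementation would probably use spaCy's tokenization rather than whitespace splitting, but the offset bookkeeping is the same.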

To follow on from this I had another question (which may be too implementation-specific, but worth asking anyway). Assuming I'm now going through a stream where I'm presented with a highlighted span, and I'm selecting the most appropriate TextCat label for the span - would you have any recommendations on how to avoid confusing the classifier if two separate spans are contained very close to each other in the origin sentence, but with differing TextCat labels for their respective contexts?

For instance, in the example of "makes use of wheelchair but walking stick not used", I have two separate spans appearing in two separate contexts. In the stream, however, I would probably get the entire sentence shown for both instances of a span. If I mark EQUIPMENT_USED for the first span, and EQUIPMENT_NOT_USED for the second span, I run the risk of providing essentially conflicting information to the classifier.

My feeling is that this will probably be more of an edge case, best ignored in the annotation interface, and that if I wanted to navigate this issue I could perhaps employ some kind of clause segmentation on top of the already-existing sentence segmentation (segmenting further on conjunctions, commas, etc.).
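A rough regex-based sketch of that clause-segmentation idea - splitting on commas and a few coordinating conjunctions, then keeping the clause that contains the span. The boundary list is a naive assumption; a real implementation would more likely use spaCy's dependency parse:

```python
import re

# Naive clause boundaries: commas and a few coordinating conjunctions
BOUNDARY = re.compile(r",|\b(?:but|and|or|whereas)\b")

def clause_for_span(text, span):
    """Return the clause containing the span, with offsets rebased."""
    cuts = [0] + [m.start() for m in BOUNDARY.finditer(text)] + [len(text)]
    for start, end in zip(cuts, cuts[1:]):
        if start <= span["start"] and span["end"] <= end:
            clause = text[start:end]
            # Drop leading commas/spaces without breaking the offsets
            trimmed = clause.lstrip(", ")
            offset = start + (len(clause) - len(trimmed))
            new_span = {**span, "start": span["start"] - offset,
                        "end": span["end"] - offset}
            return trimmed.rstrip(), new_span
    # Span straddles a boundary: fall back to the whole text
    return text, dict(span)

text = "makes use of wheelchair but walking stick not used"
c1, s1 = clause_for_span(text, {"start": 13, "end": 23})
c2, s2 = clause_for_span(text, {"start": 28, "end": 41})
```

Keeping the conjunction at the head of the second clause ("but walking stick not used") is deliberate here, since the contrast word carries exactly the signal the classifier needs.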

Or perhaps the window I use to provide a subsection of the sentence stops at the boundary of another span? But then I suppose this would run the risk of breaking on a sentence like "He uses a wheelchair, walking stick, and a zimmer frame" whereby all spans are tied to the one classification of EQUIPMENT_USED, and all spans appear adjacently.

Again, I would greatly appreciate any suggestions.

I guess what I'm also trying to ascertain is: if I've identified two spans in the sentence "The wet room needs a handrail.", is there an effective way of providing/formatting labels for a TextCat model that would allow it to classify the differing contexts of the two spans? i.e. that the wet room actually exists, and the handrail doesn't exist yet.

With the correct prompt, I can elicit the right response from a GPT-4 model, so I'm thinking there must be a way of correctly annotating the dataset within Prodigy to subsequently train a lightweight spaCy model.

Hi @monsoon,

I think you're right in your concern about ending up with conflicting labels for the textcat.
One way would be to try to split the sentences into meaningful clauses, where each clause contains the information about a single span - but you've already observed that this isn't always viable.
Another method could be to mask the other span mentions in each example, so that the model only sees the lexical realization of one entity but not the others. For example,
" makes use of wheelchair but walking stick not used"
would be translated into 2 masked examples like so:
" makes use of wheelchair but EQUIPMENT not used"
" makes use of EQUIPMENT but walking stick not used"
This is also how the examples should be annotated - that is, any splitting into clauses and masking should be applied to the Prodigy input (e.g. via an external preprocessing script), and of course at inference time as well.
Actually, I think we have a very detailed description of how such a solution could be implemented within the spaCy/Prodigy framework in this blog post. This project addresses a very similar challenge to yours and provides solutions for clause splitting and masking. I think that approach is worth testing in your case. I'll let you have a look at the project (the blog contains links to repos and demos) and let's take it from there.
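To make the masking idea concrete, here's a rough sketch that turns one multi-span example into one masked example per span. The helper name and the placeholder string are my own assumptions:

```python
def mask_other_spans(text, spans, placeholder="EQUIPMENT"):
    """Produce one example per span, with every *other* span's text
    replaced by a placeholder so the model only sees one entity."""
    examples = []
    for keep in spans:
        parts, cursor = [], 0
        new_start = keep["start"]
        for sp in sorted(spans, key=lambda s: s["start"]):
            if sp is keep:
                continue
            parts.append(text[cursor:sp["start"]])
            parts.append(placeholder)
            # Masking a span before the kept one shifts its offsets
            if sp["end"] <= keep["start"]:
                new_start += len(placeholder) - (sp["end"] - sp["start"])
            cursor = sp["end"]
        parts.append(text[cursor:])
        span_len = keep["end"] - keep["start"]
        examples.append({
            "text": "".join(parts),
            "spans": [{**keep, "start": new_start,
                       "end": new_start + span_len}],
        })
    return examples

text = "makes use of wheelchair but walking stick not used"
spans = [
    {"start": 13, "end": 23, "label": "EQUIPMENT"},
    {"start": 28, "end": 41, "label": "EQUIPMENT"},
]
masked = mask_other_spans(text, spans)
```

Run over the example above, this yields the two masked examples shown, each with the kept span's character offsets adjusted to the new text.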

Hi @magdaaniol,

Wow, thank you for linking that blog post - it's incredibly informative, and is pretty much 1:1 with what we're attempting to achieve. I'll have a read of it, digest it, (try to) implement it, and let you know how it goes.

I think solving the annotation problem and progressing with the project would likely put the content of this thread beyond the remit of Prodigy support - but I will likely appear again in a couple of weeks to give a little update!

Many thanks :slight_smile:
