Convert spancat annotations for use with transformer model

Hey,

I would like to be able to use e.g. bert-large-cased or spanBERT as the embedding layer for a span categorisation task. Importantly, these spans will often be overlapping. However, I can't seem to find examples of how to convert my Prodigy annotations specifically for span categorisation (I can see examples for NER tasks) into compatible annotations for use with these transformer models.

Could someone point me in the right direction please?

Thanks,

Darren

The output from either recipe is very similar.

Suppose I annotate this one example:

{"text": "Hi. My name is Vincent"}

Via both of these interfaces:

# NER interface
python -m prodigy ner.manual example-ner blank:en examples.jsonl --label name
# SPANCAT interface
python -m prodigy spans.manual example-span blank:en examples.jsonl --label name

Then the output is nearly identical.

NER

This is the output from python -m prodigy db-out example-ner:

{"text":"Hi. My name is Vincent","_input_hash":1333440749,"_task_hash":39342451,"_is_binary":false,"tokens":[{"text":"Hi","start":0,"end":2,"id":0,"ws":false},{"text":".","start":2,"end":3,"id":1,"ws":true},{"text":"My","start":4,"end":6,"id":2,"ws":true},{"text":"name","start":7,"end":11,"id":3,"ws":true},{"text":"is","start":12,"end":14,"id":4,"ws":true},{"text":"Vincent","start":15,"end":22,"id":5,"ws":false}],"_view_id":"ner_manual","spans":[{"start":15,"end":22,"token_start":5,"token_end":5,"label":"name"}],"answer":"accept","_timestamp":1658923692}

SPANCAT

This is the output from python -m prodigy db-out example-span:

{"text":"Hi. My name is Vincent","_input_hash":1333440749,"_task_hash":39342451,"tokens":[{"text":"Hi","start":0,"end":2,"id":0,"ws":false},{"text":".","start":2,"end":3,"id":1,"ws":true},{"text":"My","start":4,"end":6,"id":2,"ws":true},{"text":"name","start":7,"end":11,"id":3,"ws":true},{"text":"is","start":12,"end":14,"id":4,"ws":true},{"text":"Vincent","start":15,"end":22,"id":5,"ws":false}],"_view_id":"spans_manual","spans":[{"start":15,"end":22,"token_start":5,"token_end":5,"label":"name"}],"answer":"accept","_timestamp":1658923725}

Spans

In particular, you'll notice that the spans are identical.

"spans":[{"start":15,"end":22,"token_start":5,"token_end":5,"label":"name"}]
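If you want to verify this on your own exports, here is a minimal Python sketch. It assumes you paste one line from each db-out export into a string; the two records below are trimmed copies of the ones above (tokens and hashes left out for brevity), and the variable names are only illustrative.

```python
import json

# Trimmed copies of the two db-out records above (tokens and hashes omitted).
ner_line = (
    '{"text":"Hi. My name is Vincent","_is_binary":false,"_view_id":"ner_manual",'
    '"spans":[{"start":15,"end":22,"token_start":5,"token_end":5,"label":"name"}],'
    '"answer":"accept","_timestamp":1658923692}'
)
span_line = (
    '{"text":"Hi. My name is Vincent","_view_id":"spans_manual",'
    '"spans":[{"start":15,"end":22,"token_start":5,"token_end":5,"label":"name"}],'
    '"answer":"accept","_timestamp":1658923725}'
)

ner_task = json.loads(ner_line)
span_task = json.loads(span_line)

# The part the trainer learns from: the annotated spans.
spans_match = ner_task["spans"] == span_task["spans"]

# Keys whose values differ (or that exist in only one record).
differing_keys = sorted(
    key
    for key in set(ner_task) | set(span_task)
    if ner_task.get(key) != span_task.get(key)
)

print("spans identical:", spans_match)
print("keys that differ:", differing_keys)
```

In this example the only differences are bookkeeping fields, not the annotations themselves.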

In fact, in this case, you can even run:

python -m prodigy train --spancat example-ner
python -m prodigy train --spancat example-span

As far as spancat is concerned, it just tries to learn from the annotated "spans". The main difference is that NER doesn't allow spans to overlap, while spancat does. So I don't think you need to worry about a "translation" step when going from NER annotations to span annotations.
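To make the overlap point concrete, here is a rough Python sketch that lists overlapping span pairs in a single exported record. The helper name overlapping_span_pairs and the "intro" label are made up for illustration; the only assumption about the data is that "start" and "end" are character offsets with an exclusive end, as in the export above.

```python
from itertools import combinations


def overlapping_span_pairs(task):
    """Return pairs of spans in one Prodigy-style record whose character ranges overlap."""
    spans = task.get("spans", [])
    pairs = []
    for first, second in combinations(spans, 2):
        # Half-open ranges [start, end) overlap when each one starts before the other ends.
        if first["start"] < second["end"] and second["start"] < first["end"]:
            pairs.append((first, second))
    return pairs


# The record from above: a single span, so nothing can overlap.
flat_task = {
    "text": "Hi. My name is Vincent",
    "spans": [{"start": 15, "end": 22, "label": "name"}],
}

# A hypothetical record with a second, wider span that contains the first one.
nested_task = {
    "text": "Hi. My name is Vincent",
    "spans": [
        {"start": 15, "end": 22, "label": "name"},
        {"start": 4, "end": 22, "label": "intro"},
    ],
}

print(len(overlapping_span_pairs(flat_task)))
print(len(overlapping_span_pairs(nested_task)))
```

A dataset where this never finds a pair could be used for either component; as soon as it finds one, the data only makes sense for spancat.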

The only part that I might be worried about is the tokenizer. There might be some edge cases if you're interested in fine-tuning a transformer model. Are there any issues that you've come across while training? If so, could you share the commands that you tried to run with the error message?
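One cheap sanity check you could run before annotating a lot of data is to confirm that every span starts and ends on a token boundary of the tokens stored in the record. This is only a sketch: misaligned_spans is a hypothetical helper, and it checks against the "tokens" list in the export, not against a transformer's wordpieces.

```python
def misaligned_spans(task):
    """Return spans whose character offsets do not line up with the record's token boundaries."""
    token_starts = {token["start"] for token in task["tokens"]}
    token_ends = {token["end"] for token in task["tokens"]}
    return [
        span
        for span in task.get("spans", [])
        if span["start"] not in token_starts or span["end"] not in token_ends
    ]


# Tokens and span copied from the exported record above.
task = {
    "text": "Hi. My name is Vincent",
    "tokens": [
        {"text": "Hi", "start": 0, "end": 2, "id": 0, "ws": False},
        {"text": ".", "start": 2, "end": 3, "id": 1, "ws": True},
        {"text": "My", "start": 4, "end": 6, "id": 2, "ws": True},
        {"text": "name", "start": 7, "end": 11, "id": 3, "ws": True},
        {"text": "is", "start": 12, "end": 14, "id": 4, "ws": True},
        {"text": "Vincent", "start": 15, "end": 22, "id": 5, "ws": False},
    ],
    "spans": [{"start": 15, "end": 22, "token_start": 5, "token_end": 5, "label": "name"}],
}

# A deliberately broken copy: the span starts one character into "Vincent".
broken_task = dict(task, spans=[{"start": 16, "end": 22, "label": "name"}])

print(misaligned_spans(task))
print(misaligned_spans(broken_task))
```

If this reports spans for real data, the tokenizer used during annotation probably differs from the one that produced the offsets, which is the kind of edge case worth catching early.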

Thanks for the response.

I've not begun training yet as I wanted to make sure my annotations were going to be tokenized correctly for the transformer model and task before I wasted too much time annotating.

I assume I can use something like this with a custom tokenizer for the transformer model I choose when making the annotations?

And then, when it comes to training, do I just point to the transformer model directory in the spaCy training config file, so that it creates the transformer component with the spancat component fine-tuned on top?

Thanks,

Darren

Hi @DGMS90!

Yep - that would be the best reference to use.

Yep. I'd recommend looking through spaCy community issues, like the one below, that detail how to set up the config for transformer + spancat.

Hope this helps and let us know if you have further questions!

Thanks very much to both. The work you're doing is great - I really appreciate it.