Convert spancat annotations for use with transformer model


I would like to be able to use e.g. bert-large-cased or spanBERT as the embedding layer for a span categorisation task. Importantly, these spans will often be overlapping. However, I can't find examples of how to convert my Prodigy annotations for span categorisation specifically (I can see examples for NER tasks) into annotations compatible with these transformer models.

Could someone point me in the right direction please?



The output from either recipe is very similar.

Suppose I annotate this one example:

{"text": "My name is Vincent"}

Via both of these interfaces:

# NER interface
python -m prodigy ner.manual example-ner blank:en examples.jsonl --label name
# SPANCAT interface
python -m prodigy spans.manual example-span blank:en examples.jsonl --label name

Then the output is nearly identical.


This is the output from python -m prodigy db-out example-ner:

{
  "text": "Hi. My name is Vincent",
  "_input_hash": 1333440749,
  "_task_hash": 39342451,
  "_is_binary": false,
  "tokens": [
    {"text": "Hi", "start": 0, "end": 2, "id": 0, "ws": false},
    {"text": ".", "start": 2, "end": 3, "id": 1, "ws": true},
    {"text": "My", "start": 4, "end": 6, "id": 2, "ws": true},
    {"text": "name", "start": 7, "end": 11, "id": 3, "ws": true},
    {"text": "is", "start": 12, "end": 14, "id": 4, "ws": true},
    {"text": "Vincent", "start": 15, "end": 22, "id": 5, "ws": false}
  ],
  "_view_id": "ner_manual",
  "spans": [
    {"start": 15, "end": 22, "token_start": 5, "token_end": 5, "label": "name"}
  ],
  "answer": "accept",
  "_timestamp": 1658923692
}


This is the output from python -m prodigy db-out example-span:

{
  "text": "Hi. My name is Vincent",
  "_input_hash": 1333440749,
  "_task_hash": 39342451,
  "tokens": [
    {"text": "Hi", "start": 0, "end": 2, "id": 0, "ws": false},
    {"text": ".", "start": 2, "end": 3, "id": 1, "ws": true},
    {"text": "My", "start": 4, "end": 6, "id": 2, "ws": true},
    {"text": "name", "start": 7, "end": 11, "id": 3, "ws": true},
    {"text": "is", "start": 12, "end": 14, "id": 4, "ws": true},
    {"text": "Vincent", "start": 15, "end": 22, "id": 5, "ws": false}
  ],
  "_view_id": "spans_manual",
  "spans": [
    {"start": 15, "end": 22, "token_start": 5, "token_end": 5, "label": "name"}
  ],
  "answer": "accept",
  "_timestamp": 1658923725
}


In particular, you'll notice that the spans are identical.
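One detail worth highlighting: the "start" and "end" values in "spans" are character offsets into "text", so you can always recover the annotated string with a plain slice. A minimal sketch using the span from the db-out example above:

```python
import json

# One line of db-out output (trimmed to the relevant keys).
line = '{"text": "Hi. My name is Vincent", "spans": [{"start": 15, "end": 22, "token_start": 5, "token_end": 5, "label": "name"}]}'
eg = json.loads(line)

# Slice the text with the character offsets to get the annotated span.
for span in eg["spans"]:
    print(span["label"], "->", eg["text"][span["start"]:span["end"]])
# name -> Vincent
```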


In fact, in this case, you can even run:

python -m prodigy train --spancat example-ner
python -m prodigy train --spancat example-span

As far as spancat is concerned, it just learns from the annotated "spans". The main difference is that NER won't allow overlapping spans, while spancat does. So I don't think you'll need to worry about a "translation" when going from NER to span annotations.
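To make the overlap point concrete, here's a small sketch (the function name is my own, not a Prodigy or spaCy API) that checks an exported example for overlapping spans; if any exist, the data is only usable for spancat training, not NER:

```python
def has_overlap(spans):
    """Return True if any two character-offset spans overlap."""
    ordered = sorted(spans, key=lambda s: (s["start"], s["end"]))
    # After sorting, an overlap exists iff a span ends after the next one starts.
    return any(a["end"] > b["start"] for a, b in zip(ordered, ordered[1:]))

eg = {"text": "Hi. My name is Vincent",
      "spans": [{"start": 4, "end": 22, "label": "name_phrase"},
                {"start": 15, "end": 22, "label": "name"}]}
print(has_overlap(eg["spans"]))  # True -> train with spancat, not NER
```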

The only part that I might be worried about is the tokenizer. There might be some edge cases if you're interested in fine-tuning a transformer model. Are there any issues that you've come across while training? If so, could you share the commands that you tried to run with the error message?
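Since Prodigy exports a "tokens" list alongside each span, one sanity check you could run before committing to a different tokenizer is whether every span starts and ends exactly on a token boundary. This is a rough sketch under my own naming, not a built-in helper:

```python
def span_aligns(eg, span):
    """Check that a span's character offsets land on token boundaries."""
    starts = {t["start"] for t in eg["tokens"]}
    ends = {t["end"] for t in eg["tokens"]}
    return span["start"] in starts and span["end"] in ends

eg = {"text": "Hi. My name is Vincent",
      "tokens": [{"text": "Hi", "start": 0, "end": 2},
                 {"text": ".", "start": 2, "end": 3},
                 {"text": "My", "start": 4, "end": 6},
                 {"text": "name", "start": 7, "end": 11},
                 {"text": "is", "start": 12, "end": 14},
                 {"text": "Vincent", "start": 15, "end": 22}],
      "spans": [{"start": 15, "end": 22, "label": "name"}]}
print(all(span_aligns(eg, s) for s in eg["spans"]))  # True
```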

Thanks for the response.

I've not begun training yet as I wanted to make sure my annotations were going to be tokenized correctly for the transformer model and task before I wasted too much time annotating.

I assume I can use something like this with a custom tokenizer for the transformer model I choose when making the annotations?

And then do I just point to the transformer model directory in the spacy training config file when it comes to training to create the transformer component then the spancat component finetuned on top?



Hi @DGMS90!

Yep - that would be the best reference to use.

Yep. I'd recommend looking through spaCy community issues that detail how to appropriately set up a config for transformer + spancat.
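As a rough, non-authoritative sketch of the pieces involved (generate a real starting point with `python -m spacy init config` and adjust; the model name here is just an example), the key idea is that the spancat component's tok2vec listens to a shared transformer component:

```ini
[components.transformer]
factory = "transformer"

[components.transformer.model]
@architectures = "spacy-transformers.TransformerModel.v3"
name = "bert-large-cased"

[components.spancat]
factory = "spancat"

[components.spancat.model.tok2vec]
@architectures = "spacy-transformers.TransformerListener.v1"
grad_factor = 1.0

[components.spancat.model.tok2vec.pooling]
@layers = "reduce_mean.v1"
```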

Hope this helps and let us know if you have further questions!


Thanks very much to both. The work you're doing is great - I really appreciate it.