Integrating SpanCat with HuggingFace, specifically AutoTrain

I am new to the spaCy/Prodigy ecosystem, so this is more of a general question: can the annotated files created with spancat, specifically those containing overlapping spans, be used to fine-tune specialized models (like BiomedBERT) on Hugging Face (ideally with AutoTrain)? There is an implied (1) "could you do this" and a separate (2) "should you do this".

I initially found Prodigy when looking for tools that could properly label complex overlapping spans in medical text because NER felt too restrictive to properly annotate the text and text classification was not a precise enough tool for complex data.

The tradeoff of using a more precise tool like spancat seems to be that the downstream ecosystem is limited and existing model architectures are not set up for it. Specifically, I previously used text classification (binary and multi-label) datasets to fine-tune specialized models on Hugging Face (BiomedNLP-PubMedBERT-large-uncased-abstract) with very effective results. This process works quite well for simple text data (about 60% of our data). But a lot of medical data is not simple, and the 40% that cannot be easily labeled causes a significant issue, which is why we are looking for a more precise tool like spancat to break down the text better.

But this brings up the tradeoff: I can now label the text more accurately, and theoretically get better results by using this spancat data to fine-tune a specialized transformer model, but I don't really understand whether the lego pieces fit together. Previously, BiomedNLP-PubMedBERT-large-uncased-abstract fit fine with my simple text classification data, but I don't know whether it will continue to give accurate results with span data. I will note that I did see this discussion showing you could do this; I am just not sure it actually works, since the tokenizer the specialized model uses is different from spaCy's.
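
To make the tokenizer concern concrete, here is a minimal sketch of how character-offset spans could be mapped onto a different tokenizer's tokens. It assumes the model's tokenizer can report a `(start_char, end_char)` pair per token (Hugging Face fast tokenizers do this with `return_offsets_mapping=True`) and that special tokens come back as `(0, 0)`. The function name, the example text, and the hand-written offsets are all hypothetical; nothing here is taken from Prodigy or AutoTrain.

```python
from typing import Dict, List, Tuple


def spans_to_token_indices(
    offsets: List[Tuple[int, int]],
    spans: List[Dict],
) -> List[Dict]:
    """Map character-offset spans onto inclusive token index ranges.

    offsets: one (start_char, end_char) pair per token; special tokens
             such as [CLS] and [SEP] are assumed to be (0, 0).
    spans:   dicts with "start" and "end" character offsets and a "label".
    """
    aligned = []
    for span in spans:
        # A token belongs to the span if its character range overlaps it.
        covered = [
            i
            for i, (tok_start, tok_end) in enumerate(offsets)
            if tok_end > tok_start  # skip zero-width special tokens
            and tok_start < span["end"]
            and tok_end > span["start"]
        ]
        if covered:
            aligned.append(
                {
                    "label": span["label"],
                    "token_start": covered[0],
                    "token_end": covered[-1],
                }
            )
    return aligned


# Hypothetical example: offsets written by hand for "acute renal failure",
# with a special token at each end.
offsets = [(0, 0), (0, 5), (6, 11), (12, 19), (0, 0)]
spans = [
    {"start": 0, "end": 19, "label": "DIAGNOSIS"},
    {"start": 6, "end": 11, "label": "ANATOMY"},
]
aligned = spans_to_token_indices(offsets, spans)
```

Because the mapping goes through character offsets, the annotation tokenizer and the model tokenizer do not have to agree; the overlap between the two spans is still present after alignment and has to be handled separately.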

From my understanding, I would more than likely need to flatten the overlapping spans so that the model predicts just one category per token.
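
As a rough sketch of what flattening could look like: the snippet below turns overlapping spans into one BIO tag per token by keeping the longest span wherever spans collide. The "longest span wins" rule is only one possible policy and is an assumption of this example, as are the function name and the sample data. Spans are assumed to carry inclusive `token_start` and `token_end` indices.

```python
from typing import Dict, List


def flatten_to_bio(n_tokens: int, spans: List[Dict]) -> List[str]:
    """Flatten possibly overlapping spans into a single BIO tag per token."""
    tags = ["O"] * n_tokens
    # Longest spans first; an earlier start breaks ties.
    ordered = sorted(
        spans,
        key=lambda s: (-(s["token_end"] - s["token_start"]), s["token_start"]),
    )
    for span in ordered:
        positions = range(span["token_start"], span["token_end"] + 1)
        if any(tags[i] != "O" for i in positions):
            # Collides with a span that was already kept, so this
            # annotation is dropped. This is where information is lost.
            continue
        for i in positions:
            prefix = "B" if i == span["token_start"] else "I"
            tags[i] = f"{prefix}-{span['label']}"
    return tags


# Hypothetical example sentence and annotations.
tokens = ["acute", "renal", "failure", "treated", "with", "dialysis"]
spans = [
    {"token_start": 0, "token_end": 2, "label": "DIAGNOSIS"},
    {"token_start": 1, "token_end": 1, "label": "ANATOMY"},
    {"token_start": 5, "token_end": 5, "label": "TREATMENT"},
]
bio_tags = flatten_to_bio(len(tokens), spans)
```

Here the nested ANATOMY span disappears from the training data, which is exactly the cost of flattening: the result fits a standard token classification setup, but the overlap that motivated spancat is gone.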

As stated above, I really don't know if this is a "can you do this" type of question versus a "should you do this", or maybe it's a "you can do this, but you need to implement it in this specific way".

Breakdown of Question
With this context, to break down my above question into its smaller parts:

  1. Compatibility: Is it possible to use non-spaCy specialized transformer models that have been pretrained on biomedical data, while also meeting the requirement to handle overlapping spans in the text?
  2. Annotation: If yes, what are the best practices for annotating data in Prodigy to be compatible with such transformer models?
  3. Fine-Tuning: Are there any recommended strategies for fine-tuning these specialized models so that they can work with overlapping spans? Can I just use AutoTrain?
  4. Trade-offs: If not, what trade-offs should I consider when choosing between specialized domain knowledge and the ability to handle overlapping spans?
  5. Alternative Approaches: Are there any workarounds or alternative approaches that can help achieve both requirements? Such as flattening? Or breaking each label out?
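
For the "breaking each label out" idea in item 5, one possible shape for the data is a multi-hot target per token, where each label gets its own 0/1 column so a token can belong to several spans at once. This is a sketch under assumptions: the function name and sample data are hypothetical, spans use inclusive `token_start`/`token_end` indices, and the thread does not establish whether AutoTrain accepts targets in this form.

```python
from typing import Dict, List


def spans_to_multihot(
    n_tokens: int, spans: List[Dict], labels: List[str]
) -> List[List[int]]:
    """Build an n_tokens x n_labels matrix of 0/1 flags from spans."""
    label_index = {label: j for j, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in range(n_tokens)]
    for span in spans:
        j = label_index[span["label"]]
        for i in range(span["token_start"], span["token_end"] + 1):
            matrix[i][j] = 1  # overlapping spans simply set several columns
    return matrix


# Hypothetical example: "acute renal failure" with a nested span.
labels = ["DIAGNOSIS", "ANATOMY"]
spans = [
    {"token_start": 0, "token_end": 2, "label": "DIAGNOSIS"},
    {"token_start": 1, "token_end": 1, "label": "ANATOMY"},
]
targets = spans_to_multihot(3, spans, labels)
```

Unlike flattening, nothing is dropped, but the model then needs an independent yes/no decision per label for each token rather than a single choice among classes, and two adjacent spans with the same label can no longer be told apart.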

I know the spaCy ecosystem is vast, so this may be more of a "getting pointed in the right direction" type of question, as well as a question about general best practices and how to think about the spaCy ecosystem.

My understanding is that the spancat component in spaCy can still use a transformer under the hood. The transformer would still supply the embeddings that are used to make the classification.

It's hard for me to give advice on the biomedical domain and how different architectures might be better or worse. The most honest advice that I can give is that the entire process requires iteration. You're going to learn by iterating on your data and your model. It's usually a cycle of training your first model, seeing where it makes mistakes, and making improvements based on those lessons.

I suppose another bit of general advice: it might help to focus on one kind of span first, preferably one that is relatively easy but can already be helpful in a business context.

Does this help? If not I'll gladly dive in more deeply, but I want to be careful with making suggestions on what might work best because a lot of that depends on your dataset/task.

Hi there.

Since this thread discusses the training of NER transformer models, I figured I might ping and make anybody who is reading this in the future aware that we now have a Prodigy-HF plugin that comes with a hf.train.ner recipe. This recipe allows you to train a transformer model directly. That means that, as opposed to a spaCy pipeline, you'll get a specific model that only does NER. But it has been a common feature request because a lot of folks seem interested in training on top of very specific pre-trained transformer models that spaCy may not directly support.

If you want to learn more you can check out our updated docs. If you want to customise the training further you can also use these recipes as a place to start and customise.