I am new to the spaCy/Prodigy ecosystem, so this is more of a general question: can the annotated files created with spancat (specifically those with overlapping spans) be used to fine-tune specialized models (like BiomedBERT) on Hugging Face (ideally with AutoTrain)? There is an implied (1) "could you do this" and a separate (2) "should you do this".
Context:
I initially found Prodigy when looking for tools that could properly label complex overlapping spans in medical text, because NER felt too restrictive for annotating the text and text classification was not precise enough for complex data.
The tradeoff of using a more precise tool like spancat seems to be that the downstream ecosystem is limited / existing model architectures are not set up for this. Specifically, I previously used text classification (binary and multi-label) datasets to fine-tune specialized models on Hugging Face (BiomedNLP-PubMedBERT-large-uncased-abstract) with very effective results. This process works quite well for simple text data (about 60% of our data). But a lot of medical data is not simple, and the 40% that could not be easily labeled causes a significant issue, which is why we are looking for a more precise tool like spancat to break the text data down better.
But this brings up the tradeoff: I can now label the text more accurately, and theoretically drive better results by using this spancat data to fine-tune a specialized transformer model, but I don't really understand whether the Lego pieces fit together. Previously, BiomedNLP-PubMedBERT-large-uncased-abstract worked fine with my simple text classification data, but I don't know whether it will continue to drive accurate results here. I will note that I did see this discussion showing you could do this; I am just not sure it actually works in practice, since the tokenizer the specialized model uses is simply different.
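To make my tokenizer worry concrete: my (possibly wrong) understanding is that the character-offset spans Prodigy produces would have to be re-aligned to whatever subword tokenization the model uses, something like the sketch below (the text, span, and label are made up; the model name is just the one I've been using):

```python
from transformers import AutoTokenizer

# Made-up Prodigy-style record: character-offset spans over the raw text
record = {
    "text": "Patient denies chest pain radiating to the left arm.",
    "spans": [{"start": 15, "end": 25, "label": "SYMPTOM"}],  # "chest pain"
}

tokenizer = AutoTokenizer.from_pretrained(
    "microsoft/BiomedNLP-PubMedBERT-large-uncased-abstract"
)

# return_offsets_mapping gives a (char_start, char_end) pair per subword
# token, which is what would let character spans survive retokenization
enc = tokenizer(record["text"], return_offsets_mapping=True)
for token, (start, end) in zip(enc.tokens(), enc["offset_mapping"]):
    inside = any(s["start"] <= start and end <= s["end"] for s in record["spans"])
    print(token, "SYMPTOM" if inside else "O")
```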
From my understanding, I would more than likely need to flatten the overlapping spans so that the model predicts just one label per token.
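Something like this naive flattening is what I have in mind; the "longest span wins" rule is just one arbitrary policy I made up, and that is exactly what worries me, since it throws away the nested annotation:

```python
def flatten_spans(spans):
    """Naive flattening: when spans overlap, keep only the longest one,
    so every token ends up with at most one label (char-offset dicts)."""
    kept = []
    # Consider longer spans first so they win any overlap
    for span in sorted(spans, key=lambda s: s["end"] - s["start"], reverse=True):
        if not any(span["start"] < k["end"] and k["start"] < span["end"] for k in kept):
            kept.append(span)
    return sorted(kept, key=lambda s: s["start"])

spans = [
    {"start": 15, "end": 25, "label": "SYMPTOM"},         # "chest pain"
    {"start": 15, "end": 51, "label": "SYMPTOM_DETAIL"},  # "chest pain radiating to the left arm"
]
print(flatten_spans(spans))  # only the longer SYMPTOM_DETAIL span survives
```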
As stated above, I really don't know if this is a "can you do this" type of question versus a "should you do this" one; or maybe it's a "you can do this, but you need to implement it in this specific way".
Breakdown of Question
With this context, breaking my question above down into its smaller parts:
- Compatibility: Is it possible to use non-spaCy specialized transformer models that have been pretrained on biomedical data, while also meeting the requirement to handle overlapping spans in the text?
- Annotation: If yes, what are the best practices for annotating data in Prodigy to be compatible with such transformer models?
- Fine-Tuning: Are there any recommended strategies for fine-tuning these specialized models so that they can work with overlapping spans? Can I just use AutoTrain?
- Trade-offs: If not, what trade-offs should I consider when choosing between specialized domain knowledge and the ability to handle overlapping spans?
- Alternative Approaches: Are there any workarounds or alternative approaches that can help achieve both requirements, such as flattening, or breaking each label out? (See the sketch after this list for what I mean by the latter.)
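On that last bullet: the concrete version of "breaking each label out" I keep imagining is per-token multi-label classification, where each token gets an independent 0/1 target per label instead of a single softmax class, so overlapping spans never need to be flattened. A rough sketch of how those targets might be built (entirely my own assumption of how this could work, not an established spaCy/Prodigy/HF recipe):

```python
import numpy as np
from transformers import AutoTokenizer

LABELS = ["SYMPTOM", "SYMPTOM_DETAIL"]  # made-up label set

tokenizer = AutoTokenizer.from_pretrained(
    "microsoft/BiomedNLP-PubMedBERT-large-uncased-abstract"
)

def multi_hot_targets(text, spans):
    """One independent 0/1 target per (token, label) pair, so a token
    covered by two overlapping spans simply gets two 1s in its row."""
    enc = tokenizer(text, return_offsets_mapping=True)
    targets = np.zeros((len(enc["offset_mapping"]), len(LABELS)), dtype=np.int64)
    for i, (start, end) in enumerate(enc["offset_mapping"]):
        if start == end:  # skip special tokens like [CLS] / [SEP]
            continue
        for span in spans:
            if start < span["end"] and span["start"] < end:
                targets[i, LABELS.index(span["label"])] = 1
    return enc, targets

enc, targets = multi_hot_targets(
    "Patient denies chest pain radiating to the left arm.",
    [
        {"start": 15, "end": 25, "label": "SYMPTOM"},
        {"start": 15, "end": 51, "label": "SYMPTOM_DETAIL"},
    ],
)
for token, row in zip(enc.tokens(), targets):
    print(token, row)
```

As far as I can tell, a stock token-classification head (and, I assume, AutoTrain's token-classification task) expects exactly one label per token under a softmax, so this would presumably need a custom head trained with a per-label binary loss (e.g. `BCEWithLogitsLoss`) rather than AutoTrain out of the box; please correct me if that's wrong.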
I know the spaCy ecosystem is vast, so this may be more of a "getting pointed in the right direction" type of question, as well as one about general best practices / how to think about the spaCy ecosystem.