Efficient Named Entity Recognition Self-Learning in Medical Text Structuring with Limited Annotations

We work on structuring textual information in medical scenarios. Because there are many projects, thousands of entities may be involved, and annotating them is costly. Fortunately, we already have some transcribed data: it contains the final result text, but without start and end offsets. What methods can we use to achieve rapid self-learning for Named Entity Recognition (NER)?

Hi @tianchiguaixia,

Thanks for your question and welcome to the Prodigy community :wave:

There are several methods you can employ to achieve rapid self-learning for Named Entity Recognition (NER) using your existing transcribed data. Here are a few suggestions:

  1. Transfer Learning: Start from a model pre-trained on a large corpus of text and fine-tune it on your specific task. This can significantly reduce the amount of annotated training data required and speed up the learning process.
  2. Active Learning: Prodigy, the tool this forum is dedicated to, is built around the idea of active learning. You start by training a model with a small amount of annotated data, then use the model to suggest the most uncertain examples to annotate next. This way, you're always focusing on the data that can teach your model the most.
  3. Semi-Supervised Learning: This involves using a small amount of labeled data and a large amount of unlabeled data for training. The idea is to use the labeled data to train an initial model and then use this model to label the unlabeled data. The newly labeled data can then be used for further training. This can be a very effective way to leverage your existing transcribed data.
  4. Bootstrapping: This is a technique where you start with a small set of seed examples and then iteratively train a model and use it to label more data. The newly labeled data is then added to the training set and the process is repeated. This can be a very effective way to rapidly increase the amount of annotated data.
  5. Rule-Based Methods: If there are certain patterns in your data that can be captured by rules, you can use these rules to generate additional training data. This can be a quick and effective way to generate annotated data.
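To make the uncertainty-sampling idea behind active learning (point 2) concrete, here is a minimal pure-Python sketch. The per-token probabilities and the `rank_by_uncertainty` helper are illustrative inventions, not a Prodigy API — in practice, Prodigy's built-in recipes such as `ner.teach` do this selection for you:

```python
import math

def token_entropy(probs):
    """Entropy of one token's label distribution (higher = less certain)."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def rank_by_uncertainty(examples):
    """examples: list of (text, per-token label probabilities).
    Order texts by their most-uncertain token, so annotators always
    see the sentences the current model can learn the most from."""
    scored = [
        (max(token_entropy(probs) for probs in token_probs), text)
        for text, token_probs in examples
    ]
    return [text for _, text in sorted(scored, reverse=True)]

# Toy scores: the model is confident about the first sentence,
# torn between two labels on the second.
examples = [
    ("aspirin 100 mg daily", [[0.95, 0.05], [0.90, 0.10]]),
    ("start ramipril prn",   [[0.55, 0.45], [0.60, 0.40]]),
]
print(rank_by_uncertainty(examples))  # most informative example first
```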
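Point 3 can be sketched as a self-training loop. The suffix-based "model" below is a deliberately tiny stand-in (in practice you would train a real NER model each round); the key idea is the confidence threshold: only pseudo-labels the model is confident about get folded back into the training set.

```python
def train(examples):
    """Toy 'model': the set of 3-character suffixes seen on DRUG tokens."""
    return {tok[-3:] for tok, label in examples if label == "DRUG"}

def predict(model, tok):
    """Return (label, confidence). Suffix match -> DRUG, high confidence."""
    return ("DRUG", 0.95) if tok[-3:] in model else ("O", 0.6)

def self_train(labeled, unlabeled, threshold=0.9, rounds=2):
    """Train, pseudo-label confident examples, retrain, repeat."""
    labeled = list(labeled)
    pool = list(unlabeled)
    for _ in range(rounds):
        model = train(labeled)
        confident, rest = [], []
        for tok in pool:
            label, conf = predict(model, tok)
            (confident if conf >= threshold else rest).append((tok, label))
        labeled += confident          # pseudo-labels join the training set
        pool = [tok for tok, _ in rest]
    return train(labeled)

labeled = [("ramipril", "DRUG"), ("therapy", "O")]
model = self_train(labeled, ["enalapril", "aspirin", "daily"])
print(predict(model, "lisinopril"))  # generalizes to an unseen drug name
```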
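Point 4, bootstrapping from seed examples, can be sketched with single-token context patterns. The trigger-word heuristic here is purely illustrative (real systems use richer patterns and score candidates before accepting them), but it shows the grow-the-lexicon loop:

```python
def bootstrap_entities(corpus, seeds, rounds=2):
    """Grow an entity list from a few seeds: learn which words precede
    known entities, then harvest new entities after the same triggers."""
    lexicon = set(seeds)
    for _ in range(rounds):
        # Learn trigger words from current lexicon.
        triggers = set()
        for sentence in corpus:
            tokens = sentence.replace(".", "").split()
            for i, tok in enumerate(tokens[1:], start=1):
                if tok in lexicon:
                    triggers.add(tokens[i - 1])
        # Harvest: any word following a trigger becomes a candidate entity.
        for sentence in corpus:
            tokens = sentence.replace(".", "").split()
            for i, tok in enumerate(tokens[:-1]):
                if tok in triggers:
                    lexicon.add(tokens[i + 1])
    return lexicon

corpus = [
    "doctor prescribed aspirin today.",
    "doctor prescribed ramipril today.",
    "nurse administered metformin.",
]
# Starting from one seed, "ramipril" is discovered via the shared
# "prescribed" context; "metformin" is not, since its trigger was never learned.
print(bootstrap_entities(corpus, {"aspirin"}))
```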
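Point 5 is especially relevant to your situation: since your transcribed data already contains the final entity strings, a simple string-matching rule can recover the missing start/end offsets and produce silver-standard training data. The helper below is a hypothetical sketch (names are made up), matching longest entities first so longer mentions win over their substrings:

```python
import re

def align_entities(text, entity_values, label):
    """Locate known entity strings in raw text to recover
    (start, end, label) character offsets for NER training."""
    spans = []
    # Sort longest-first so "low-dose aspirin" beats plain "aspirin".
    for value in sorted(set(entity_values), key=len, reverse=True):
        for m in re.finditer(re.escape(value), text):
            span = (m.start(), m.end(), label)
            # Skip spans overlapping an already-claimed region.
            if not any(s < span[1] and span[0] < e for s, e, _ in spans):
                spans.append(span)
    return sorted(spans)

text = "Patient received aspirin and low-dose aspirin therapy."
print(align_entities(text, ["aspirin", "low-dose aspirin"], "DRUG"))
# -> [(17, 24, 'DRUG'), (29, 45, 'DRUG')]
```

Be aware of the usual caveats with this kind of silver data: ambiguous strings and coincidental matches will produce noisy spans, which is another reason to keep a manually annotated test set.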

Remember, it's crucial to always validate the results of these methods with a separate, manually annotated test set to ensure the quality of your model's predictions.

We detail many of the concepts above in the Prodigy docs.

This also isn't a comprehensive list - there are many more options, and you can likely find more by searching this forum.

Hope this helps!