Pre-train spaCy NER for healthcare data

Apache cTAKES does a great job of identifying NER labels focused on healthcare data. Can I take the output NER labels (pre-processed) from Apache cTAKES, feed them to spaCy's word embedding model using some JSON format, and create a general representation for accurate classification problems?

http://healthnlp.github.io/examples/

If you have an initial NER system that’s doing a good job, I would suggest creating a custom recipe which uses Prodigy to mark the suggested annotation as correct or incorrect. This can help you quickly create a vetted training data set which you can use to train spaCy or another tool.
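For example, a minimal custom recipe could look something like the sketch below. This is only illustrative: `ctakes_annotate` is a hypothetical helper you would write yourself to call your cTAKES pipeline and return character-offset spans, and the recipe and file names are made up.

```python
# Sketch of a custom Prodigy recipe that streams in text, attaches entity
# spans suggested by cTAKES, and asks you to accept or reject each example.
import prodigy
from prodigy.components.loaders import JSONL


@prodigy.recipe("ner.ctakes-correct")
def ctakes_correct(dataset, source):
    def add_ctakes_spans(stream):
        for eg in stream:
            # ctakes_annotate() is a placeholder: call your cTAKES service here
            # and return a list of {"start", "end", "label"} dicts with
            # character offsets into eg["text"].
            eg["spans"] = ctakes_annotate(eg["text"])
            yield eg

    stream = JSONL(source)          # a JSONL file with one {"text": "..."} per line
    stream = add_ctakes_spans(stream)
    return {
        "dataset": dataset,         # dataset the accept/reject answers are saved to
        "stream": stream,
        "view_id": "ner",           # show highlighted spans for binary feedback
    }
```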

It might also be possible to use the NER model in Apache cTAKES directly in Prodigy's ner.teach recipe. However, this might also be a bit difficult: spaCy supports a fairly sophisticated training procedure that lets it learn from sparse annotations, where it doesn't know the fully correct annotation for the sentences in the training data. Achieving the same result with another model may or may not be easy.
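To make the "sparse annotations" point concrete, here's a rough spaCy v2-style sketch. The example sentence, the MEDICATION label, and the base model are just placeholders, not anything produced by cTAKES.

```python
# Update a pretrained model on a partially annotated sentence. "-" marks
# tokens whose entity annotation is unknown, so the model is only corrected
# on the span we trust (here "aspirin").
import spacy
from spacy.gold import GoldParse

nlp = spacy.load("en_core_web_sm")
ner = nlp.get_pipe("ner")
ner.add_label("MEDICATION")         # example label, not in the base model

doc = nlp.make_doc("Patient was given aspirin for chest pain.")
biluo = ["-", "-", "-", "U-MEDICATION", "-", "-", "-", "-"]
gold = GoldParse(doc, entities=biluo)

# Create an optimizer without re-initialising the pretrained weights,
# and only update the NER component.
optimizer = ner.create_optimizer()
other_pipes = [p for p in nlp.pipe_names if p != "ner"]
with nlp.disable_pipes(*other_pipes):
    losses = {}
    nlp.update([doc], [gold], sgd=optimizer, losses=losses)
    print(losses)
```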

In summary, I would suggest the following workflow:

  1. Create a recipe that passes text through Apache cTAKES (along the lines of the sketch above). You can either ask about all the annotations on a sentence, or mark the annotations one-by-one.

  2. Use Prodigy’s ner.batch-train command to train a new spaCy model, which will be saved to disk.

  3. If you wish to improve accuracy further, you can use your newly trained spaCy model with the ner.teach command (example commands for steps 2 and 3 are sketched below).
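As a rough illustration of steps 2 and 3, the commands would look something like this. The dataset name, base model, output directory and source file are all placeholders.

```bash
# Step 2: train a spaCy model from the annotations vetted in step 1
prodigy ner.batch-train ctakes_ner_dataset en_core_web_sm --output ./healthcare-ner-model

# Step 3: improve the new model further with active learning on unlabelled text
prodigy ner.teach ctakes_ner_dataset ./healthcare-ner-model unlabelled_notes.jsonl --label MEDICATION
```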

Hope that helps — let us know how you go :slight_smile: