We just released a new video that shows the workflow for adding a new entity type from scratch. We’ve been fine-tuning this workflow for some time, so we’re super excited to have it working so well now.
The video shows you how to create a terminology list with terms.teach, starting from just three seed terms. After converting the word list to a pattern-match file using terms.to-patterns, we use the patterns file to bootstrap a new entity type, using ner.teach. The neural network starts out with no examples of the class, but you get suggested matches from the patterns file built from the terms.teach word list. The suggestions you accept then become positive examples for the neural network. This is enough to get the model to start suggesting phrases as well, and its suggestions are mixed in with those from the pattern matcher. Before long, the statistical model takes over, and the normal active learning process can continue.
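To make the word-list-to-patterns step more concrete, here is a minimal Python sketch of what such a conversion could look like. It is not the terms.to-patterns recipe itself. The seed terms, the DRUG label, and the helper names are hypothetical, and the entry layout (a label plus a list of lowercase token matches) is an assumption about the patterns format rather than something confirmed by the post.

```python
import json


def terms_to_patterns(terms, label):
    """Build one token-based pattern entry per term (assumed format)."""
    patterns = []
    for term in terms:
        # Split on whitespace so multi-word terms become multi-token patterns,
        # and match on the lowercase form so casing in the text does not matter.
        tokens = [{"lower": tok.lower()} for tok in term.split()]
        patterns.append({"label": label, "pattern": tokens})
    return patterns


def to_jsonl(patterns):
    """Serialize the patterns as one JSON object per line."""
    return "\n".join(json.dumps(p) for p in patterns)


# Hypothetical word list; the post does not say which seed terms were used.
seed_terms = ["heroin", "fentanyl", "Oxycodone hydrochloride"]
patterns = terms_to_patterns(seed_terms, "DRUG")
print(to_jsonl(patterns))
```

Each line of the output describes one match rule, which is the kind of file a pattern matcher can use to suggest candidate spans before the model has seen any examples of the new class.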
As an example of this bootstrapping process, we’ve trained a new entity recognition model to detect references to drugs in social media text. I’m hoping to use the model in a small data science project, using text from a large online community of opiate users. I want to look at how often different substances have been mentioned in these discussions over time, to see how the popularity of different substances such as synthetic opioids might relate to health outcomes such as overdose rates.
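Once the model has tagged the posts, counting mentions over time could look roughly like the sketch below. It assumes each recognized entity has already been reduced to a (date, substance) pair. The function name and the sample records are made up purely for illustration and are not real data from the project.

```python
from collections import Counter, defaultdict


def count_mentions_by_month(mentions):
    """Tally substance mentions per calendar month.

    `mentions` is an iterable of (iso_date, substance) pairs, where
    iso_date is a string such as "2015-01-03".
    """
    counts = defaultdict(Counter)
    for iso_date, substance in mentions:
        month = iso_date[:7]  # keep only the "YYYY-MM" prefix
        counts[month][substance.lower()] += 1
    return counts


# Made-up records standing in for entities extracted by the model.
mentions = [
    ("2015-01-03", "heroin"),
    ("2015-01-19", "Fentanyl"),
    ("2015-02-02", "fentanyl"),
    ("2015-02-10", "fentanyl"),
]
counts = count_mentions_by_month(mentions)
for month in sorted(counts):
    print(month, dict(counts[month]))
```

Monthly counts like these could then be lined up against an external series such as overdose rates, though normalizing by total post volume per month would probably be needed before drawing any comparison.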