An update for posterity! Nothing below is urgent, so I hope you all have a great holiday.
For one new label, using domain-specific texts:
- Created and reviewed 625 annotations from about 17k patterns (generated from some publicly available data) and about 17k texts. This exhausted the matched patterns, so now we're on to the unmatched texts (the split is roughly as sketched below).
- Decided to treat myself by checking out the training curve, and I'm glad I did: using those 625 examples and the en_vectors_web_lg starter model, we're at 89.24 accuracy (F-score?) and look set to improve with further annotations. This is astounding.
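In case it's useful context, here's roughly how I'm splitting the texts into pattern-matched vs. unmatched. This is a minimal sketch rather than my exact script: it assumes spaCy 2.x, Prodigy-style token patterns in a patterns.jsonl, and placeholder file names.

```python
import json
import spacy
from spacy.matcher import Matcher

nlp = spacy.blank("en")  # tokenizer only; no statistical model needed for this step
matcher = Matcher(nlp.vocab)

# Load Prodigy-style patterns: one JSON object per line with "label" and "pattern"
with open("patterns.jsonl", encoding="utf8") as f:
    for i, line in enumerate(f):
        entry = json.loads(line)
        if isinstance(entry["pattern"], list):  # token patterns; plain strings would need a PhraseMatcher
            matcher.add("PATTERN_%d" % i, None, entry["pattern"])  # spaCy 2.x add() signature

# Load the raw texts, one JSON object per line with a "text" key
with open("texts.jsonl", encoding="utf8") as f:
    texts = [json.loads(line)["text"] for line in f]

# Bucket each text by whether any pattern matches it
matched, unmatched = [], []
for text, doc in zip(texts, nlp.pipe(texts)):
    (matched if matcher(doc) else unmatched).append(text)

print("%d matched, %d unmatched" % (len(matched), len(unmatched)))
```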
Reflections:
- Really happy we went with manual annotation: I reasoned that even if we didn't end up with a useful model, we'd at least have a source of truth for the current set of texts. But it looks like we will end up with a model after all.
- The accuracy score is high enough to make me both delighted and a bit skeptical, so I'll be doing some digging. We'll certainly be sanity-checking this model against additional examples (roughly along the lines of the sketch after this list), and I'm looking forward to seeing how further manual, non-pattern-matched annotations affect the accuracy.
- I'm not convinced that I have a source of text that'd be worth pretraining against, nor do I have ready GPU access for some quick tests. So my instinct is to hold off on that and continue annotating.
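For what it's worth, the sanity check I have in mind is nothing fancy: run the trained model over a held-out set of annotated examples and compare the predicted spans with the gold spans. A rough sketch, assuming Prodigy-style "spans" with character offsets; the paths and JSONL layout are placeholders:

```python
import json
import spacy

nlp = spacy.load("./trained_model")  # the model trained on the 625 annotations

# Held-out examples: {"text": ..., "spans": [{"start": ..., "end": ..., "label": ...}]}
with open("holdout.jsonl", encoding="utf8") as f:
    examples = [json.loads(line) for line in f]

tp = fp = fn = 0
for eg in examples:
    doc = nlp(eg["text"])
    gold = {(s["start"], s["end"], s["label"]) for s in eg.get("spans", [])}
    pred = {(ent.start_char, ent.end_char, ent.label_) for ent in doc.ents}
    tp += len(gold & pred)
    fp += len(pred - gold)
    fn += len(gold - pred)

precision = tp / (tp + fp) if tp + fp else 0.0
recall = tp / (tp + fn) if tp + fn else 0.0
f_score = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
print("P %.1f%%  R %.1f%%  F %.1f%%" % (precision * 100, recall * 100, f_score * 100))
```

That should also give me an exact-span F-score to compare against the 89.24 from the training curve.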
Some new questions:
- Related to GloVe vectors, is en_vectors_web_lg derived from the 840B-token Common Crawl set here? I couldn't tell for sure from the spaCy docs or repo.
- I've had some success using the xx_ent_wiki_sm model for NER in the past and think wiki vectors might work better for my use case. If one wanted to train against a different GloVe download, such as the wiki vectors, should one still use this script for conversion and then drop in the path to the output at the command line? (h/t and thanks @justindujardin). I'm happy to test it out (roughly along the lines of the sketch after these questions), but I'm curious, off the top of your head, whether spaCy's internals/IO for vectors have shifted in the meantime.
- I found a comment on this forum a helpful way to think about ner.teach: it's a great annotator-helper. With that in mind, might we use ner.teach with a model based on the matched patterns to annotate the texts that didn't have matching patterns, or is it best to soldier on with ner.manual? As I said earlier, my instinct is to stick with what's working and get to some ground truth for use down the line, but I'm interested in your take.
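To make the wiki-vectors question concrete, here's a minimal sketch of what I imagine the conversion amounts to: loading a plain-text GloVe file into a blank spaCy model with Vocab.set_vector and saving it to disk. This assumes spaCy 2.x; the file path and language code are placeholders, and the real conversion script may well do more than this.

```python
import numpy
import spacy

nlp = spacy.blank("en")

# Placeholder path to a wiki-trained GloVe download (one "word v1 v2 ... vN" per line)
with open("glove.6B.300d.txt", encoding="utf8") as f:
    for line in f:
        pieces = line.rstrip().split(" ")
        word, vector = pieces[0], numpy.asarray(pieces[1:], dtype="float32")
        nlp.vocab.set_vector(word, vector)

# Save the vectors-only model, then point training at this path instead of en_vectors_web_lg
nlp.to_disk("./wiki_vectors_model")
```

If the conversion script plus a path at the command line is still the supported route, I'm happy to stick with that; the sketch is just to show where my head is at.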
Consider this validation of your encouragement here and elsewhere that ner.manual is a great place to start when you have a new label, and hats off to the s2v blog post for lighting the way!
Thanks very much for all you and the team do, and Happy New Year.