Prodigy does come with an ner.mark recipe that uses the boundaries interface, which lets you highlight spans of text. You can see an example of this in the recipes overview. However, since marking entities manually is often unnecessarily tedious, you should only have to use this for edge cases or if your goal is to create gold-standard annotations.
To get over the “cold start problem” when training a new entity label, Prodigy lets you pass in a list of match patterns describing examples of the entities you’re looking for. Match patterns can include all properties available for spaCy’s rule-based matcher – so you can match single or multi-word terms token by token, or use other linguistic annotations like part-of-speech tags. You can also use the terms.teach and terms.to-patterns recipes to create a terminology list from a number of seed terms using word vectors, and convert the list to match patterns.
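To give a rough idea of what such a patterns file can look like, here's a minimal sketch that builds one JSON object per line, each with a label and a list of token descriptions. The seed terms, the FRUIT label and the terms_to_patterns helper are made up for illustration, and splitting on whitespace is only an approximation of how spaCy tokenizes text, so treat this as a sketch rather than what terms.to-patterns does internally.

```python
import json


def terms_to_patterns(terms, label):
    """Turn plain terms into token-based match patterns.

    Hypothetical helper: each whitespace-separated word becomes one
    token description that matches on the lowercase form of the token.
    """
    patterns = []
    for term in terms:
        tokens = [{"lower": word.lower()} for word in term.split()]
        patterns.append({"label": label, "pattern": tokens})
    return patterns


# Made-up seed terms and label, purely for illustration.
seed_terms = ["apple", "blood orange", "Granny Smith"]
patterns = terms_to_patterns(seed_terms, "FRUIT")

# One JSON object per line (JSONL), e.g. to save as a patterns file.
jsonl = "\n".join(json.dumps(pattern) for pattern in patterns)
print(jsonl)
```

Matching on the lowercase form keeps the patterns case-insensitive, which is usually what you want for a first pass. If you need something stricter, you could swap in other token attributes supported by spaCy's matcher.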
When you start training, Prodigy uses the patterns to start suggesting entities and will collect the first set of examples to update the model in the loop. As the model improves, it will also start suggesting entities based on what it’s learned so far from the pattern matches.
We actually just recorded another video tutorial that shows an end-to-end example of training a new entity type from scratch starting off with only 3 seed terms:
You can find more details in the docs and this thread. I’ve also posted a quick TL;DR version of the workflow in this comment.