I apologize, but I don't think I quite understand how to use Prodigy with custom entity types. I have been able to teach using the out-of-the-box entities (e.g. Person, Org etc.), but can't figure out how to use the Prodigy UI to train on entities that I define in my annotations (e.g. Dog, Cat, Bird). I loaded my own data and annotations following the steps listed here:
but I am stuck after that. Could anyone offer any assistance?
Thanks for the question – this is actually perfect timing! We just released a video that shows a Prodigy workflow dealing with this exact topic. We’ve spent a lot of time tuning this workflow, and we’re really happy to see it working pretty well now! For more info and a quick summary, see this thread and this docs page. You can find more details on the recipes used on this page, or in your PRODIGY_README.html.
TL;DR
Note that this workflow requires Prodigy v1.1.0.
Use the terms.teach recipe with word vectors (e.g. a large spaCy model) to create a terminology list of examples of your new entity type. For a DOG entity, you could for example start off with the seed terms “labrador”, “golden retriever” and “poodle”. Based on the vectors, Prodigy will suggest similar words for you to add to your list – for example, “corgi”.
Convert your terminology list to match patterns that can be loaded by spaCy’s Matcher using the terms.to-patterns recipe. This will give you a JSONL file with entries like {"label": "DOG", "pattern": [{"lower": "golden"}, {"lower": "retriever"}]}.
Collect annotations for the new entity type using ner.teach with your patterns file as the --patterns argument. The patterns are used to suggest entities found in your data – this helps you collect a bunch of relevant examples first, to get over the “cold start problem”. As the model in the loop improves, it will also start suggesting entities based on what it’s learned so far. You’ll probably want to collect a few hundred annotations before running the first training experiments.
Train your model using ner.batch-train and export it. Hopefully, you’ll now see a nice, initial accuracy score! You can also run ner.train-curve to see how accuracy improves with more data.
Test your exported model on real data (and make sure to use texts the model hasn’t seen during training). You can either use ner.print-stream to get some nicely formatted command line output, or load the model into spaCy using spacy.load('/path/to/model') and check out the doc.ents.
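To make the role of the patterns file more concrete, here is a minimal sketch of how entries like the one above pick out candidate entities in a text. It is a toy stand-in written in plain Python, not spaCy's Matcher or anything Prodigy runs internally: the find_matches helper is hypothetical, and it assumes naive whitespace tokenization and patterns that only use the "lower" attribute.

```python
import json

def find_matches(text, patterns):
    """Return (label, start_token, end_token, matched_text) for each pattern hit.

    Hypothetical helper for illustration only; spaCy's Matcher supports
    many more token attributes and uses a much more careful tokenizer.
    """
    # Naive tokenization: split on whitespace, strip surrounding punctuation.
    tokens = [t.strip(".,;:!?") for t in text.split()]
    lowered = [t.lower() for t in tokens]
    matches = []
    for entry in patterns:
        # Each pattern is a list of token descriptions, one dict per token.
        wanted = [tok["lower"] for tok in entry["pattern"]]
        n = len(wanted)
        # Slide a window of n tokens over the text and compare lowercase forms.
        for i in range(len(lowered) - n + 1):
            if lowered[i:i + n] == wanted:
                matches.append((entry["label"], i, i + n, " ".join(tokens[i:i + n])))
    return matches

# Two lines in the patterns JSONL format described above.
patterns_jsonl = """\
{"label": "DOG", "pattern": [{"lower": "golden"}, {"lower": "retriever"}]}
{"label": "DOG", "pattern": [{"lower": "poodle"}]}
"""
patterns = [json.loads(line) for line in patterns_jsonl.splitlines() if line.strip()]

matches = find_matches("My Golden Retriever chased a poodle.", patterns)
for match in matches:
    print(match)
```

Each hit is a candidate that ner.teach could then show you to accept or reject. Note that a two-token pattern only matches two consecutive tokens, which is why single-word patterns never produce multi-word suggestions.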
I then extracted the new JSONL and would have expected to find spans with the Brand labels. Instead, I got this (a sample row from the new JSONL):
{"label":"Brand","pattern":[{"lower":"We own two of the most highly recognized and most trusted brand names in the children’s apparel industry, Carter’s and OshKosh B’gosh (or “OshKosh”), and a leading baby and young child lifestyle brand, Skip Hop."}]}
What am I doing wrong please?
Thank you in advance!
Or are you running terms.to-patterns somewhere instead? The output you posted looks exactly like the output of the terms.to-patterns recipe – here, Prodigy takes a dataset, only gets the "text" value and creates patterns based on the terms.
In this case yes, because you want to export the annotations and not create match patterns, right?
The terms.to-patterns recipe is for converting a dataset of words to patterns. You can then use those patterns to collect better annotations with ner.teach and other recipes – just like you did.
If you just want to export JSONL data from the database, you can use the db-out command:
Ah, got it. That makes sense now that I am looking at the db-out output. One more question: none of my annotations have two words, e.g. “OshKosh B’gosh”, or, better, multiple brands in the same sentence, e.g. “We sell OshKosh B’gosh, Top Hat and Skippy brand products.” Here I would have 3 brands, with 2 of them being multiple words.
How do I overcome that? Should I use the ner_manual annotation interface to annotate manually?
Note that my initial seed terms did include a few such terms, e.g. (I have shortened the terms below for brevity):
In your case, it sounds like the word vectors you used only included single tokens. This is why your patterns also only include single words. One solution for matching multi-word patterns is to simply edit the patterns manually, or create them based on a list or dictionary. The patterns format is pretty simple and you can either specify token patterns like "pattern": [{"lower": "top"}, {"lower": "hat"}] or exact strings like "pattern": "Top Hat".
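If you already have a list of brand names, creating the multi-word patterns can be scripted. The following is only a rough sketch under some assumptions: the terms_to_patterns function and the brand list are made up for illustration, and the terms are split on whitespace, which will not always agree with spaCy's tokenization (a name containing an apostrophe, for example, is typically split into more tokens by spaCy, so such entries would need to be adjusted by hand).

```python
import json

def terms_to_patterns(terms, label):
    """Turn a list of terms into token-based match patterns.

    Hypothetical helper: one {"lower": ...} dict per whitespace-separated
    word, so "Top Hat" becomes a two-token pattern.
    """
    patterns = []
    for term in terms:
        token_pattern = [{"lower": word.lower()} for word in term.split()]
        patterns.append({"label": label, "pattern": token_pattern})
    return patterns

# Example brand names taken from this thread; replace with your own list.
brands = ["Top Hat", "Skip Hop", "Skippy"]
patterns = terms_to_patterns(brands, "Brand")

# One JSON object per line, i.e. the JSONL layout of a patterns file.
lines = [json.dumps(entry) for entry in patterns]
for line in lines:
    print(line)
```

You could write those lines to a .jsonl file and pass it to ner.teach via --patterns, in the same way as the file created by terms.to-patterns.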
At the moment, we don't have a built-in solution for phrase vectors – but we're working on integrating sense2vec, which should be really cool. This would let you train context vectors on your data, and then use terms.teach to find phrases that are similar to the ORG Top Hat (i.e. multi-word tokens and meaning in context).
This is also an option. The ner.manual recipe also respects pre-defined entities. And the annotations you export from Prodigy are in the same format as the stream you load into Prodigy. So you can always export your dataset and "post-annotate" it manually to fix mistakes. You probably want to filter only the accepted answers, though!
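As a rough sketch of that filtering step: assuming each exported line is a JSON object with "text", "spans" and an "answer" value of "accept" or "reject", you could keep only the accepted examples like this. The accepted_examples helper and the two sample records are made up for illustration.

```python
import json

def accepted_examples(jsonl_text):
    """Return only the records whose answer is "accept".

    Hypothetical helper; assumes one JSON object per line with an
    "answer" key, as in the exported annotations.
    """
    examples = []
    for line in jsonl_text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        if record.get("answer") == "accept":
            examples.append(record)
    return examples

# Two made-up records standing in for an exported dataset.
exported = """\
{"text": "We sell Top Hat products.", "spans": [{"start": 8, "end": 15, "label": "Brand"}], "answer": "accept"}
{"text": "We sell hats.", "spans": [{"start": 8, "end": 12, "label": "Brand"}], "answer": "reject"}
"""

kept = accepted_examples(exported)
for record in kept:
    for span in record["spans"]:
        # Character offsets index into the record's text.
        print(span["label"], record["text"][span["start"]:span["end"]])
```

The filtered records can then be written back out as JSONL and loaded into ner.manual to correct or extend the spans.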