Thanks!
Just to confirm: You've annotated a bunch of entities in context and only want to extract the entities as patterns? That's an interesting approach I haven't thought of before, but definitely possible. I can also see how it makes sense in some cases – e.g. if you want to see the entities in different contexts etc.
The data produced by Prodigy follows a simple JSON format – so you can always use Python (or any other language, really) to convert it and extract whatever you need from it. That's pretty important to Prodigy's philosophy – we don't want to lock you in, and you should always have access to your collected data in a format that's easy to work with. To see the format of the annotations, you can use the db-out command:
```
prodigy db-out your_manual_ner_dataset | less
```
Each manually annotated entity is included as a "span" on the example – so you can extract the patterns with a few lines of Python:
```python
from prodigy.components.db import connect

db = connect()  # connect to the database using your settings
examples = db.get_dataset('your_manual_ner_dataset')

patterns = []
for eg in examples:                  # iterate over the examples
    if eg['answer'] == 'accept':     # you only want to use accepted answers
        for span in eg.get('spans', []):
            start = span['start']    # start offset in original text
            end = span['end']        # end offset in original text
            label = span['label']    # assigned label
            span_text = eg['text'][start:end]  # slice of text
            patterns.append({'label': label, 'pattern': span_text})
```
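To make the record format concrete, here's the same extraction loop run over a single made-up example (the text and offsets are invented for illustration, but the fields match what db-out produces):

```python
# a simplified, made-up record in Prodigy's annotation format
examples = [
    {
        "text": "The tree kangaroo lives in New Guinea.",
        "answer": "accept",
        "spans": [{"start": 4, "end": 17, "label": "ANIMAL"}],
    }
]

patterns = []
for eg in examples:
    if eg["answer"] == "accept":
        for span in eg.get("spans", []):
            span_text = eg["text"][span["start"]:span["end"]]
            patterns.append({"label": span["label"], "pattern": span_text})

print(patterns)  # [{'label': 'ANIMAL', 'pattern': 'tree kangaroo'}]
```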
The above code will produce pattern entries that look like this:
{"label": "ANIMAL", "pattern": "tree kangaroo"}
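To use the patterns with a recipe like ner.teach, you'd typically save them as a JSONL file (one JSON object per line), which is what the --patterns argument expects. A minimal sketch (the filename is just an example):

```python
import json

patterns = [{"label": "ANIMAL", "pattern": "tree kangaroo"}]

# write one JSON object per line (JSONL)
with open("animal_patterns.jsonl", "w", encoding="utf8") as f:
    for pattern in patterns:
        f.write(json.dumps(pattern) + "\n")
```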
If you want to produce token patterns instead (like [{"lower": "tree"}, {"lower": "kangaroo"}]), you probably want to tokenize the span_text with spaCy's tokenizer (the same model you used during manual annotation) and then create one entry per token. If your entities are very simple and don't contain punctuation, you could also just split on whitespace.
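Here's a minimal sketch of the whitespace-split variant (the helper name is made up). For entities that might contain punctuation, you'd swap the .split() for spaCy's tokenizer, e.g. [token.text for token in nlp(span_text)]:

```python
def to_token_pattern(span_text):
    # one {"lower": ...} entry per token – whitespace splitting is only
    # safe for simple entities without punctuation
    return [{"lower": token.lower()} for token in span_text.split()]

token_patterns = [
    {"label": "ANIMAL", "pattern": to_token_pattern("tree kangaroo")}
]
print(token_patterns[0]["pattern"])
# [{'lower': 'tree'}, {'lower': 'kangaroo'}]
```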
You might also want to try pre-training your model on the already collected annotations, and then use this updated version as the base model for ner.teach – plus the patterns generated from your annotations. This means that the model will already start off with some knowledge of your entity type, and you'll have the terminology list to help you find mentions in different contexts.
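That workflow could look something like this on the command line (dataset, model and file names here are placeholders – adjust them, and the recipe names, to your setup and Prodigy version):

```
# pre-train on the existing manual annotations
prodigy ner.batch-train your_manual_ner_dataset en_core_web_sm --output ./pretrained-model

# then boot-strap further annotation with the updated model plus patterns
prodigy ner.teach new_dataset ./pretrained-model your_data.jsonl --patterns animal_patterns.jsonl
```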
This is difficult to answer, because it really depends on your data.
But as a rule of thumb, a few thousand annotations are usually a good start – sometimes more, though, depending on the complexity of the categories you're annotating. This is also the reason we've tried to offer different approaches and interfaces in Prodigy to help with this, which you can mix and match to see what works best. (For example, the patterns, a terminology list from word vectors, fully manual annotation from scratch, or ner.make-gold to correct the model's predictions etc.)