I think your idea of using the syntax is probably a good approach, especially since you might want to change the exact definition of what you're collecting. As a rule of thumb, if it's going to be hard to annotate the task with some combination of terms.teach, ner.teach and ner.manual, the NER model will probably struggle to learn the task from the annotations. So, I would recommend against a task definition that had you writing freely in text fields. It'll make the annotation really slow, and at the end of it, the annotations probably won't be internally consistent enough to train a useful model.
Here’s spaCy’s parse tree for your example sentence: https://demos.explosion.ai/displacy/?text=Prodigy%20is%20beautiful%20and%20efficient%20tool%20for%20all%20kinds%20of%20annotation%20tasks.&model=en_core_web_sm&cpu=1&cph=0
This is a nice example because it has attributes in three syntactic constructions: an adjectival modifier ("beautiful"), a conjoined adjective ("efficient"), and a prepositional phrase ("for all kinds of annotation tasks").
These descriptions all apply to the entity Prodigy. At a guess, I think you probably only want descriptions of some types of phrases, and not others? If so, I think you'll be best off targeting the annotation and machine learning towards identifying those entities, and then using rules to extract the attributes from the tree. We've been working on a dep.teach recipe as well, so if the default dependency trees aren't accurate enough on your data, you'll be able to correct them.
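If you'd like to check the same structure programmatically rather than in displaCy, here's a minimal sketch (assuming you have en_core_web_sm installed; the exact labels you see will depend on the model version):

import spacy

nlp = spacy.load('en_core_web_sm')
doc = nlp("Prodigy is beautiful and efficient tool for all kinds of annotation tasks.")
for token in doc:
    # Print each token with its dependency label and head, so you can see
    # which words attach to the noun being described.
    print(token.text, token.dep_, token.head.text)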
I think your rules will be cleanest if you have one filter per construction. Then you can test the filters individually, and adjust them if necessary. Here are example rules for the three types of construction. They assume a way of identifying the target entities.
def get_targets(doc):
    '''This is probably the function that you'll have to train a model for.'''
    return [ent.root for ent in doc.ents]

def get_adjectives(doc, targets):
    target_indices = set(w.i for w in targets)
    for word in doc:
        if word.dep_ == 'amod' and word.head.i in target_indices:
            yield word

def get_adjective_conjuncts(doc, targets):
    adjective_indices = set(w.i for w in get_adjectives(doc, targets))
    for word in doc:
        if word.dep_ == 'conj' and word.head.i in adjective_indices:
            yield word

def get_prepositions(doc, targets):
    target_indices = set(w.i for w in targets)
    for word in doc:
        # The preposition attaches to the target; its subtree is the full phrase.
        if word.dep_ == 'prep' and word.head.i in target_indices:
            # Get a span for the word's subtree, to get the whole phrase.
            start = word.left_edge.i
            end = word.right_edge.i + 1
            yield doc[start : end]
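For instance, here's a rough way to try the filters on the example sentence. Note that the get_targets placeholder just uses whatever entities the model happens to predict, so you may get no targets at all until you've trained something better for your data:

import spacy

nlp = spacy.load('en_core_web_sm')
doc = nlp("Prodigy is beautiful and efficient tool for all kinds of annotation tasks.")

targets = get_targets(doc)
print('Adjectives:', [w.text for w in get_adjectives(doc, targets)])
print('Conjunct adjectives:', [w.text for w in get_adjective_conjuncts(doc, targets)])
print('Prepositional phrases:', [span.text for span in get_prepositions(doc, targets)])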
To test and improve your current filters, have a look at the prodigy.recipes.ner.make_gold recipe. It uses a model to suggest entities, and then lets you correct them by highlighting spans. All you would need to do is replace the statistical span prediction the recipe does by default with your rule logic. Once you have a stream of examples with your suggested spans, you can start adding missing ones, or removing false positives. After you've collected enough examples that your rules get wrong, you can adjust your rules and then evaluate your accuracy against your previous annotations.
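As a rough sketch of that idea (this isn't the make_gold recipe itself, just the pre-annotation step, and the make_tasks helper and the ATTRIBUTE label are hypothetical names): you can turn your rule output into Prodigy-style task dicts with the suggested spans attached, and then review and correct them in a manual interface like ner.manual:

def make_tasks(nlp, texts, label='ATTRIBUTE'):
    '''Pre-annotate texts with rule-based spans, in Prodigy's task format.'''
    for text in texts:
        doc = nlp(text)
        targets = get_targets(doc)
        spans = []
        # Single-token suggestions from the adjective rules.
        for word in list(get_adjectives(doc, targets)) + list(get_adjective_conjuncts(doc, targets)):
            spans.append({'start': word.idx, 'end': word.idx + len(word.text), 'label': label})
        # Multi-token suggestions from the prepositional phrase rule.
        for span in get_prepositions(doc, targets):
            spans.append({'start': span.start_char, 'end': span.end_char, 'label': label})
        yield {'text': text, 'spans': spans}

Each dict uses character offsets for the suggested spans, so every span your rules get wrong only costs you a click or a quick correction in the annotation interface.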
If you use statistically predicted metadata (like the parse tree, entity labels etc), and improve your rules against a test set, rule-based approaches are actually really good for lots of tasks. Researchers aren’t interested in rule-based systems, rightly I feel — rules won’t advance the field in interesting ways. But that doesn’t mean they don’t work for specific problems.