Manual text typing

I’m trying to make a trainset with certain custom entities (namely, a subject’s features), and the problem is that a multi-token entity can be scattered all over the annotated text, like so:

Prodigy is beautiful and efficient tool for all kinds of annotation tasks


  • is beautiful tool
  • is efficient tool
  • for all kinds of annotation tasks

I probably need some kind of manual text typing to annotate these kinds of things. Or maybe I need to do a syntactic analysis, extract noun chunks and then label them? Can you give me some advice on how to do that?

Thanks. :slight_smile:

I think your idea of using the syntax is probably a good approach, especially since you might want to change the exact definition of what you’re collecting. As a rule of thumb, if it’s going to be hard to annotate the task with some combination of terms.teach, ner.teach and ner.manual, the NER model will probably struggle to learn the task from the annotations. So, I would recommend against a task definition that had you writing freely in text fields. It’ll make the annotation really slow, and at the end of it, the annotations probably won’t be internally consistent enough to train a useful model.

Here’s spaCy’s parse tree for your example sentence:
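In case the tree image doesn’t come through, here’s a small sketch that hardcodes roughly the analysis en_core_web_sm produces for this sentence (exact labels can vary between model versions) and prints it as an indented tree:

```python
# Roughly the dependency analysis for the example sentence, as
# (word, dependency label, head word) triples. The ROOT points to
# itself. This is an approximation for illustration, not model output.
parse = [
    ("Prodigy", "nsubj", "is"),
    ("is", "ROOT", "is"),
    ("beautiful", "amod", "tool"),
    ("and", "cc", "beautiful"),
    ("efficient", "conj", "beautiful"),
    ("tool", "attr", "is"),
    ("for", "prep", "tool"),
    ("all", "det", "kinds"),
    ("kinds", "pobj", "for"),
    ("of", "prep", "kinds"),
    ("annotation", "compound", "tasks"),
    ("tasks", "pobj", "of"),
]

# Build a head -> children mapping, skipping the ROOT's self-loop.
children = {}
for word, dep, head in parse:
    if word != head:
        children.setdefault(head, []).append((word, dep))

def show(word, dep="ROOT", depth=0):
    '''Print the subtree rooted at `word`, indented by depth.'''
    print("  " * depth + f"{word} <{dep}>")
    for child, child_dep in children.get(word, []):
        show(child, child_dep, depth + 1)

show("is")
```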

This is a nice example because it has attributes in three syntactic constructions:

  • Adjectival modifier

  • Adjectival modifier via conjunction

  • Prepositional phrase

These descriptions all apply to the entity Prodigy. At a guess, I think you probably only want descriptions of some types of phrases, and not others? If so, I think you’ll be best off targeting the annotation and machine learning towards identifying those entities, and then using rules to extract the attributes from the tree. We’ve been working on a dep.teach recipe as well, so if the default dependency trees aren’t accurate enough on your data, you’ll be able to correct them.

I think your rules will be cleanest if you have one filter per construction. Then you can test the filters individually, and adjust them if necessary. Here are example rules for the three types of construction. They assume a way of identifying the target entities.

def get_targets(doc):
    '''This is probably the function that you'll have to train a model for.'''
    return [ent.root for ent in doc.ents]

def get_adjectives(doc, targets):
    target_indices = set(w.i for w in targets)
    for word in doc:
        if word.dep_ == 'amod' and word.head.i in target_indices:
            yield word

def get_adjective_conjuncts(doc, targets):
    adjective_indices = set(w.i for w in get_adjectives(doc, targets))
    for word in doc:
        if word.dep_ == 'conj' and word.head.i in adjective_indices:
            yield word

def get_prepositions(doc, targets):
    target_indices = set(w.i for w in targets)
    for word in doc:
        # The preposition itself attaches to the target; its subtree
        # covers the whole prepositional phrase, including the object.
        if word.dep_ == 'prep' and word.head.i in target_indices:
            start = word.left_edge.i
            end = word.right_edge.i + 1
            yield doc[start : end]
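As a quick sanity check, the filters can be exercised without loading a model, using minimal stand-in objects that mimic only the parts of the spaCy Token API the filters touch (i, text, dep_, head, left_edge, right_edge). The filters are repeated here so the snippet runs standalone, with the prepositional filter keyed on the 'prep' relation so the subtree includes the whole phrase; in real use, `doc` is a spaCy Doc and `targets` comes from your entity model:

```python
# Stand-in token objects for testing the filters without spaCy.
class Tok:
    def __init__(self, i, text, dep):
        self.i, self.text, self.dep_ = i, text, dep
        self.head = self.left_edge = self.right_edge = self

    def __repr__(self):
        return self.text

# Approximate parse of the example sentence (labels may vary by model).
words = [("Prodigy", "nsubj"), ("is", "ROOT"), ("beautiful", "amod"),
         ("and", "cc"), ("efficient", "conj"), ("tool", "attr"),
         ("for", "prep"), ("all", "det"), ("kinds", "pobj"),
         ("of", "prep"), ("annotation", "compound"), ("tasks", "pobj")]
heads = [1, 1, 5, 2, 2, 1, 5, 8, 6, 8, 11, 9]  # index of each head
doc = [Tok(i, t, d) for i, (t, d) in enumerate(words)]
for tok, h in zip(doc, heads):
    tok.head = doc[h]
doc[6].right_edge = doc[11]  # subtree of "for" spans "for ... tasks"

def get_adjectives(doc, targets):
    target_indices = set(w.i for w in targets)
    for word in doc:
        if word.dep_ == 'amod' and word.head.i in target_indices:
            yield word

def get_adjective_conjuncts(doc, targets):
    adjective_indices = set(w.i for w in get_adjectives(doc, targets))
    for word in doc:
        if word.dep_ == 'conj' and word.head.i in adjective_indices:
            yield word

def get_prepositions(doc, targets):
    target_indices = set(w.i for w in targets)
    for word in doc:
        if word.dep_ == 'prep' and word.head.i in target_indices:
            yield doc[word.left_edge.i : word.right_edge.i + 1]

targets = [doc[5]]  # pretend the model identified "tool" as the target
print(list(get_adjectives(doc, targets)))           # [beautiful]
print(list(get_adjective_conjuncts(doc, targets)))  # [efficient]
print(list(get_prepositions(doc, targets)))
```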

To test and improve your current filters, have a look at the recipe. It uses a model to suggest entities, and then lets you correct them by highlighting spans. All you would need to do is replace the statistical span prediction the recipe does by default with your rule logic. Once you have a stream of examples with your suggested spans, you can start adding missing ones, or removing false positives. Once you have enough examples that your rules get wrong, you can adjust your rules and then evaluate their accuracy against your previous annotations.
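To make that concrete, here's a minimal sketch of packaging rule-matched spans as Prodigy-style task dicts, in the {"text": ..., "spans": [{"start", "end", "label"}]} format the manual interface consumes. The make_task helper and the ATTRIBUTE label are illustrative, not part of Prodigy; in real use the character offsets would come from Span.start_char / Span.end_char:

```python
def make_task(text, matches, label="ATTRIBUTE"):
    '''Wrap (start_char, end_char) matches as a Prodigy-style task dict.'''
    spans = [{"start": start, "end": end, "label": label}
             for start, end in matches]
    return {"text": text, "spans": spans}

text = "Prodigy is beautiful and efficient tool"
# Character offsets of "beautiful" and "efficient" in the text above.
task = make_task(text, [(11, 20), (25, 34)])
print(task["spans"][0])  # {'start': 11, 'end': 20, 'label': 'ATTRIBUTE'}
```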

If you use statistically predicted metadata (like the parse tree, entity labels etc.), and improve your rules against a test set, rule-based approaches are actually really good for lots of tasks. Researchers aren’t interested in rule-based systems, and rightly so, I feel — rules won’t advance the field in interesting ways. But that doesn’t mean they don’t work for specific problems.

Thank you, I’ll try to implement that pipeline and probably report the results.