annotating entities in text documents

To add to Matt’s comment above, another suggestion: If you already have a lot of text and potential entities, you can also convert them to Prodigy’s JSONL format, and then use the mark recipe to simply collect annotations with no model in the loop. The annotations will be stored in your dataset, and you can then use them to train a model, and keep improving that model by loading it into ner.teach and correcting its predictions.

A good strategy for extracting entity candidates is to use spaCy v2.0’s new PhraseMatcher to find occurrences of the words from a terminology list in your data. (If you have word vectors trained on relevant data, you can also use Prodigy’s terms.teach recipe to help you select good candidates for your lists, or you can just do it manually.) For example:

import json

import spacy
from spacy.matcher import PhraseMatcher

nlp = spacy.blank('de')  # create a blank German class
matcher = PhraseMatcher(nlp.vocab)  # initialise the matcher with the vocab

# Add match patterns for persons, orgs, whatever else you need.
# This is the spaCy v2 signature; in spaCy v3, pass the patterns as a list
# instead: matcher.add('PERSON', [nlp(text) for text in persons])
persons = ['Angela Merkel', 'Veronica Ferres', 'Dieter Bohlen']
matcher.add('PERSON', None, *[nlp(text) for text in persons])
orgs = ['Telekom', 'Vodafone', 'Lufthansa', 'Google']
matcher.add('ORG', None, *[nlp(text) for text in orgs])

# your data, ideally split into sentences or smaller chunks
texts = ['Some sentence', 'Another sentence', 'Lots of sentences']
examples = []  # store annotation examples here

for text in texts:
    doc = nlp(text)  # process the text
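    matches = matcher(doc)  # find matches as (match_id, start, end) token indices
    # create spans for each match, using the match ID as the label, e.g.:
    # {'start': 0, 'end': 15, 'label': 'PERSON'}
    # Prodigy expects character offsets, so convert the token indices via a Span
    spans = [{'start': doc[start:end].start_char,
              'end': doc[start:end].end_char,
              'label': nlp.vocab.strings[m_id]}
             for m_id, start, end in matches]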
    # add to list of examples in format supported by Prodigy
    examples.append({'text': text, 'spans': spans})

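# write this to a .jsonl file, one JSON object per line
jsonl_data = '\n'.join([json.dumps(line) for line in examples])
with open('/path/to/my_data.jsonl', 'w', encoding='utf8') as f:
    f.write(jsonl_data)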

Your data could then look like this:

{"text": "Angela Merkel zeigt sich optimistisch", "spans": [{"start": 0, "end": 13, "label": "PERSON"}]}
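
Since the start and end values are read as character offsets into the text, a quick sanity check before loading the file is to slice each text with its span offsets and look at what comes out. Here's a minimal sketch using only the standard library. The helper name span_texts is made up for illustration and isn't part of Prodigy or spaCy.

```python
import json


def span_texts(jsonl_line):
    """Return (label, covered text) pairs for one line of JSONL data.

    Hypothetical helper for illustration, not part of Prodigy or spaCy.
    """
    example = json.loads(jsonl_line)
    text = example['text']
    # Slice the text with each span's character offsets
    return [(span['label'], text[span['start']:span['end']])
            for span in example.get('spans', [])]


line = json.dumps({
    'text': 'Angela Merkel zeigt sich optimistisch',
    'spans': [{'start': 0, 'end': 13, 'label': 'PERSON'}],
})
print(span_texts(line))  # [('PERSON', 'Angela Merkel')]
```

If the offsets are right, each pair shows the label next to the exact entity text. A clipped result such as "Angela Merk" means the end offset is too small, and a result that looks like it was cut at word positions usually means token indices were written instead of character offsets.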

The .jsonl file can be loaded by Prodigy, and you can annotate the matches in context to confirm that they are indeed the entities you’re looking for.

prodigy mark german_entities /path/to/my_data.jsonl

The above examples are probably fairly unambiguous – e.g. “Angela Merkel” will likely always be a PERSON entity. However, in the example of “Google”, you could easily end up with examples referring to both the company and the search engine (which you might want to label as a product, or something else). This really depends on your examples – so it’s always good to confirm the entities in context.
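
Once you've annotated, the accept/reject decisions can help you spot which terms are ambiguous. Here's a rough sketch, assuming the annotations were exported to JSONL (for example with prodigy db-out) and that each record carries an "answer" field of "accept", "reject" or "ignore". The function name count_answers and the sample records are made up for illustration.

```python
import json
from collections import Counter


def count_answers(jsonl_lines):
    """Count annotation answers per (label, matched text) pair.

    Hypothetical helper for illustration, not part of Prodigy.
    """
    counts = {}
    for line in jsonl_lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        eg = json.loads(line)
        answer = eg.get('answer')
        for span in eg.get('spans', []):
            # Recover the matched text from the character offsets
            key = (span['label'], eg['text'][span['start']:span['end']])
            counts.setdefault(key, Counter())[answer] += 1
    return counts


# Made-up records for illustration, not real annotation data
records = [
    {'text': 'Google kauft ein Startup', 'answer': 'accept',
     'spans': [{'start': 0, 'end': 6, 'label': 'ORG'}]},
    {'text': 'Ich habe es mit Google gesucht', 'answer': 'reject',
     'spans': [{'start': 16, 'end': 22, 'label': 'ORG'}]},
]
lines = [json.dumps(record) for record in records]
for key, answers in count_answers(lines).items():
    print(key, dict(answers))
```

A term that collects a mix of accepts and rejects, like "Google" in these sample records, is a sign that its label depends on context and that it probably shouldn't be labelled from a terminology list alone.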