Train a new NER entity with multi-word tokens

Hi Stephan – the good news is, your idea sounds feasible and your approach makes sense. You also got the "hard parts" right – but there are a few small issues that caused it not to work:

The reason [{"lower": "volcano eruption"}] doesn't match is that each dictionary in a pattern describes one token. So your instinct of splitting the phrase into multiple dictionaries was correct, but the additional whitespace token isn't actually necessary. spaCy's tokenizer splits on whitespace characters. While it preserves them in the .text_with_ws attribute to make sure no information is lost, single spaces don't usually end up as tokens of their own – only runs of multiple whitespace characters do. The term "volcano eruption" is therefore tokenized as ['volcano', 'eruption'], so your pattern will have to look like this:

[{"lower": "volcano"}, {"lower": "eruption"}]

Adding a {"is_space": true} token means that spaCy will look for a token "volcano", followed by a whitespace token, followed by "eruption", which is almost never the case. So it would match "volcano \n eruption", but not "volcano eruption".
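To see why, here's a toy illustration (plain Python, not spaCy's actual Matcher – the real implementation is more sophisticated): with whitespace-splitting tokenization, there simply is no whitespace token for the {"is_space": true} entry to match.

```python
def tokenize(text):
    # Split on whitespace, as spaCy does in the simple case;
    # the spaces themselves don't become tokens.
    return text.split()

def matches(pattern, tokens):
    """Check whether a list of token patterns matches a token
    sequence one-to-one. Each pattern dict may specify 'lower'
    (lowercase text) or 'is_space'."""
    if len(pattern) != len(tokens):
        return False
    for spec, token in zip(pattern, tokens):
        if "lower" in spec and token.lower() != spec["lower"]:
            return False
        if spec.get("is_space") and not token.isspace():
            return False
    return True

tokens = tokenize("volcano eruption")
print(tokens)  # ['volcano', 'eruption']

# The two-token pattern matches:
print(matches([{"lower": "volcano"}, {"lower": "eruption"}], tokens))  # True

# The three-token pattern with a whitespace token does not,
# because there's no third token for it to consume:
pattern_ws = [{"lower": "volcano"}, {"is_space": True}, {"lower": "eruption"}]
print(matches(pattern_ws, tokens))  # False
```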

Because the patterns depend on spaCy's tokenization, you can verify them by running the text through spaCy's tokenizer, and looking at the individual tokens it produces:

>>> doc = nlp(u"volcano-eruption")
>>> [token.text for token in doc]
['volcano', '-', 'eruption']

Alternatively, Prodigy also supports spaCy's PhraseMatcher – so instead of token patterns, you can include strings. Internally, those will be converted to Doc objects, so you won't have to worry about spaCy's tokenization. You can find more about the patterns.json format in the "Match patterns" section of your PRODIGY_README.html.

{"label": "DISASTER", "pattern": "volcano eruption"}
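For reference, both styles can live in the same patterns.jsonl file, one JSON object per line – a token pattern and a phrase pattern side by side:

```json
{"label": "DISASTER", "pattern": [{"lower": "volcano"}, {"lower": "eruption"}]}
{"label": "DISASTER", "pattern": "volcano eruption"}
```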

terms.to-patterns is mostly intended to convert a list of seed terms – so it expects each example to contain a "text" key holding the term that should be included in the pattern. This is usually the case if you create your seed terms from word vectors with terms.teach.

Your approach is pretty clever, though! To make it work, you can either check out the source of prodigy/recipes/terms.py and rewrite the terms.to-patterns recipe to take the text of each entry in the "spans". Or you can export the annotations you've created with ner.manual to a JSONL file, convert them, save them out and import them to a new dataset, which you can then convert to patterns using terms.to-patterns:

terms = []
for eg in examples:  # the annotations created with ner.manual
    spans = eg.get('spans', [])  # get the annotated spans
    for span in spans:
        text = eg['text'][span['start']:span['end']]  # the highlighted text
        terms.append({'text': text, 'label': span['label']})

You can then save out your terms to JSONL and add them to a new dataset using the db-in command. You'll then have a set in the same format that's usually produced when creating the seed terms – for example {'text': 'volcano eruption', 'label': 'DISASTER'}.
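As a quick sketch of that last step (the filenames and example terms here are just placeholders), writing the converted terms to a JSONL file means one JSON object per line, which is exactly what db-in expects:

```python
import json

# Terms in the format produced by the conversion loop above
terms = [
    {"text": "volcano eruption", "label": "DISASTER"},
    {"text": "earthquake", "label": "DISASTER"},
]

# Write one JSON object per line (the JSONL format)
with open("terms.jsonl", "w", encoding="utf8") as f:
    for term in terms:
        f.write(json.dumps(term) + "\n")
```

You could then import the file with something like prodigy db-in your_dataset terms.jsonl and run terms.to-patterns on the new dataset.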
