ner.teach does not suggest multiple tokens

ines · October 15, 2018, 11:48am

I think the problem here is that none of your patterns ever match – so all you get to see is the model's suggestions, which are completely random because it has no idea of your label "aliases" yet. The token based patterns describe one token per dict – so in the example I quoted above, spaCy / Prodigy will be looking for one token whose lowercase text matches "the existing 2021 notes", which will obviously never be true, because that string consists of 4 tokens.

Instead, you could phrase the pattern like this:

{"label": "aliases", "pattern": [{"lower": "the"}, {"lower": "existing"}, {"lower": "2021"}, {"lower": "notes"}]}

Also keep in mind that the idea of the patterns is to write "patterns", i.e. abstract descriptions of the tokens. This pattern here will match the exact string "the existing 2021 notes" – but unless this is a super common phrase in your data, it likely won't produce good results.

Instead, you could take advantage of the other token attributes accepted by the Matcher – for example "is_digit": true to match tokens like "2021", but also "1999" or "10". Or "like_num": true, which would match both "10" but also "ten".

{"label": "aliases", "pattern": [{"is_digit": true}, {"lower": "notes"}]}

To test your patterns interactively and check whether they match the way you expect them to, check out our interactive matcher demo:

Finally, I'm not 100% sure the entity definition you're going for here makes sense. Named entities should be internally consistent categories of "real world objects" or concepts, ideally even proper nouns. In your case, the patterns describe pretty long phrases and sentence fragments. Teaching the existing model that sort of definition will be really difficult.

Instead, you might want to consider focusing on improving the existing predictions of the smaller components and then using rules or the dependency parse to resolve the rest of the phrase (if the desired result is "the eixsting 2021 notes"). For example, the model already has a pretty solid definition of DATE and ORDINALnumbers. So instead of trying to teach it a completely different analysis, you could work on improving these predictions and ideally, also the parser on your specific data. You can then use the dependency parse to get the rest: "2012" refers to "notes" and the head of this phrase is "existing", and its article "the". This is a much better approach than framing this as a named entity recognition task.

These threads goes into more detail on statistical predictions vs. rules:

Also separately linking @honnibal's talk on how to define NLP problems and solve them through iteration. It shows some examples of using Prodigy, and discusses approaches for framing different kinds of problems and finding out whether something is an NER task or maybe a better fit for text classification, or a combination of statistical and rule-based systems.

Topic		Replies	Views
Train a new NER entity with multi-word tokens usage , ner , solved	15	9675	September 10, 2019
Multi-word entity seeding, entity context usage , ner	19	3961	November 1, 2019
ner.teach not giving relevant entities from patterns jsonl ner , done	21	2846	October 2, 2018
ner.teach suggests spaces as entities? usage , ner , solved	13	1674	August 3, 2018
NER Training for Corporate Names ner , best-practices	22	11391	September 4, 2019

ner.teach does not suggest multiple tokens

Related topics