ner.teach does not suggest multiple tokens


This question is very closely related to "How to score incompletely highlighted entities?".

I am trying to learn a new entity. The dataset looks like so:

{"text": " The notes will mature on August 15, 2018, and will be paid in U.S. dollars against presentation and surrender thereof at the corporate trust office of the Trustee. However, we may redeem the notes at our option prior to that date. See \"—Optional Redemption.\" The notes will not be entitled to the benefit of, and are not subject to, any sinking fund. ", "extra_info": {"start": 282609, "end": 283072, "filename": "/path/to/file", "sha512": "my_hash"}}
{"text": " the initial redemption date on or after which we may redeem the notes or the repayment date or dates on which the holders may elect repayment of the notes; ", "extra_info": {"start": 170838, "end": 171264, "filename": "/path/to/file", "sha512": "my_hash"}}
{"text": " The notes are redeemable at Citigroup\"s option, in whole, but not in part, on or after September 27, 2022, at a redemption price equal to 100% of the principal amount of the notes plus accrued and unpaid interest thereon to, but excluding, the date of redemption. In addition, Citigroup may redeem the notes prior to maturity if changes involving United States taxation occur which could require Citigroup to pay additional amounts, as described under \"Description of Debt Securities — Payment of Additional Amounts\" and \"— Redemption for Tax Purposes\" in the accompanying prospectus. ", "extra_info": {"start": 66141, "end": 66884, "filename": "/path/to/file", "sha512": "my_hash"}}
{"text": " If specified in the applicable prospectus supplement, TIFSA may redeem the debt securities of any series, as a whole or in part, at TIFSA\"s option on and after the dates and in accordance with the terms established for such series, if any, in the applicable prospectus supplement. If TIFSA redeems the debt securities of any series, TIFSA also must pay accrued and unpaid interest, if any, to the date of redemption on such debt securities. ", "extra_info": {"start": 373531, "end": 374087, "filename": "/path/to/file", "sha512": "my_hash"}}
{"text":  "We will pay contingent interest on the convertible senior notes after they have been outstanding at least ten years, under certain conditions. We may redeem the convertible senior notes once they have been outstanding for ten years at a redemption price of 100% of the principal amount of the notes, payable in cash. The optional repurchase dates, the common stock price conversion threshold amounts and the ending date of the first six-month period contingent interest may be payable for the contingent convertible senior notes are as follows: ", "extra_info": {"start": 454678, "end": 456968, "filename": "/path/to/file", "sha512": "my_hash"}}

My patterns.jsonl looks like so:

{"label": "aliases", "pattern": [{"lower": "the notes"}]}
{"label": "aliases", "pattern": [{"lower": "the existing 2021 notes"}]}
{"label": "aliases", "pattern": [{"lower": "the exchange notes"}]}
{"label": "aliases", "pattern": [{"lower": "the series 2012c senior notes"}]}
{"label": "aliases", "pattern": [{"lower": "the 2024 first mortgage bonds"}]}

I start training with the following command:

prodigy ner.teach Aliases en_core_web_md paragraph_content.jsonl --patterns patterns.jsonl --label aliases

During the training process, I never saw a multiple token suggestion. In the very beginning, I saw selections of only parts of my entities, but in the end I did not pay attention anymore.

What am I doing wrong here? How could I set up the training to obtain valuable suggestions?

I think the problem here is that none of your patterns ever match – so all you get to see is the model's suggestions, which are completely random because it has no idea of your label "aliases" yet. The token-based patterns describe one token per dict – so for a pattern like {"lower": "the existing 2021 notes"}, spaCy / Prodigy will be looking for one token whose lowercase text matches "the existing 2021 notes", which will never be true, because that string consists of 4 tokens.

Instead, you could phrase the pattern like this:

{"label": "aliases", "pattern": [{"lower": "the"}, {"lower": "existing"}, {"lower": "2021"}, {"lower": "notes"}]}
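If you have many phrase-style entries like this, the conversion can be scripted. Here is a minimal sketch, assuming that splitting on whitespace is close enough to spaCy's tokenization for your phrases (it is not in general: punctuation and hyphens are tokenized differently). The helper name split_phrase_pattern is made up for this example and is not part of Prodigy or spaCy.

```python
import json


def split_phrase_pattern(entry):
    """Expand multi-word "lower" values into one token dict per word.

    Assumption: whitespace splitting approximates the tokenizer.
    """
    tokens = []
    for token_spec in entry["pattern"]:
        phrase = token_spec.get("lower")
        if phrase is not None and len(phrase.split()) > 1:
            # One dict per word, because each dict describes exactly one token
            tokens.extend({"lower": word} for word in phrase.split())
        else:
            # Already a single-token description, keep it unchanged
            tokens.append(token_spec)
    return {"label": entry["label"], "pattern": tokens}


original = '{"label": "aliases", "pattern": [{"lower": "the existing 2021 notes"}]}'
fixed = split_phrase_pattern(json.loads(original))
print(json.dumps(fixed))
```

You would run this over every line of patterns.jsonl and write the results to a new file.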

Also keep in mind that the idea of the patterns is to write "patterns", i.e. abstract descriptions of the tokens. This pattern here will match the exact string "the existing 2021 notes" – but unless this is a super common phrase in your data, it likely won't produce good results.

Instead, you could take advantage of the other token attributes accepted by the Matcher – for example "is_digit": true to match tokens like "2021", but also "1999" or "10". Or "like_num": true, which would match "10" as well as "ten".

{"label": "aliases", "pattern": [{"is_digit": true}, {"lower": "notes"}]}
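The difference between these patterns can also be checked locally with spaCy's Matcher. This is a rough sketch, assuming spaCy v3 (where matcher.add takes a key and a list of patterns) and a blank English pipeline, which is enough here because the patterns only use lexical attributes. The example sentence and the keys "exact", "generic" and "single_token" are made up for illustration.

```python
import spacy
from spacy.matcher import Matcher

# Blank pipeline: tokenizer only, no trained model required
nlp = spacy.blank("en")
matcher = Matcher(nlp.vocab)

# One dict per token: matches the exact four-token phrase
matcher.add("exact", [[{"LOWER": "the"}, {"LOWER": "existing"},
                       {"LOWER": "2021"}, {"LOWER": "notes"}]])
# More abstract: any all-digit token followed by "notes"
matcher.add("generic", [[{"IS_DIGIT": True}, {"LOWER": "notes"}]])
# The original mistake: a whole phrase inside a single token dict
matcher.add("single_token", [[{"LOWER": "the existing 2021 notes"}]])

doc = nlp("We may redeem the existing 2021 notes and the 2024 notes.")

# Collect the matched text per pattern key
found = {}
for match_id, start, end in matcher(doc):
    key = nlp.vocab.strings[match_id]
    found.setdefault(key, []).append(doc[start:end].text)

for key, spans in found.items():
    print(key, spans)
```

The "single_token" pattern should produce no matches at all, which is the behaviour behind the missing multi-token suggestions.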

To test your patterns interactively and check whether they match the way you expect them to, check out our interactive matcher demo:

Finally, I'm not 100% sure the entity definition you're going for here makes sense. Named entities should be internally consistent categories of "real world objects" or concepts, ideally even proper nouns. In your case, the patterns describe pretty long phrases and sentence fragments. Teaching the existing model that sort of definition will be really difficult.

Instead, you might want to consider focusing on improving the existing predictions of the smaller components and then using rules or the dependency parse to resolve the rest of the phrase (if the desired result is "the existing 2021 notes"). For example, the model already has a pretty solid definition of DATE and ORDINAL numbers. So instead of trying to teach it a completely different analysis, you could work on improving these predictions and ideally, also the parser on your specific data. You can then use the dependency parse to get the rest: "2021" refers to "notes" and the head of this phrase is "existing", and its article "the". This is a much better approach than framing this as a named entity recognition task.

These threads go into more detail on statistical predictions vs. rules:

Also separately linking @honnibal's talk on how to define NLP problems and solve them through iteration. It shows some examples of using Prodigy, and discusses approaches for framing different kinds of problems and finding out whether something is an NER task or maybe a better fit for text classification, or a combination of statistical and rule-based systems.

Basically, as I’m currently only assessing the usefulness of Prodigy, I am trying to find a more efficient, more generic way to extract text spans from texts. Up to now we have used either regular expressions or Stanford NLP’s NER training, but both have shortcomings I want to overcome with Prodigy and spaCy.

I would like to save some improvements for when we have finally bought Prodigy and I am no longer facing the upcoming end of the trial period :-).

You're totally right. Please excuse my error here. I adapted my patterns. However, I am still only very rarely prompted with multi-token annotations, although I have already seen some texts that should match some of my patterns.

Maybe, to put it a bit differently: What would be the best way to introduce a new multi-token entity? It seems as if ner.manual does not work and ner.teach using patterns seems to have issues.

By match, do you mean exact matches or similar matches?

In the beginning, the model knows nothing, so it will usually suggest random single tokens. As you show it more examples of multiple token entities, it will adjust to that concept – but this usually requires a decent amount of examples. So what the patterns are trying to achieve is to help you get over the "cold start problem", i.e. to provide enough positive examples so that the model can learn something meaningful from your data and suggest examples that you can accept/reject to move it closer to the intended entity definition.

ner.teach with patterns would be the best way – assuming your patterns are generic enough and produce enough positive examples, and that the entity type is actually something the statistical model is able to learn from the context.

If you just want to find phrases in an "enhanced regular expressions" kind of way without training a model in the loop, you can try ner.match instead. This will do the most straightforward thing: find pattern matches and ask you to accept or reject them. That type of matching is especially powerful if you take full advantage of the available token attributes, as I described above. The data you create with this recipe can then be used to train a model later on.

If the spans you're trying to extract aren't typical "entities" but rather text fragments, it's usually a good idea to take a step back and focus on the smaller components – for example, dates, numbers, company names, persons etc. You can then improve the existing entity predictions and use the dependency parse tree to extract the full phrases you're interested in.

Here's an example using one of your texts:

A well-trained entity recognizer should predict "August 15, 2018" as a DATE entity. If you extract that Span object with spaCy, you can look at the surrounding tokens and the tokens that entity is attached to:

  • the date is attached to the adposition "on", as the prepositional object (pobj)
  • that prepositional phrase is attached to the verb "mature", the head of the sentence
  • the subject of mature is the noun "notes" with the article "the"
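The walk described in the list above can be sketched in code. To keep the example independent of any trained model, the parse and the DATE entity are written out by hand here as an assumption, using the Doc constructor from spaCy v3 (heads are absolute token indices). In practice they would come from running your pipeline on the text, and the predicted parse may differ.

```python
from spacy.tokens import Doc
from spacy.vocab import Vocab

words = ["The", "notes", "will", "mature", "on", "August", "15", ",", "2018", "."]
# Hand-written annotations (assumption): head index, dependency label and
# IOB entity tag for each token, mirroring the analysis described above
heads = [1, 3, 3, 3, 3, 4, 5, 5, 5, 3]
deps = ["det", "nsubj", "aux", "ROOT", "prep", "pobj",
        "nummod", "punct", "nummod", "punct"]
ents = ["O", "O", "O", "O", "O", "B-DATE", "I-DATE", "I-DATE", "I-DATE", "O"]
doc = Doc(Vocab(), words=words, heads=heads, deps=deps, ents=ents)

date = doc.ents[0]              # the DATE span "August 15, 2018"
preposition = date.root.head    # the date is the pobj of "on"
verb = preposition.head         # "on" attaches to the verb "mature"

# The subject of the verb, expanded to its full subtree (article + noun)
subject = [tok for tok in verb.children if tok.dep_ == "nsubj"][0]
phrase = doc[subject.left_edge.i : subject.right_edge.i + 1]

print(date.root.dep_, preposition.text, verb.text, "->", phrase.text)
```

Going through left_edge and right_edge picks up the article and any modifiers attached to the noun, so the same walk would return a longer phrase such as "the existing 2021 notes" if the parser attaches those tokens to "notes".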

Thank you very much for these outstanding answers. I will need some time to work through them with the attention they deserve.