EntityRuler and ner.match - different behavior

Hello,

I am trying to identify a new entity type, ISSUER, which is similar to ORG. I tried using spaCy's EntityRuler… it did identify the ISSUERS based on the patterns.

Following is the output:

[screenshot of the EntityRuler output]

In this context, State of California and County of Los Angeles are not ISSUERS… in some other cases they may be.

To address the above problem, I wanted to use ner.match to get annotated data that can be used for training.

I used the following command for annotation:
python -m prodigy ner.match issuer_ner en_core_web_sm cover_page_sentences.txt --patterns issuer.jsonl

The cover_page_sentences.txt file has 57K lines.

Following is an excerpt from the above file:

NEW ISSUE Book Entry Only Moody s: Aa2 (stable outlook) S&P: AA- (stable outlook) (See Ratings herein) In the opinion of Bond Counsel, interest on the Series 2015 Bonds is, under law existing and in effect as of the date of the original issuance of the Series 2015 Bonds, (i) excluded from gross income of the holders thereof for purposes of federal income taxation, subject to the qualifications described herein under the heading TAX MATTERS , (ii) not an item of tax preference for purposes of the federal alternative minimum tax imposed on individuals and corporations; such interest, however, is includable in the adjusted current earnings in computing the federal alternative minimum tax imposed on certain corporations and (iii) exempt from present State of Alabama income taxation.

See TAX MATTERS herein for further information and certain other federal tax consequences arising with respect to the Series 2015 Bonds. $55,855,000 THE ALABAMA PUBLIC HEALTH CARE AUTHORITY LEASE REVENUE BONDS (DEPARTMENT OF PUBLIC HEALTH FACILITIES), SERIES 2015 Dated: Delivery Date Due: September 1, as shown on the inside cover The Series 2015 Bonds are limited obligations of the Issuer payable from rental payments to be received by the Issuer from the alabama Department of Public Health, a department of the State of Alabama (the Lessee ), pursuant to a Lease Agreement, dated as of September 1, 2005, as previously amended, and as supplemented by a First Supplement to Lease Agreement, dated as of March 1, 2015.

issuer.jsonl has around 2k patterns in the following format:

{"label": "ISSUER", "pattern": "alabama federal aid highway finance authority"}
{"label": "ISSUER", "pattern": "alabama incentives financing authority"}
{"label": "ISSUER", "pattern": "alabama power co"}
{"label": "ISSUER", "pattern": "alabama public health care authority"}
{"label": "ISSUER", "pattern": "alabama public school & college authority"}

Based on the documentation, I expected the following behavior:

  • Annotation would stream each line from cover_page_sentences.txt

  • ISSUER would be highlighted in the text based on the patterns defined in the issuer.jsonl file

However, the following is what I see happening:

  • The first annotation task I get is the following - it is line 53,541, not the first line from the stream file
    [screenshot of the first annotation task]
    The first question I have is: why is the streaming not starting from the first line?

After streaming four or five sentences, I get the “No tasks available” message.

Right now, I am stuck… not sure how to solve this.

Please help!

Your workflow sounds correct and there’s actually very little magic going on in the ner.match recipe. Is your dataset empty? One possible explanation for lines being skipped is that examples are already present in the dataset and are filtered out.

Another thing I noticed in your patterns: they're all lowercase, and if you're mapping "pattern" to a string, those will be exact string matches. So maybe "merill lynch" is actually the first lowercase spelling the recipe encounters, and therefore the first match? If you do want case-insensitive matches, you might want to consider using the token-based syntax with entries like [{"lower": "merill"}, {"lower": "lynch"}]. You can convert them automatically like this:

# assumes nlp is a loaded spaCy pipeline and patterns is the list of pattern dicts
for line in patterns:
    doc = nlp.make_doc(line["pattern"])
    token_pattern = [{"lower": token.lower_} for token in doc]
    line["pattern"] = token_pattern  # replace the string pattern with the token pattern
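If it helps, here is a rough end-to-end sketch of that conversion applied to the whole patterns file (just an illustration; the output file name issuer_token_patterns.jsonl is made up):

import json
import spacy

# a blank English pipeline is enough here, since only the tokenizer is used;
# you could also load en_core_web_sm so the tokenization matches your model
nlp = spacy.blank("en")

with open("issuer.jsonl", encoding="utf8") as f_in, \
        open("issuer_token_patterns.jsonl", "w", encoding="utf8") as f_out:
    for raw_line in f_in:
        entry = json.loads(raw_line)
        doc = nlp.make_doc(entry["pattern"])
        # replace the exact-string pattern with a case-insensitive token pattern
        entry["pattern"] = [{"lower": token.lower_} for token in doc]
        f_out.write(json.dumps(entry) + "\n")

The converted file can then be passed to ner.match via --patterns as before.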

Another suggestion: If you do have a model with an entity ruler, you can also just load that into ner.make-gold. That recipe will stream whatever is present in the doc.ents for a given label – so that will also include your entity ruler matches.
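For example, a rough sketch of that setup (assuming spaCy v2.x; the output path ./model_with_ruler is just a placeholder):

import spacy
from spacy.pipeline import EntityRuler

nlp = spacy.load("en_core_web_sm")
# load the ISSUER patterns into an entity ruler and add it to the pipeline
ruler = EntityRuler(nlp).from_disk("issuer.jsonl")
nlp.add_pipe(ruler, before="ner")
# save the whole pipeline so Prodigy can load it like any other model
nlp.to_disk("./model_with_ruler")

You could then run something along the lines of:

python -m prodigy ner.make-gold issuer_ner ./model_with_ruler cover_page_sentences.txt --label ISSUER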

Finally, I’m not sure what your end goal is, but if your plan is to update an existing named entity recognizer, this could potentially be quite difficult and will likely require a lot of data. Many of the entities that you consider an ISSUER are things that the model would previously have predicted as ORG or maybe LOCATION or GPE. So you’d essentially be working against pretty much all the existing weights, which were trained on ~2 million words. So you might actually find that it’s easier to train a model from scratch – and maybe take advantage of the existing predictions to help you label your data.


Thanks for the quick reply!

I will try out your suggestions.

Another thing I tried is ner.manual - it streams all the sentences in sequence. This is perplexing.

If I choose to take advantage of the existing ORG predictions, can I still specify the patterns? I pretty much know all the names of the ISSUERS. And can I reject all the other ORG entities?

Hi @ines, I've got a few questions to clarify.

  1. Should the patterns in EntityRuler be constructed GENERALLY to capture a wide coverage of text spans (e.g. {"IS_ALPHA": True}, {"IS_PUNCT": True}), or should they be constructed in a specific manner (e.g. {"LOWER": "hello"}, {"LOWER": "world"})?
    I ask this because the pipeline, after all, has a statistical model that can predict general cases pretty well already (I'm referring to training the same LABEL with the rule-based and statistical approaches at the same time).

  2. Should EntityRuler be placed before, or after 'ner' in the pipeline?

  3. In Prodigy's recipe ner.make-gold, when I make changes to the model's predictions, does it update the 'ner' component without interfering with the EntityRuler?

Thanks!

@Anji.Vaidyula It really sounds like the problem is the capitalisation then!

Are you sure you want to train a statistical model for this, then? If you haven't done it already, it might be worth running a quick evaluation to see what your baseline is. For instance, if you're getting to a 95% accuracy using only your rules, training a model may be kind of a waste of time.

Oh, I meant you could do something like: extract all ORG entities and save the result in Prodigy's JSONL. Then change the label from ORG to ISSUER and load it into Prodigy and annotate the examples. Even if only 25% of those orgs are issuers, that's still 25% less manual work for you.
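Just to illustrate the idea, here's a minimal sketch of that pre-labelling step (the output file name org_as_issuer.jsonl is made up):

import json
import spacy

nlp = spacy.load("en_core_web_sm")

with open("cover_page_sentences.txt", encoding="utf8") as f_in, \
        open("org_as_issuer.jsonl", "w", encoding="utf8") as f_out:
    for doc in nlp.pipe(line.strip() for line in f_in):
        # keep the model's ORG predictions, but relabel them as ISSUER
        spans = [
            {"start": ent.start_char, "end": ent.end_char, "label": "ISSUER"}
            for ent in doc.ents
            if ent.label_ == "ORG"
        ]
        if spans:
            f_out.write(json.dumps({"text": doc.text, "spans": spans}) + "\n")

You can then load that file into Prodigy and accept, reject or correct the pre-labelled spans.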

This depends on what you're trying to do. The very abstract patterns are unlikely to accurately capture the spans you're looking for. So it really comes down to what gives you the best results.

Both are possible and have different implications. See the documentation here: Rule-based matching · spaCy Usage Documentation

The entity ruler is designed to integrate with spaCy’s existing statistical models and enhance the named entity recognizer. If it’s added before the "ner" component, the entity recognizer will respect the existing entity spans and adjust its predictions around it. This can significantly improve accuracy in some cases. If it’s added after the "ner" component, the entity ruler will only add spans to the doc.ents if they don’t overlap with existing entities predicted by the model. To overwrite overlapping entities, you can set overwrite_ents=True on initialization.
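For instance, the two placements described above would roughly look like this (a sketch assuming spaCy v2.x; pick one of the two options):

import spacy
from spacy.pipeline import EntityRuler

nlp = spacy.load("en_core_web_sm")

# Option 1: before "ner" - the statistical NER respects the ruler's spans
ruler = EntityRuler(nlp)
nlp.add_pipe(ruler, before="ner")

# Option 2: after "ner" - only non-overlapping spans are added, unless the
# ruler is allowed to overwrite the model's entities:
# ruler = EntityRuler(nlp, overwrite_ents=True)
# nlp.add_pipe(ruler, after="ner")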

If you just annotate, the model won't be changed at all. If you do use that data for training later on, the statistical model will be updated, not the entity ruler (which is really just a collection of static rules). My comment here explains some of this in more detail:

ines:

Are you sure you want to train a statistical model for this, then?

The problem that I am trying to solve is: some of the names are not ISSUERS - it depends on the context. In the following example, LOS ANGELES UNIFIED SCHOOL DISTRICT is the ISSUER. State of California and County of Los Angeles are not... however in some cases, State of California would be the ISSUER.

If I use the PhraseMatcher, it will highlight all three, and then I need to figure out which one is the true ISSUER. To resolve this, I thought I would train a model to identify the ISSUER based on context and bootstrap the annotation with the ISSUER list. Is there a better way to solve this problem?

ines:

@Anji.Vaidyula It really sounds like the problem is the capitalisation then!

Yes, it is indeed a capitalization problem. Once I tokenized the patterns as per your suggestion, it worked great! Thanks.