Hello,
I am trying to identify a new entity type, ISSUER, which is similar to ORG. I tried using spaCy's EntityRuler, and it did identify the ISSUERs based on the patterns.
The following is the output:
In this context, State of California and County of Los Angeles are not ISSUERs, although in some other cases they may be.
To address this problem, I wanted to use ner.match to collect annotated data that can be used for training.
I used the following command for annotation:
python -m prodigy ner.match issuer_ner en_core_web_sm cover_page_sentences.txt --patterns issuer.jsonl
The cover_page_sentences.txt file has 57K lines.
The following is an excerpt from that file:
NEW ISSUE Book Entry Only Moody s: Aa2 (stable outlook) S&P: AA- (stable outlook) (See Ratings herein) In the opinion of Bond Counsel, interest on the Series 2015 Bonds is, under law existing and in effect as of the date of the original issuance of the Series 2015 Bonds, (i) excluded from gross income of the holders thereof for purposes of federal income taxation, subject to the qualifications described herein under the heading TAX MATTERS , (ii) not an item of tax preference for purposes of the federal alternative minimum tax imposed on individuals and corporations; such interest, however, is includable in the adjusted current earnings in computing the federal alternative minimum tax imposed on certain corporations and (iii) exempt from present State of Alabama income taxation.
See TAX MATTERS herein for further information and certain other federal tax consequences arising with respect to the Series 2015 Bonds. $55,855,000 THE ALABAMA PUBLIC HEALTH CARE AUTHORITY LEASE REVENUE BONDS (DEPARTMENT OF PUBLIC HEALTH FACILITIES), SERIES 2015 Dated: Delivery Date Due: September 1, as shown on the inside cover The Series 2015 Bonds are limited obligations of the Issuer payable from rental payments to be received by the Issuer from the alabama Department of Public Health, a department of the State of Alabama (the Lessee ), pursuant to a Lease Agreement, dated as of September 1, 2005, as previously amended, and as supplemented by a First Supplement to Lease Agreement, dated as of March 1, 2015.
issuer.jsonl has around 2k patterns in the following format:
{"label": "ISSUER", "pattern": "alabama federal aid highway finance authority"}
{"label": "ISSUER", "pattern": "alabama incentives financing authority"}
{"label": "ISSUER", "pattern": "alabama power co"}
{"label": "ISSUER", "pattern": "alabama public health care authority"}
{"label": "ISSUER", "pattern": "alabama public school & college authority"}
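For reference, a minimal sketch of how such a patterns file could be generated with plain Python, so that every line is valid JSON with straight quotes. It also shows the token-based form with "lower" keys, which spaCy-style pattern files accept and which matches regardless of case. The helper names are hypothetical, and splitting on whitespace is only an assumption about how the tokenizer would split these names.

```python
import json


def phrase_to_token_pattern(label, phrase):
    """Build one token-based pattern entry that ignores case.

    Assumption: splitting on whitespace approximates the tokenizer;
    names with punctuation attached to a word may tokenize differently.
    """
    tokens = [{"lower": word} for word in phrase.lower().split()]
    return {"label": label, "pattern": tokens}


def to_jsonl(label, phrases):
    """Serialize one pattern per line, as a patterns .jsonl file expects."""
    entries = (phrase_to_token_pattern(label, p) for p in phrases)
    return "\n".join(json.dumps(entry) for entry in entries)


# Two issuer names taken from the list above.
issuers = ["alabama power co", "alabama public health care authority"]
jsonl_text = to_jsonl("ISSUER", issuers)
print(jsonl_text)
```

Writing jsonl_text to a file would give a case-insensitive variant of issuer.jsonl; whether that is the right fix here is untested.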
Based on the documentation, I expected the following behavior:
- Annotation would stream each line from cover_page_sentences.txt
- ISSUER would be highlighted in the text based on the patterns defined in the issuer.jsonl file
However, this is what I see happening:
- The first annotation task I get is line 53541 of the input file, not the first line.
My first question is: why is the stream not starting from the first line?
After streaming four or five sentences, I get the "No tasks available" message.
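One way to narrow this down is to check, outside Prodigy, which input lines contain any of the pattern strings at all. The sketch below uses plain substring matching, which is only a rough stand-in for what ner.match does internally, and compares exact-case matching with case-insensitive matching. The function names and the two sample lines are illustrative assumptions.

```python
import json


def load_phrases(jsonl_lines):
    """Collect the string patterns from lines of a patterns .jsonl file."""
    phrases = []
    for line in jsonl_lines:
        line = line.strip()
        if not line:
            continue
        entry = json.loads(line)
        if isinstance(entry["pattern"], str):
            phrases.append(entry["pattern"])
    return phrases


def matching_lines(text_lines, phrases, ignore_case):
    """Return (line_number, phrase) for each line containing a phrase."""
    hits = []
    for number, line in enumerate(text_lines, start=1):
        haystack = line.lower() if ignore_case else line
        for phrase in phrases:
            needle = phrase.lower() if ignore_case else phrase
            if needle in haystack:
                hits.append((number, phrase))
                break  # one hit per line is enough for this check
    return hits


# Small in-memory sample based on the excerpt above.
sample_patterns = [
    '{"label": "ISSUER", "pattern": "alabama public health care authority"}',
]
sample_text = [
    "In the opinion of Bond Counsel, interest on the Series 2015 Bonds is excluded.",
    "$55,855,000 THE ALABAMA PUBLIC HEALTH CARE AUTHORITY LEASE REVENUE BONDS",
]

phrases = load_phrases(sample_patterns)
exact_hits = matching_lines(sample_text, phrases, ignore_case=False)
loose_hits = matching_lines(sample_text, phrases, ignore_case=True)
print(len(exact_hits), "exact-case hits;", len(loose_hits), "case-insensitive hits")
```

To run it on the real data, the open file objects for issuer.jsonl and cover_page_sentences.txt can be passed in place of the sample lists. If the exact-case count is very small and the first hit falls late in the file, that would be consistent with lowercase patterns failing to match the upper-case names on the cover pages.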
Right now, I am stuck… not sure how to solve this.
Please help!