Patterns and custom NER

Hello,
I want to create my own label for named entity extraction. I have a set of terms, for example "surcharge", "tax", "tax liability".
Just want to be clear on the steps:

1. Create a dataset, Mydatabase
2. Create a JSONL file with records like {"label": "MYLABEL", "pattern": [{"lower": "surcharge"}]}
3. Call ner.teach Mydatabase en_core_web_lg trainingfile.txt --label MYLABEL --patterns myjson.jsonl
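For reference, the patterns file in step 2 could be generated with a few lines of Python rather than written by hand. This is just a sketch using the example terms from this thread and the filename myjson.jsonl from the command above; multi-word terms are split into one dictionary per token.

```python
import json

# Example terms from this thread; "tax liability" becomes a two-token pattern
terms = ["surcharge", "tax", "tax liability"]

with open("myjson.jsonl", "w", encoding="utf-8") as f:
    for term in terms:
        # One pattern dict per whitespace-separated token
        record = {"label": "MYLABEL", "pattern": [{"lower": t} for t in term.split()]}
        f.write(json.dumps(record) + "\n")
```

Each line of the resulting file is one JSON record, which is exactly the JSONL shape the --patterns option expects.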

couple of questions

  1. Where can I find a complete list of patterns in the documentation? Is this in the spaCy or the Prodigy docs?
  2. If I want MYLABEL to tag two consecutive words together, e.g. "limited liability" (an n-gram of size two), can I chain patterns?
  3. Since I am not building a --seed set but rather using a set of predefined discrete words based on patterns, do I need en_core_web_lg? I see you use this only in the context of expanding the seed set.

Sorry for the delayed response – answers below! Your workflow looks good and is all correct, btw.

Yes, you can find examples of the JSONL format in the "Input formats" section of the Prodigy README: PRODIGY_README.html#match-patterns. Since the "patterns" key can have the same format as spaCy's match patterns, you can also check out the spaCy docs on rule-based matching for more examples. Keep in mind that the spaCy examples are written in Python, so you'll have to convert them to JSON first (double quotes, lowercase true and false, etc.).
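One way to avoid doing that conversion by hand: write the pattern as a Python literal and let json.dumps produce the valid JSON for you (it emits double quotes and lowercases True/False automatically). The pattern below is just an illustrative example:

```python
import json

# A match pattern written as a normal Python literal
record = {"label": "MYLABEL", "pattern": [{"lower": "tax"}, {"is_punct": False}]}

# json.dumps turns Python's False into JSON's false and uses double quotes
print(json.dumps(record))
# → {"label": "MYLABEL", "pattern": [{"lower": "tax"}, {"is_punct": false}]}
```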

A pattern is a list of dictionaries where one dictionary describes one token. So to match "limited liability", your pattern entry could look like this:

{"label": "MYLABEL", "pattern": [{"lower": "limited"}, {"lower": "liability"}]}

This will match "limited liability", "Limited Liability", "LiMiTeD LiAbIlItY" etc. You can also use "orth" to match on exact strings instead of the lowercase form, or use other token attributes like is_punct or the token's shape.
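To make the token-by-token matching logic concrete, here's a toy illustration. This is not spaCy's actual Matcher (it just splits on whitespace), but it shows how a list of dictionaries is checked against consecutive tokens, and how "lower" and "orth" differ:

```python
def matches(pattern, text):
    """Toy matcher: does the token pattern match anywhere in text?

    Each pattern entry is a dict like {"lower": "limited"} (case-insensitive)
    or {"orth": "Limited"} (exact string). Tokens are naive whitespace splits.
    """
    tokens = text.split()
    n = len(pattern)
    for start in range(len(tokens) - n + 1):
        window = tokens[start:start + n]
        if all(
            ("lower" not in p or tok.lower() == p["lower"]) and
            ("orth" not in p or tok == p["orth"])
            for p, tok in zip(pattern, window)
        ):
            return True
    return False

pattern = [{"lower": "limited"}, {"lower": "liability"}]
print(matches(pattern, "a LiMiTeD LiAbIlItY company"))   # True: "lower" ignores case
print(matches([{"orth": "Limited"}], "limited liability"))  # False: "orth" is exact
```

The key point is that a two-entry pattern only matches two consecutive tokens, which is exactly what the "limited liability" example above relies on.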

Yes, in the example, we mostly use the lg model because it includes more word vectors for bootstrapping. If you don't use this step, you can also choose a different model. Especially for the first experiments, it's often easier to use the sm models, as they load and serialize faster.