Patterns and custom NER

Hello,
I want to create my own label for named entity extraction. I have a set of terms, for example "surcharge", "tax", "tax liability".
Just want to be clear on the steps:

1. Create a dataset, Mydatabase
2. Create a JSONL file with records like {"label": "MYLABEL", "pattern": [{"lower": "surcharge"}]}
3. Call ner.teach Mydatabase en_core_web_lg trainingfile.txt --label MYLABEL --patterns myjson.jsonl
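For reference, the patterns file in step 2 could be generated with a few lines of Python rather than written by hand. This is just a sketch using the example terms from this thread and the filename myjson.jsonl from the command above; multi-word terms are split into one dictionary per token.

```python
import json

# Example terms from this thread; "tax liability" becomes a two-token pattern
terms = ["surcharge", "tax", "tax liability"]

with open("myjson.jsonl", "w", encoding="utf-8") as f:
    for term in terms:
        # One pattern dict per whitespace-separated token
        record = {"label": "MYLABEL", "pattern": [{"lower": t} for t in term.split()]}
        f.write(json.dumps(record) + "\n")
```

Each line of the resulting file is one JSON record, which is exactly the JSONL shape the --patterns option expects.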

couple of questions

  1. Where can I find a complete list of patterns in the documentation? Is this in the spaCy or the Prodigy docs?
  2. If I want MYLABEL to tag two consecutive words together, e.g. "limited liability" (an n-gram of size two), can I chain patterns?
  3. Since I am not building a --seed set but rather using a set of predefined discrete words based on patterns, do I need en_core_web_lg? I see you use this only in the context of expanding the seed set.

Sorry for the delayed response – answers below! Your workflow looks good and is all correct, btw.

Yes, you can find examples of the JSONL format in the "Input formats" section of the Prodigy README: PRODIGY_README.html#match-patterns. Since the "patterns" key can have the same format as spaCy's match patterns, you can also check out the spaCy docs on rule-based matching for more examples. Keep in mind that the spaCy examples are written in Python, so you'll have to convert them to JSON first (double quotes, lowercase true and false, etc.).
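One way to avoid doing that conversion by hand: write the pattern as a Python literal and let json.dumps produce the valid JSON for you (it emits double quotes and lowercases True/False automatically). The pattern below is just an illustrative example:

```python
import json

# A match pattern written as a normal Python literal
record = {"label": "MYLABEL", "pattern": [{"lower": "tax"}, {"is_punct": False}]}

# json.dumps turns Python's False into JSON's false and uses double quotes
print(json.dumps(record))
# → {"label": "MYLABEL", "pattern": [{"lower": "tax"}, {"is_punct": false}]}
```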

A pattern is a list of dictionaries where one dictionary describes one token. So to match "limited liability", your pattern entry could look like this:

{"label": "MYLABEL", "pattern": [{"lower": "limited"}, {"lower": "liability"}]}

This will match "limited liability", "Limited Liability", "LiMiTeD LiAbIlItY" etc. You can also use "orth" to match on exact strings instead of the lowercase form, or use other token attributes like is_punct or the token's shape.
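To make the token-by-token matching logic concrete, here's a toy illustration. This is not spaCy's actual Matcher (it just splits on whitespace), but it shows how a list of dictionaries is checked against consecutive tokens, and how "lower" and "orth" differ:

```python
def matches(pattern, text):
    """Toy matcher: does the token pattern match anywhere in text?

    Each pattern entry is a dict like {"lower": "limited"} (case-insensitive)
    or {"orth": "Limited"} (exact string). Tokens are naive whitespace splits.
    """
    tokens = text.split()
    n = len(pattern)
    for start in range(len(tokens) - n + 1):
        window = tokens[start:start + n]
        if all(
            ("lower" not in p or tok.lower() == p["lower"]) and
            ("orth" not in p or tok == p["orth"])
            for p, tok in zip(pattern, window)
        ):
            return True
    return False

pattern = [{"lower": "limited"}, {"lower": "liability"}]
print(matches(pattern, "a LiMiTeD LiAbIlItY company"))   # True: "lower" ignores case
print(matches([{"orth": "Limited"}], "limited liability"))  # False: "orth" is exact
```

The key point is that a two-entry pattern only matches two consecutive tokens, which is exactly what the "limited liability" example above relies on.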

Yes, in the example, we mostly use the lg model because it includes more word vectors for bootstrapping. If you don't use this step, you can also choose a different model. Especially for the first experiments, it's often easier to use the sm models, as they load and serialize faster.