Training a new parser or NER using a model with no lexeme normalization table. This may degrade the performance of the model to some degree...

I've created my own Language subclass since I need a custom tokenizer. I've basically copied the English language and overridden create_tokenizer:

from spacy.language import Language
from spacy.lang.en import EnglishDefaults
from spacy.tokenizer import Tokenizer
from spacy.attrs import LANG
from spacy.util import compile_infix_regex, compile_prefix_regex, compile_suffix_regex

currencies = [
    "DKK",
    "SEK",
    "NOK",
    "GBP",
    "EUR",
    "USD",
    "CHF",
]


def _return_en_fin(_):
    return "en_fin"


class EnglishFinanceDefaults(EnglishDefaults):
    # copy the dict so the parent EnglishDefaults isn't mutated in place
    lex_attr_getters = dict(EnglishDefaults.lex_attr_getters)
    lex_attr_getters[LANG] = _return_en_fin

    @classmethod
    def create_tokenizer(cls, nlp=None) -> Tokenizer:
        prefixes = cls.prefixes + (
            r"[1-4][Qq]",
            "[Qq][1-4]",
            "[Hh]1",
            "1[Hh]",
            *[rf"{ccy}" for ccy in currencies],
            r"[\/'¹\[\]~]",
            r"-(?=\D)",  # tokenize "-" unless its a number
        )
        infixes = cls.infixes + (
            r"(?<=\d\d)[a-zA-Z]+",  # 2018Jan
            r"[\/'¹\[\]%]",  # 3/19, 18'4, US$
            r"(?<=\S)-",  # 1-Feb, Jan-30
            r"(?<=\d)(bn|BN|Bn|m|M|b|B)",
            r"[$£€]",
        )
        suffixes = cls.suffixes + (
            r"[1-4][Qq]",
            "[Qq][1-4]",
            "[Hh]1",
            "1[Hh]",
            *[rf"{ccy}" for ccy in currencies],
            r"[-\/'¹\[\]]",
            r"202[01]",
        )

        tokenizer = EnglishDefaults.create_tokenizer(nlp)
        tokenizer.prefix_search = compile_prefix_regex(prefixes).search
        tokenizer.infix_finditer = compile_infix_regex(infixes).finditer
        tokenizer.suffix_search = compile_suffix_regex(suffixes).search

        return tokenizer


class EnglishFinance(Language):
    lang = _return_en_fin("")
    Defaults = EnglishFinanceDefaults
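
For reference, this is roughly how I sanity-check the tokenizer (a quick smoke test; the exact token boundaries of course depend on the patterns above, so treat the comment as approximate):

nlp = EnglishFinance()
# e.g. "DKK" and "bn" should come out as separate tokens
print([t.text for t in nlp("Q1 2020 revenue grew 5% to DKK1.2bn")])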

Then I saved the language with to_disk and packaged it with spacy package. I've pip-installed the packaged language model and it all works so far. The issue comes when I start training a new NER model using my new packaged model:

❯ prodigy train-curve ner ner-period-date-month-year en_fin_model
✔ Starting with model 'en_fin_model'
Training 4 times with 25%, 50%, 75%, 100% of the data

=============================== ✨  Train curve ===============================
%      Accuracy   Difference
----   --------   ----------
/home/nixd/.cache/pypoetry/virtualenvs/annotator-U3km5bEc-py3.8/lib/python3.8/site-packages/spacy/language.py:635: UserWarning: [W033] Training a new parser or NER using a model with no lexeme normalization table. This may degrade the performance of the model to some degree. If this is intentional or the language you're using doesn't have a normalization table, please ignore this warning. If this is surprising, make sure you have the spacy-lookups-data package installed. The languages with lexeme normalization tables are currently: da, de, el, en, id, lb, pt, ru, sr, ta, th.

I suspect it has to do with lex_attr_getters in EnglishFinanceDefaults, but I'm not sure what I'm supposed to do instead?

Question number 2

Let's say that I've labeled four different labels but I only care about the performance on one or two of them. Is there an easy way to ignore some labels (or to just check performance on the labels of interest)?

To answer my own question on the second part: I created my own helper recipe command.

from typing import List

from prodigy.core import recipe
from prodigy.components.db import connect
from prodigy.util import get_labels, msg


@recipe(
    "db-keep-labels",
    in_set=("Name of new dataset to be copied", "positional", None, str),
    out_set=(
        "Name of new dataset for the copied (and stripped) data",
        "positional",
        None,
        str,
    ),
    labels=(
        "Comma-separated label(s) to keep in the new dataset",
        "positional",
        None,
        get_labels,
    ),
    dry=("Perform a dry run", "flag", "D", bool),
)
def db_keep_labels(in_set: str, out_set: str, labels: List[str], dry: bool = False):
    """
    Copy examples from one dataset to a new one, keeping only the provided labels
    """
    db = connect()

    if in_set not in db:
        msg.fail(f"Can't find dataset '{in_set}' in database", exits=1)
    if out_set in db:
        msg.fail(
            f"Output dataset '{out_set}' already exists and includes examples.",
            "This can lead to unexpected results. Please use a new dataset.",
            exits=1,
        )

    examples = db.get_dataset(in_set)

    # use .get() so examples without a "spans" key don't raise a KeyError
    new_data = [
        {**eg, "spans": [span for span in eg.get("spans", []) if span["label"] in labels]}
        for eg in examples
    ]
    if not dry:
        db.add_dataset(out_set)
        db.add_examples(new_data, datasets=[out_set])

    n_before = len([span for eg in examples for span in eg.get("spans", [])])
    n_after = len([span for eg in new_data for span in eg["spans"]])
    msg.good(
        f"Inserted {len(new_data)} examples, going from {n_before} to {n_after} spans",
        f"Created new dataset '{out_set}'",
    )
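
I run it like any other custom recipe, pointing Prodigy at the file with -F (the output dataset and label names here are just examples):

❯ prodigy db-keep-labels ner-period-date-month-year ner-period-date PERIOD,DATE -F db_keep_labels.py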

You shouldn't need to create a new language to customize the tokenizer. Install spacy-lookups-data so the lexeme normalization table is available, create a new blank English model with spacy.blank("en"), customize the tokenizer however you'd like, and save it with nlp.to_disk("/path/to/model"). Then provide this path as the model argument for Prodigy.
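
Something like this, as a minimal sketch (spaCy v2 API, reusing just a couple of your patterns from above for illustration):

import spacy
from spacy.util import compile_infix_regex, compile_prefix_regex, compile_suffix_regex

nlp = spacy.blank("en")  # plain "en", so the spacy-lookups-data tables still apply

# extend the default patterns with your extra ones (only a few shown here)
prefixes = tuple(nlp.Defaults.prefixes) + (r"[1-4][Qq]", r"[Qq][1-4]", "DKK", "EUR", "USD")
infixes = tuple(nlp.Defaults.infixes) + (r"(?<=\S)-", r"(?<=\d)(bn|BN|Bn|m|M|b|B)")
suffixes = tuple(nlp.Defaults.suffixes) + (r"[1-4][Qq]", r"202[01]")

nlp.tokenizer.prefix_search = compile_prefix_regex(prefixes).search
nlp.tokenizer.infix_finditer = compile_infix_regex(infixes).finditer
nlp.tokenizer.suffix_search = compile_suffix_regex(suffixes).search

nlp.to_disk("/path/to/model")  # use this path as the model argument for prodigy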

You can use an English model without the normalization table, but in our experience it does help the performance slightly because it covers a lot of spelling variants. You may not benefit from this, though, depending on your data.
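
If you want to check whether the table is actually being picked up, you can inspect the vocab's lookups (spaCy v2.2+ Lookups API):

import spacy

nlp = spacy.blank("en")
# True if spacy-lookups-data is installed and the "en" tables were found
print(nlp.vocab.lookups.has_table("lexeme_norm"))
# NORM reflects the table, e.g. spelling variants map to a shared form
print([t.norm_ for t in nlp("I realise that colour matters")])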

So I already had spacy-lookups-data installed, but I assume it wouldn't link to my own custom language. I thought I had to create my own Language to be able to get my custom tokenizer through spacy.load, so it's great that I don't need to. Did this change recently?

And thank you for the answer!

The spacy-lookups-data tables are loaded based on the language (en), so you're right that if you have a custom language like en_whatever, it won't load anything. (You could add tables to a local install of spacy-lookups-data with the right entry points if you wanted, though.)
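
If you do want to keep the custom language, a simpler workaround than entry points is to copy the "en" table into your model's vocab before saving it. A sketch using the v2.2+ Lookups API (the import path for the custom language is hypothetical):

import spacy
from en_fin import EnglishFinance  # hypothetical import of the custom language above

en = spacy.blank("en")  # picks up the "en" tables from spacy-lookups-data
nlp = EnglishFinance()
if en.vocab.lookups.has_table("lexeme_norm"):
    table = en.vocab.lookups.get_table("lexeme_norm")
    nlp.vocab.lookups.add_table("lexeme_norm", dict(table))

nlp.to_disk("/path/to/en_fin_model")  # the lookups are serialized with the vocab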

spacy.load() can load from a directory or a package. If you're customizing things for prodigy, it can be easiest to work with a model in a local directory. (You can package it with spacy package and install it if you want, but it's not necessary.) This isn't anything new or different.
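
For example, with the dataset name from above (the model path is whatever directory you saved to):

import spacy

nlp = spacy.load("/path/to/model")  # a directory path works just like a package name

and on the command line:

❯ prodigy train-curve ner ner-period-date-month-year /path/to/model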