[Request] Best practice for bootstrapping data for training partially new named entities? (and a question about PhraseMatcher)

I'm building a new NER model. Is the plan below sane?
(There is also a question about PhraseMatcher further down.)

Some entities exist in the base model (e.g. GPE, LOC, DATE, TIME), and I'd like to use its predictions for those when creating a silver training set.

Some entities are new.
Of the new ones, several are found in custom databases (tens of thousands of named entities, each belonging to one of several categories).
The other new entities are domain-specific terms.

I would like to bootstrap the annotations with patterns to speed things up for training the new model.

Here's my plan (and things I haven't been able to do):

  • For the entities that exist in the base model, I can run the base model over a few hundred sample texts and annotate them with the existing model (then review these entities with ner.manual and e.g. --label GPE,LOC,DATE,TIME).
    (I guess at this point I could also use ner.teach with the same base model that generated them, though I don't expect it to be quicker, given that en_core_web_trf was trained on a lot more data than I can provide.)
    This works.

  • For the DB-based new entities, my plan was to run ner.manual with --patterns ./patt.jsonl after generating a patt.jsonl file that looks like this:
    {"label": "CUSTOM_L", "pattern": [{"LOWER": "token1"}, {"LOWER": "token2"}]}

    • I saw some comments how phrase matcher is faster, and tried to create a file that looks like this
      {"label": "CUSTOM_L", "pattern": "token1 token2"}
      {"label": "CUSTOM_L", "pattern": "Token1 Token2"}
      {"label": "CUSTOM_L", "pattern": "TOKEN1 TOKEN2"}
      but prodigy crapped out and wouldn't start even after 15 minutes. (it would start when I truncated that file to 1000 lines)
      i.e. this: prodigy ner.manual single_silver_ENT_A en_core_web_trf ./samples_s.jsonl --patterns ENT_A_PHRASES.jsonl --label ENT_A
      results in this
      Using 1 label(s): ENT_A
      and then nothing happens.
    • I thought of creating a pipeline inside Spacy and adding the PhraseMatcher, as a step in the pipeline similar to the entity_ruler - in the same capacity, but did not find a way to do it. Am I missing something ???
  • For the others, I can use the LLM-based label bootstrapping method.
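
For reference, a pattern file like the ones in the list above could be generated from the database names with a few lines of plain Python. This is only a sketch: `token_pattern` and `build_patterns` are hypothetical helper names, the sample names stand in for the real database, and splitting on whitespace is an assumption that will not always agree with spaCy's tokenizer (for example around punctuation or hyphens). Because the token patterns use `LOWER`, one entry covers all casing variants of a name, so the three-lines-per-name approach is not needed.

```python
import json


def token_pattern(name: str, label: str) -> dict:
    """Build one case-insensitive token pattern for an entity name.

    Assumption: whitespace splitting approximates spaCy's tokenization.
    """
    return {"label": label, "pattern": [{"LOWER": tok.lower()} for tok in name.split()]}


def build_patterns(names, label):
    """Drop names that differ only in casing or spacing, then emit patterns."""
    seen = set()
    patterns = []
    for name in names:
        key = " ".join(name.lower().split())
        if not key or key in seen:
            continue
        seen.add(key)
        patterns.append(token_pattern(name, label))
    return patterns


# Placeholder names standing in for rows pulled from the custom database.
names = ["Token1 Token2", "token1 token2", "TOKEN1 TOKEN2", "Other Name"]
patterns = build_patterns(names, "CUSTOM_L")

# One JSON object per line, i.e. the JSONL layout of patt.jsonl.
lines = [json.dumps(p) for p in patterns]
```

Writing `"\n".join(lines)` to a file gives a patterns file in the same shape as the token-pattern example above.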

The overall idea: create a few silver datasets, each with different entities, all on the same sample texts.
Then use silver to gold, then train, then correct.
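
As a rough illustration of combining several silver datasets over the same texts, the spans could be merged per text before review. This is a plain-Python sketch, not a Prodigy API: `merge_silver` is a hypothetical helper, examples are assumed to be dicts with a `text` key and a `spans` list of character offsets, and the rule that earlier datasets win on overlap is an arbitrary choice made for the example.

```python
def merge_silver(datasets):
    """Merge span annotations from several datasets, keyed by text.

    Assumption: when two spans overlap, the one seen first is kept,
    so datasets listed earlier take priority.
    """
    merged = {}
    for examples in datasets:
        for eg in examples:
            task = merged.setdefault(eg["text"], {"text": eg["text"], "spans": []})
            for span in eg.get("spans", []):
                overlaps = any(
                    span["start"] < kept["end"] and kept["start"] < span["end"]
                    for kept in task["spans"]
                )
                if not overlaps:
                    task["spans"].append(dict(span))
    for task in merged.values():
        task["spans"].sort(key=lambda s: s["start"])
    return list(merged.values())


text = "Acme opened an office in Paris."
paris = text.index("Paris")

# Silver set 1: spans from the base model. Silver set 2: spans from patterns.
base = [{"text": text, "spans": [{"start": paris, "end": paris + 5, "label": "GPE"}]}]
custom = [
    {
        "text": text,
        "spans": [
            {"start": 0, "end": 4, "label": "ENT_A"},
            {"start": paris, "end": paris + 5, "label": "ENT_A"},  # conflicts with GPE
        ],
    }
]

merged = merge_silver([base, custom])
labels = [span["label"] for span in merged[0]["spans"]]
```

The merged tasks still need human review; the sketch only decides which suggestion is shown when two sources disagree.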

(the nice diagram doesn't have silver to gold, or teach.. )

Hi @vish,

In general, combining various label bootstrapping methods the way you described it makes sense, for sure.

For the spaCy model-based preannotation:
Perhaps the easiest way to go about it would be to use ner.correct with en_core_web_trf. That would be pre-annotating and correcting in one step. How successful that will be depends on how similar your dataset is to the data used for training en_core_web_trf.
Since you need to train your model from scratch (your final label set is quite different from that of en_core_web_trf, and the amount of data is likely of a different order of magnitude), there's not much added value in using ner.teach with the pre-trained model. Conversely, once you've created your dataset with the help of the pre-trained model, it makes sense to train an initial custom model and then use ner.teach so that it improves on the examples that are still challenging.

For the patterns based pre-annotation:
If you have very many patterns, you might consider doing the pre-annotation via a script outside Prodigy, since you don't really need the web app to apply the pre-annotation logic. All you need is a script that takes your input data and outputs Prodigy-compatible JSONL with spans from a spaCy pipeline. The spaCy pipeline may include only the entity ruler, or the entity ruler plus the pre-trained NER if that's convenient. I think that's more or less what you mean by:

I thought of creating a pipeline inside Spacy and adding the PhraseMatcher, as a step in the pipeline similar to the entity_ruler - in the same capacity, but did not find a way to do it. Am I missing something ???

You can just use the entity ruler, as it allows both token and phrase patterns (string patterns are matched with a PhraseMatcher under the hood). Here's an example script:

import copy
from pathlib import Path

import spacy
import srsly
from prodigy.components.preprocess import add_tokens
from prodigy.components.stream import get_stream
from prodigy.types import StreamType
from spacy.language import Language


def get_nlp(patterns_dir: Path) -> Language:
    """Create an entity ruler from pattern files.

    patterns_dir (Path): Path to the directory containing the pattern files.
    RETURNS (Language): A spaCy nlp object with an entity ruler.
    """
    nlp = spacy.load(
        "en_core_web_trf", disable=["tagger", "parser", "attribute_ruler", "lemmatizer"]
    )
    config = {"overwrite_ents": True}  # we want to trust entity ruler more than the model
    ruler = nlp.add_pipe("entity_ruler", config=config, after="ner")
    for file in patterns_dir.iterdir():
        ruler.add_patterns([pattern for pattern in srsly.read_jsonl(file)])
    return nlp

def add_silver_annotations(stream: StreamType, nlp: Language) -> StreamType:
    """Add spans to the stream based on Entity Ruler matches.

    stream (StreamType): The stream of examples.
    nlp (Language): The spaCy nlp object with an entity ruler.
    RETURNS (StreamType): The stream of examples with spans.
    """
    relevant_labels = set()  # define here labels you want from spaCy pipeline
    examples = ((eg["text"], eg) for eg in stream)
    for doc, eg in nlp.pipe(examples, as_tuples=True):
        task = copy.deepcopy(eg)
        spans = []
        for ent in doc.ents:
            if ent.label_ in relevant_labels:
                spans.append(
                    {
                        "token_start": ent.start,
                        "token_end": ent.end - 1,  # token_end is inclusive
                        "start": ent.start_char,
                        "end": ent.end_char,
                        "text": ent.text,
                        "label": ent.label_,
                    }
                )
        task["spans"] = spans
        yield task

def main(patterns: Path, dataset: Path, output: Path):
    """Preannotate examples with silver annotations from patterns
    and spaCy en_core_web_trf.

    patterns (Path): Path to the directory containing the pattern files.
    dataset (Path): Path to the Prodigy JSONL file to be preannotated.
    output (Path): Path to the output directory.
    """
    stream = get_stream(dataset, input_key="text")
    nlp = get_nlp(patterns)
    stream.apply(add_tokens, nlp=nlp, stream=stream)
    stream.apply(add_silver_annotations, nlp=nlp, stream=stream)
    tasks = [task for task in stream]
    output.mkdir(parents=True, exist_ok=True)
    srsly.write_jsonl(output / "ner_preannotated.jsonl", tasks)

The output would be a pre-annotated dataset that still requires curation with ner.manual. This manual curation is also very important because it lets you understand your data much better and shows how well your pre-annotation strategies fit your problem. Getting that fit right will probably take a few iterations, especially in the case of LLMs.
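
Before loading the pre-annotated file into ner.manual, it can be worth sanity-checking that every span lines up with the text and the tokens. The sketch below is not part of Prodigy: `check_task` is a hypothetical helper, and it assumes the task layout produced by the script above (character offsets in `start`/`end`, token indices in `token_start`/`token_end` with an inclusive `token_end`, and tokens carrying `start`, `end` and `id`).

```python
import copy


def check_task(task):
    """Return a list of human-readable problems found in one task."""
    problems = []
    text = task["text"]
    tokens = task.get("tokens", [])
    # Map character boundaries to token ids so spans can be checked against them.
    start_to_id = {tok["start"]: tok["id"] for tok in tokens}
    end_to_id = {tok["end"]: tok["id"] for tok in tokens}
    for span in task.get("spans", []):
        start, end = span["start"], span["end"]
        if not 0 <= start < end <= len(text):
            problems.append(f"offsets out of range: {span}")
            continue
        if "text" in span and text[start:end] != span["text"]:
            problems.append(f"text mismatch: {span}")
        if tokens and start_to_id.get(start) != span.get("token_start"):
            problems.append(f"token_start mismatch: {span}")
        if tokens and end_to_id.get(end) != span.get("token_end"):
            problems.append(f"token_end mismatch: {span}")
    return problems


good = {
    "text": "Visit Paris now",
    "tokens": [
        {"text": "Visit", "start": 0, "end": 5, "id": 0},
        {"text": "Paris", "start": 6, "end": 11, "id": 1},
        {"text": "now", "start": 12, "end": 15, "id": 2},
    ],
    "spans": [
        {"start": 6, "end": 11, "token_start": 1, "token_end": 1,
         "text": "Paris", "label": "GPE"}
    ],
}

# Same task, but with an exclusive token_end (the off-by-one the script avoids).
bad = copy.deepcopy(good)
bad["spans"][0]["token_end"] = 2

good_problems = check_task(good)
bad_problems = check_task(bad)
```

Running a check like this over the whole JSONL file tends to surface tokenization mismatches early, before they show up as odd highlights in the annotation UI.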

After the curation, you'd be ready to train and correct and, optionally, teach to focus on the hardest examples.


THANKS. (I should have said thanks earlier, this was helpful)

I've meanwhile gravitated to spancat (for various reasons),
but my previous question about PhraseMatcher was a follow-up to several references I saw that described PhraseMatcher as more efficient.
Is that still the case?
Or was the comment about the fact that a phrase is much faster to match than a "pattern" composed of a list of token descriptions?

You're very welcome 🙂

That's right, the token-based matcher (spaCy Matcher) will be slower than the phrase-based matcher (spaCy PhraseMatcher) because it has to check the attributes of each individual token against the patterns, whereas the PhraseMatcher is designed for matching large terminology lists efficiently.
The "Rule-based matching" page in the spaCy usage documentation elaborates a bit more on the topic.
