[Request] Best practice for bootstrapping data for training partially new Named Entities? (and a question about PhraseMatcher)

I'm building a new NER model.
Is the plan below sane?
(There's also a question about PhraseMatcher further down.)

Some entities exist in the base model (e.g. GPE, LOC, DATE, TIME). I'd like to use that model when creating a silver training set.

Some entities are new.
Of the new ones, several are found in custom databases (tens of thousands of named entities, each belonging to one of several categories).
Other new entities are domain-specific terms.

I would like to bootstrap the annotations with patterns to speed things up for training the new model.

Here's my plan (and things I haven't been able to do):

  • For the entities that exist in the base model, I can run the base model over a few hundred sample texts and annotate them with the existing model, then review these entities with ner.manual and e.g. --label GPE,LOC,DATE,TIME.
    (I guess at this point I could also use teach with the same base model that generated them, though I don't expect it to be quicker: it was a trained model to begin with, and en_core_web_trf was trained on a lot more data than I can provide.)
    This works

  • For the DB-based new entities, my plan was to run ner.manual with --patterns ./patt.jsonl after generating a patt.jsonl file that looks like this (a generation sketch follows this list):
    {"label": "CUSTOM_L", "pattern": [{"LOWER": "token1"}, {"LOWER": "token2"}]}
    THIS WORKS, BUT...

    • I saw some comments about how the PhraseMatcher is faster, and tried to create a file that looks like this:
      {"label": "CUSTOM_L", "pattern": "token1 token2"}
      {"label": "CUSTOM_L", "pattern": "Token1 Token2"}
      {"label": "CUSTOM_L", "pattern": "TOKEN1 TOKEN2"}
      but Prodigy crapped out and wouldn't start, even after 15 minutes (it would start when I truncated that file to 1000 lines).
      i.e. this: prodigy ner.manual single_silver_ENT_A en_core_web_trf ./samples_s.jsonl --patterns ENT_A_PHRASES.jsonl --label ENT_A
      results in this:
      Using 1 label(s): ENT_A
      and then nothing happens.
    • I thought of creating a pipeline inside spaCy and adding the PhraseMatcher as a step in the pipeline, similar to the entity_ruler and in the same capacity, but did not find a way to do it. Am I missing something?
  • For the others I can use the LLM labeling bootstrapping method.
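
For reference, the patt.jsonl mentioned above can be generated with a short script along these lines. This is just a sketch: fetch_names is a hypothetical stand-in for the real DB queries, and splitting on whitespace is a simplification of proper tokenization.

import srsly

def fetch_names():
    # hypothetical stand-in for the custom DB queries;
    # returns entity strings grouped by category label
    return {"CUSTOM_L": ["token1 token2", "another entity name"]}

patterns = []
for label, names in fetch_names().items():
    for name in names:
        # token pattern in the same format as patt.jsonl above;
        # whitespace split is a simplification of real tokenization
        tokens = [{"LOWER": tok} for tok in name.lower().split()]
        patterns.append({"label": label, "pattern": tokens})
srsly.write_jsonl("patt.jsonl", patterns)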

Create a few silver datasets, each with different entities, all on the same sample texts.
Then use silver-to-gold, then train, then correct.

(The nice diagram doesn't have silver-to-gold, or teach...)

Hi @vish,

In general, combining various label bootstrapping methods the way you described makes sense, for sure.

For the spaCy model-based pre-annotation:
Perhaps the easiest way to go about it would be to use ner.correct with en_core_web_trf. That would be pre-annotating and correcting in one step. How successful that will be depends on how similar your dataset is to the data used for training en_core_web_trf.
Since you need to train your model from scratch (your final label set is quite different from en_core_web_trf's, and the amount of data is likely of a different order of magnitude), there's not much added value in using ner.teach with the pre-trained model. Conversely, once you've created your dataset with the help of the pre-trained model, it makes sense to train an initial custom model and use ner.teach so that it improves on the examples that are still challenging.
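
For example, with the sample file from your post (the dataset name silver_base is just an illustration):

prodigy ner.correct silver_base en_core_web_trf ./samples_s.jsonl --label GPE,LOC,DATE,TIME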

For the pattern-based pre-annotation:
If you have very many patterns, you might consider doing the pre-annotation via a script outside Prodigy - you don't really need the web app to apply the pre-annotation logic. Such a script would take in your input data and output Prodigy-compatible JSONL with spans from a spaCy pipeline. The spaCy pipeline may include only the entity ruler, or the entity ruler plus the pre-trained NER if that's convenient. I think that's more or less what you mean by:

I thought of creating a pipeline inside spaCy and adding the PhraseMatcher as a step in the pipeline, similar to the entity_ruler and in the same capacity, but did not find a way to do it. Am I missing something?

You can just use the entity ruler, as it accepts both token patterns and phrase (string) patterns. A single patterns file can mix both forms:
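
{"label": "CUSTOM_L", "pattern": [{"LOWER": "token1"}, {"LOWER": "token2"}]}
{"label": "CUSTOM_L", "pattern": "token1 token2"}

Here's an example script that loads the patterns into an entity ruler and applies it to your input data: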

from prodigy.components.stream import get_stream
from prodigy.components.preprocess import add_tokens
from prodigy.types import StreamType
from pathlib import Path
from spacy.language import Language
import copy
import spacy
import srsly


def get_nlp(patterns_dir: Path) -> Language:
    """
    Create an entity ruler from pattern files.

    Args:
        patterns_dir (Path): Path to the directory containing the pattern files.
    RETURNS (Language): A spaCy nlp object with an entity ruler.
    """
    nlp = spacy.load(
        "en_core_web_trf", disable=["tagger", "parser", "attribute_ruler", "lemmatizer"]
    )
    config = {"overwrite_ents": True} # we want to trust entity ruler more than the model
    ruler = nlp.add_pipe("entity_ruler", config=config, after="ner")
    for file in patterns_dir.iterdir():
        ruler.add_patterns([pattern for pattern in srsly.read_jsonl(file)])
    return nlp


def add_silver_annotations(stream: StreamType, nlp: Language) -> StreamType:
    """
    Add spans to the stream based on entity ruler matches.

    Args:
        stream (StreamType): The stream of examples.
        nlp (Language): The spaCy nlp object with an entity ruler.
    RETURNS (StreamType): The stream of examples with spans.
    """
    relevant_labels = set()  # define here the labels to keep from the pipeline, e.g. {"GPE", "LOC", "CUSTOM_L"}
    examples = ((eg["text"], eg) for eg in stream)
    for doc, eg in nlp.pipe(examples, as_tuples=True):
        task = copy.deepcopy(eg)
        spans = []
        for ent in doc.ents:
            if ent.label_ in relevant_labels: 
                  spans.append(
                      {
                          "token_start": ent.start,
                          "token_end": ent.end - 1,
                          "start": ent.start_char,
                          "end": ent.end_char,
                          "text": ent.text,
                          "label": ent.label_,
                      }
                  )
        task["spans"] = spans
        yield task

def main(patterns: Path, dataset: Path, output: Path):
    """
    Preannotate examples with silver annotations from patterns
    and spaCy en_core_web_trf.

    Args:
        patterns (Path): Path to the directory containing the pattern files.
        dataset (Path): Path to the Prodigy JSONL file to be preannotated.
        output (Path): Path to the output directory.
    """
    stream = get_stream(dataset, input_key="text")
    nlp = get_nlp(patterns)
    stream.apply(add_tokens, nlp=nlp, stream=stream)
    stream.apply(add_silver_annotations, nlp=nlp, stream=stream)
    tasks = [task for task in stream]
    output.mkdir(parents=True, exist_ok=True)
    srsly.write_jsonl(Path(output, "ner_preannotated.jsonl"), tasks)
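
To run it, you could call main directly; the paths here are placeholders to adapt:

if __name__ == "__main__":
    main(
        patterns=Path("./patterns"),        # directory containing patt.jsonl etc.
        dataset=Path("./samples_s.jsonl"),  # the sample texts from your post
        output=Path("./preannotated"),      # output directory
    )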

The output would be a pre-annotated dataset that would require curation with ner.manual. This manual curation will also be very important because it will let you understand your data much better, as well as see how well your pre-annotation strategies fit your problem. That will probably take a few iterations, especially in the case of LLMs.
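
For example (the dataset name and label list are illustrative):

prodigy ner.manual curated_ner blank:en ./preannotated/ner_preannotated.jsonl --label GPE,LOC,DATE,TIME,CUSTOM_L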

After the curation, you'd be ready to train and correct and, optionally, teach to focus on the hardest examples.
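
In command form, that loop might look like this (all dataset and path names are illustrative):

prodigy train ./custom_model --ner curated_ner
prodigy ner.correct curated_ner_2 ./custom_model/model-best ./more_samples.jsonl --label CUSTOM_L,ENT_A
prodigy ner.teach hard_examples ./custom_model/model-best ./more_samples.jsonl --label CUSTOM_L,ENT_A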


THANKS. (I should have said thanks earlier; this was helpful.)

I've meanwhile gravitated to spancat (for various reasons),
but my previous question about PhraseMatcher was a follow-up to several references I saw that mentioned PhraseMatcher being more efficient.
Is that still the case?
Or was the comment referring to the fact that a phrase is much faster to match than a "pattern" composed of a list of token attributes?

You're very welcome 🙂

That's right: the token-based matcher (spaCy Matcher) will be slower than the phrase-based matcher (spaCy PhraseMatcher), because it analyzes all attributes of each individual token.
This section of the spaCy docs elaborates a bit more on the topic: Rule-based matching · spaCy Usage Documentation (https://spacy.io/usage/rule-based-matching)
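
As a small aside: the PhraseMatcher also takes an attr argument, so with attr="LOWER" a single lowercase pattern covers all three casing variants from your earlier file. A minimal sketch:

import spacy
from spacy.matcher import PhraseMatcher

nlp = spacy.blank("en")
# attr="LOWER" makes matching case-insensitive, so one pattern covers
# "token1 token2", "Token1 Token2" and "TOKEN1 TOKEN2"
matcher = PhraseMatcher(nlp.vocab, attr="LOWER")
matcher.add("CUSTOM_L", [nlp.make_doc("token1 token2")])

doc = nlp("Some text mentioning Token1 Token2 in passing.")
for match_id, start, end in matcher(doc):
    print(nlp.vocab.strings[match_id], doc[start:end].text)

The entity ruler exposes the same knob for its string patterns via its phrase_matcher_attr setting, e.g. config={"phrase_matcher_attr": "LOWER"}.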
