How do I add a --patterns option to ner.make-gold?

ner
solved

(W.P. McNeill) #1

I want to use ner.make-gold to annotate new entity types, except that in addition to getting candidate entities from the model, I also want to get them from a pattern file. I want a mixture of model and pattern candidate entities, the same as I get with ner.teach.

I am trying to write a custom ner.make-gold recipe that does this. First I added a patterns option and used it to create a PatternMatcher object.

from prodigy.models.matcher import PatternMatcher
...
matcher = PatternMatcher(model.nlp).from_disk(patterns)

I then use this pattern matcher inside the make_tasks function. I’m finding it hard to make this work.

I can apply it to the incoming stream

for score, matches in matcher(stream):
    ....

But I have to do some work to synchronize the generator coming out of this with the original task stream so that I can deepcopy and alter the task structures. I’m starting to do lots of tricky stuff with tee which makes me suspect that I might be doing things the hard way.

I’m also not sure how to combine model and pattern predictions. I tried using combine_models(model, matcher) but the predictor I got out only seemed to be using the model and not the patterns. In general I find this confusing because there are multiple ways to get Prodigy/spaCy to suggest a named entity (e.g. model.predict, predict returned by combine_models, nlp.pipe, matcher(stream)) and I am not clear on the differences between them and which one I should be using.

How should I add a --patterns option to ner.make-gold?


(Ines Montani) #2

The combine_models helper is mostly relevant if you have two models and want to interleave their predictions and be able to update them in the loop. While you could do this, it seems like your use case is much simpler – you only want to show the results of both the model and the patterns, right?

So a simpler approach would be to slightly adapt the current ner.make-gold recipe and add a Matcher. If you look at the make_tasks function within the recipe, you’ll see that the logic is actually quite simple: the incoming examples are piped through the nlp object and each entity in doc.ents is then added to the annotation task as a "spans" property.

Your custom recipe could load your patterns from the file, sort them by label (so you’ll only have to add one entry to the matcher for each label) and then add the patterns for each label to the Matcher:

patterns = []  # load your patterns here
sorted_patterns = {}  # patterns grouped by label
for entry in patterns:
    sorted_patterns.setdefault(entry['label'], []).append(entry['pattern'])
matcher = Matcher(nlp.vocab)
# iterate over .items() – looping over the dict directly only yields the keys
for label, label_patterns in sorted_patterns.items():
    # add patterns for each label and use the label as the match ID
    matcher.add(label, None, *label_patterns)

When you’re iterating over the docs / examples in make_tasks, you can now also get the matches for the doc. Creating a Span object will give you the same attributes that entities within doc.ents have – including the start_char, end_char and start and end index. Those are needed within the annotation task, so Prodigy can map the span to the respective tokens.

from spacy.tokens import Span

spans = []
matches = matcher(doc)
for label, start, end in matches:
    # label is the integer match ID here; ent.label_ resolves it to the string
    ent = Span(doc, start, end, label)
    spans.append({ ... })  # create the annotation task span here

You could probably also make the function a little more elegant by adding all entity Spans created by the matcher to a list, together with list(doc.ents) and then iterating over them once to create the annotation task spans. You’ll also want to filter out duplicate spans that are present in both the doc.ents and the matches. If there’s any other custom filtering you want to do, you can also just apply it in this function.
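A minimal sketch of that duplicate filtering, in plain Python (the span dicts and the `dedupe_spans` helper name here are just for illustration, not part of Prodigy's API):

```python
def dedupe_spans(spans):
    """Keep only the first span seen for each (start, end, label) triple."""
    seen = set()
    unique = []
    for span in spans:
        key = (span['start'], span['end'], span['label'])
        if key not in seen:
            seen.add(key)
            unique.append(span)
    return unique

spans = [
    {'start': 0, 'end': 5, 'label': 'ORG'},    # from doc.ents
    {'start': 0, 'end': 5, 'label': 'ORG'},    # same span, found by the matcher
    {'start': 10, 'end': 14, 'label': 'ORG'},  # matcher-only span
]
print(dedupe_spans(spans))  # the duplicate is dropped, leaving 2 spans
```

You'd call this on the combined list of model and pattern spans before assigning it to `task['spans']`.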

Anyway, this is actually a nice feature idea for the next release! :+1:


(W.P. McNeill) #3

So should I be using spacy.matcher.Matcher and not prodigy.models.matcher.PatternMatcher?

(Ines Montani) #4

Yes, exactly. spaCy’s Matcher has everything you need – it can read in the token-based patterns, and find matches in the document.

Prodigy’s PatternMatcher is a wrapper model for spaCy’s Matcher and PhraseMatcher that includes extra functionality to assign probabilities and update them based on the accept/reject rate of the matches. However, as I mentioned above, this is mostly relevant if you’re annotating with a model in the loop.


(W.P. McNeill) #5

I ended up writing this. Feel free to incorporate it into a later version of Prodigy.

@recipe('ner.make-gold',
        dataset=recipe_args['dataset'],
        spacy_model=recipe_args['spacy_model'],
        source=recipe_args['source'],
        api=recipe_args['api'],
        loader=recipe_args['loader'],
        patterns=recipe_args['patterns'],
        labels=recipe_args['label_set'],
        exclude=recipe_args['exclude'],
        unsegmented=recipe_args['unsegmented'])
def make_gold(dataset, spacy_model, source=None, api=None, loader=None,
              patterns=None, labels=None, exclude=None, unsegmented=False):
    """
    Create gold data for NER by correcting a model's suggestions.
    """
    log("RECIPE: Starting recipe ner.make-gold", locals())
    nlp = spacy.load(spacy_model)
    log("RECIPE: Loaded model {}".format(spacy_model))

    patterns_by_label = {}
    for entry in read_jsonl(patterns):
        patterns_by_label.setdefault(entry['label'], []).append(entry['pattern'])
    matcher = Matcher(nlp.vocab)
    for pattern_label, patterns in patterns_by_label.items():
        matcher.add(pattern_label, None, *patterns)

    # Get the label set from the `label` argument, which is either a
    # comma-separated list or a path to a text file. If labels is None, check
    # if labels are present in the model.
    if labels is None:
        labels = set(get_labels_from_ner(nlp) + patterns_by_label.keys())
        print("Using {} labels from model: {}"
              .format(len(labels), ', '.join(labels)))
    log("RECIPE: Annotating with {} labels".format(len(labels)), labels)
    stream = get_stream(source, api=api, loader=loader, rehash=True,
                        dedup=True, input_key='text')
    # Split the stream into sentences
    if not unsegmented:
        stream = split_sentences(nlp, stream)
    # Tokenize the stream
    stream = add_tokens(nlp, stream)

    def make_tasks():
        """Add a 'spans' key to each example, with predicted entities."""
        texts = ((eg['text'], eg) for eg in stream)
        for doc, eg in nlp.pipe(texts, as_tuples=True):
            task = copy.deepcopy(eg)
            spans = []
            matches = matcher(doc)
            pattern_matches = tuple(Span(doc, start, end, label) for label, start, end in matches)
            for ent in doc.ents + pattern_matches:
                if labels and ent.label_ not in labels:
                    continue
                spans.append({
                    'token_start': ent.start,
                    'token_end': ent.end - 1,
                    'start': ent.start_char,
                    'end': ent.end_char,
                    'text': ent.text,
                    'label': ent.label_,
                    'source': spacy_model,
                    'input_hash': eg[INPUT_HASH_ATTR]
                })
            task['spans'] = spans
            task = set_hashes(task)
            yield task

    return {
        'view_id': 'ner_manual',
        'dataset': dataset,
        'stream': make_tasks(),
        'exclude': exclude,
        'update': None,
        'config': {'lang': nlp.lang, 'labels': labels}
    }

(W.P. McNeill) #6

FWIW my confusion here was an instance of more general confusion about how to write custom recipes. Prodigy has good documentation and examples, but I still can’t always wrap my head around how to approach things. I want Prodigy to be a layer built on top of spaCy that I could use without knowing spaCy, but by design it’s not. That’s fine, but now I have a hard time building a mental model of where the lines between them should be drawn.

For example, here at first I got spacy.matcher.Matcher confused with prodigy.models.matcher.PatternMatcher. I had literally overlooked the fact that there were two different classes until you answered my initial question. Of course the difference is clearly documented, but it’s still a mistake I tend to make when I’m copy-pasting code and class names are similar. Then when I realized my mistake I went back and tried to tell myself a story like “PatternMatcher = Matcher + scoring”. Except there are other differences (Matcher takes a document as an argument where PatternMatcher takes an iterator over tasks) that I can’t concisely characterize to myself. Or for example, the confusion I mention above where I think (correctly or not) that there are four different ways to get a named entity from text.

I realize that’s a little vague and might just boil down to me saying “I’m confused”. If I manage to precisely characterize what I’m confused about, I’ll post it on the board.


(Matthew Honnibal) #8

@wpm Thanks for this; it’s quite thought-provoking.

Sometimes I battle between two perspectives. On the one hand, I think: isn’t it amazing what we can do with ML and NLP? Things that used to be multi-year research project pilot studies can now be shipped by one or two people, often with fairly minimal experience. But then other times I think: this stuff barely works, none of it makes sense, and my best advice often boils down to “I don’t know, have you tried turning it off and on again?”. It’s sort of this Lovecraftian thing: sometimes it feels like I’m in a world of stability and order, but then I hear whispers from the void that tell of madness that lies beneath.

In Prodigy we’ve tried to make the “happy path” quite painless, while also trying to make sure that all the underlying bits are exposed, so you can swap things out and play with the internals. We want to give people a command that’s like, “Train the text classifier on your batch of annotations”. So I build a model that’s usually good for that, and try to give it sensible defaults. But that’s not the last word on text classification! Sometimes you do need to do something completely different. That’s why we have these wrappers. We don’t even want to tie the library to spaCy — we want it to work with any other solution you provide, so long as that solution has the right capabilities.

So yes sometimes you definitely do need to build things with spaCy to solve a particular problem. And in fact, even spaCy isn’t a 100% satisfying wrapper around spaCy – spaCy also does the thing that’s confusing you here: it gives you mostly-good defaults, that sometimes you need to go and change. And past that, sometimes you’ll need to use an entirely different solution: spaCy provides the best general-purpose NLP things we know how to build; but that doesn’t mean no other NLP or ML tools will be useful for your problems.

Anyway. To answer your specific confusions. It might help to remember that Prodigy’s main mission in life is to ask you questions so you can annotate data. All the processing tools within Prodigy are in assistance of that mission. spaCy’s mission is to add annotations to text, and make the annotation easy to work with. Prodigy needs annotations to ask its questions, so it calls into spaCy. But you could call into other solutions, even simple functions you write yourself.

Everything in prodigy.models takes an iterator over tasks and returns an updated iterator, with scores and usually different questions. The purpose is to make a feed of questions for data annotation. It might be accidentally useful for other purposes, but that’s incidental.

Well, spaCy provides two different algorithms — which is unfortunate; we really try to give one of everything. We added the second algorithm, beam search, just for Prodigy. We considered putting it inside Prodigy, but it was too much behind the closed-source curtain — we strongly prefer to expose these things to you.

The beam search is important in producing varied NER questions. It’s designed to handle the case where the model’s really confidently wrong in its predictions. In this situation, beam search can still give you questions that help you guide the model back to the correct analysis. Without beam search, the annotation can get “stuck” in bad states.

Prodigy’s EntityRecognizer.__call__ method is designed to give you this variety of questions, from across the beam. If you’re just trying to add annotations to text, this is almost certainly not what you want.


(W.P. McNeill) #9

Of course everything I’m saying is a nitpick. You guys have delivered one of the most useful NLP toolkits in years and now I want to be able to understand it intuitively without reading the documentation. You’re victims of your own success. :slight_smile:

In building my mental model of the Prodigy/spaCy interfaces I forgot about beam search. I hadn’t factored “with beam search/without beam search” in as one of the dimensions along which interface functionality can vary. That’s the sort of meta-knowledge that helps make the interface discoverable. Of course, there are lots of dimensions along which a machine learning interface can vary, so communicating this meta-knowledge without overloading the reader is going to be difficult.

Happy path is easy and power user can do whatever they want with more effort is definitely the right model. I’ve just turned into a power user very quickly.


(Matthew Honnibal) #10

:bowing_man: :blush:

Unfortunately the cliff off the happy-path is steep (the void beckons)


(Cris) #11

Hi @wpm, I’m trying your code and I’ve made some edits:

import prodigy
import spacy
from prodigy.util import log

import spacy.gold
import spacy.vocab
import spacy.tokens
import copy

from spacy.tokens import Span
from prodigy.components.preprocess import split_sentences, add_tokens
from prodigy.components.loaders import get_stream
from prodigy.core import recipe_args
from prodigy.util import split_evals, get_labels_from_ner, get_print, combine_models
from prodigy.util import read_jsonl, write_jsonl, set_hashes, log, prints
from prodigy.util import INPUT_HASH_ATTR


@prodigy.recipe('ner.make-gold',
        dataset=recipe_args['dataset'],
        spacy_model=recipe_args['spacy_model'],
        source=recipe_args['source'],
        api=recipe_args['api'],
        loader=recipe_args['loader'],
        patterns=recipe_args['patterns'],
        labels=recipe_args['label_set'],
        exclude=recipe_args['exclude'],
        unsegmented=recipe_args['unsegmented'])

def make_gold(dataset, spacy_model, source=None, api=None, loader=None,
              patterns=None, labels=None, exclude=None, unsegmented=False):
    """
    Create gold data for NER by correcting a model's suggestions.
    """
    log("RECIPE: Starting recipe ner.make-gold", locals())
    nlp = spacy.load(spacy_model)
    log("RECIPE: Loaded model {}".format(spacy_model))

    patterns_by_label = {}
    for entry in prodigy.util.read_jsonl(patterns):
        patterns_by_label.setdefault(entry['label'], []).append(entry['pattern'])
    matcher = spacy.matcher.Matcher(nlp.vocab)
    for pattern_label, patterns in patterns_by_label.items():
        matcher.add(pattern_label, None, *patterns)

    # Get the label set from the `label` argument, which is either a
    # comma-separated list or a path to a text file. If labels is None, check
    # if labels are present in the model.
    if labels is None:
        labels = set(get_labels_from_ner(nlp) + patterns_by_label.keys())
        print("Using {} labels from model: {}"
              .format(len(labels), ', '.join(labels)))
    log("RECIPE: Annotating with {} labels".format(len(labels)), labels)
    stream = get_stream(source, api=api, loader=loader, rehash=True,
                        dedup=True, input_key='text')
    # Split the stream into sentences
    if not unsegmented:
        stream = split_sentences(nlp, stream)
    # Tokenize the stream
    stream = add_tokens(nlp, stream)

    def make_tasks():
        """Add a 'spans' key to each example, with predicted entities."""
        texts = ((eg['text'], eg) for eg in stream)
        for doc, eg in nlp.pipe(texts, as_tuples=True):
            task = copy.deepcopy(eg)
            spans = []
            matches = matcher(doc)
            pattern_matches = tuple(Span(doc, start, end, label) for label, start, end in matches)
            for ent in doc.ents + pattern_matches:
                if labels and ent.label_ not in labels:
                    continue
                spans.append({
                    'token_start': ent.start,
                    'token_end': ent.end - 1,
                    'start': ent.start_char,
                    'end': ent.end_char,
                    'text': ent.text,
                    'label': ent.label_,
                    'source': spacy_model,
                    'input_hash': eg[INPUT_HASH_ATTR]
                })
            task['spans'] = spans
            task = set_hashes(task)
            yield task

    return {
        'view_id': 'ner_manual',
        'dataset': dataset,
        'stream': make_tasks(),
        'exclude': exclude,
        'update': None,
        'config': {'lang': nlp.lang, 'labels': labels}
    }

And running it on 1.6.1 I get this error:

Traceback (most recent call last):
  File "/anaconda3/lib/python3.6/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/anaconda3/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/anaconda3/lib/python3.6/site-packages/prodigy/__main__.py", line 259, in <module>
    controller = recipe(*args, use_plac=True)
  File "cython_src/prodigy/core.pyx", line 253, in prodigy.core.recipe.recipe_decorator.recipe_proxy
  File "/anaconda3/lib/python3.6/site-packages/plac_core.py", line 328, in call
    cmd, result = parser.consume(arglist)
  File "/anaconda3/lib/python3.6/site-packages/plac_core.py", line 207, in consume
    return cmd, self.func(*(args + varargs + extraopts), **kwargs)
  File "prodigy-ner-mg.py", line 57, in make_gold
    labels = set(get_labels_from_ner(nlp) + patterns_by_label.keys())
TypeError: can only concatenate list (not "dict_keys") to list

I would like to understand how to fix it :wink:

My best

C.


(Ines Montani) #12

As the error message says, patterns_by_label.keys() (i.e. the keys of a dictionary) aren’t exactly a list – so when you try to add them to an actual list returned by get_labels_from_ner(nlp), Python complains.

You should be able to solve this by simply wrapping the keys in a list, e.g.:

labels = set(get_labels_from_ner(nlp) + list(patterns_by_label.keys()))
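For illustration, here is the failure and the fix in isolation (plain Python, no Prodigy needed; the label names and patterns are made up):

```python
# get_labels_from_ner returns a plain list of label strings
model_labels = ['PERSON', 'ORG']
# hypothetical patterns grouped by label, as in the recipe
patterns_by_label = {'DRUG': [[{'LOWER': 'aspirin'}]]}

try:
    labels = set(model_labels + patterns_by_label.keys())
except TypeError as err:
    print(err)  # can only concatenate list (not "dict_keys") to list

# Wrapping the keys in list() makes both operands lists:
labels = set(model_labels + list(patterns_by_label.keys()))
print(sorted(labels))  # ['DRUG', 'ORG', 'PERSON']
```

Since dict_keys are set-like, `set(model_labels) | patterns_by_label.keys()` would work as well.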

Also see Stack Overflow for more details on this.


(Cris) #13

Yes @Ines, thanks, I solved it with your suggestion :wink: @wpm

Now:

import prodigy
import spacy
from prodigy.util import log

import spacy.gold
import spacy.vocab
import spacy.tokens
import copy

from spacy.tokens import Span
from prodigy.components.preprocess import split_sentences, add_tokens
from prodigy.components.loaders import get_stream
from prodigy.core import recipe_args
from prodigy.util import split_evals, get_labels_from_ner, get_print, combine_models
from prodigy.util import read_jsonl, write_jsonl, set_hashes, log, prints
from prodigy.util import INPUT_HASH_ATTR


@prodigy.recipe('ner.make-gold',
        dataset=recipe_args['dataset'],
        spacy_model=recipe_args['spacy_model'],
        source=recipe_args['source'],
        api=recipe_args['api'],
        loader=recipe_args['loader'],
        patterns=recipe_args['patterns'],
        labels=recipe_args['label_set'],
        exclude=recipe_args['exclude'],
        unsegmented=recipe_args['unsegmented'])

def make_gold(dataset, spacy_model, source=None, api=None, loader=None,
              patterns=None, labels=None, exclude=None, unsegmented=False):
    """
    Create gold data for NER by correcting a model's suggestions.
    """
    log("RECIPE: Starting recipe ner.make-gold", locals())
    nlp = spacy.load(spacy_model)
    log("RECIPE: Loaded model {}".format(spacy_model))

    patterns_by_label = {}
    for entry in prodigy.util.read_jsonl(patterns):
        patterns_by_label.setdefault(entry['label'], []).append(entry['pattern'])
    matcher = spacy.matcher.Matcher(nlp.vocab)
    for pattern_label, patterns in patterns_by_label.items():
        matcher.add(pattern_label, None, *patterns)

    # Get the label set from the `label` argument, which is either a
    # comma-separated list or a path to a text file. If labels is None, check
    # if labels are present in the model.
    if labels is None:
        #labels = set(get_labels_from_ner(nlp) + list(patterns_by_label.keys()))
        labels = set(get_labels_from_ner(nlp))
        
        print("Using {} labels from model: {}"
              .format(len(labels), ', '.join(labels)))
    log("RECIPE: Annotating with {} labels".format(len(labels)), labels)
    stream = get_stream(source, api=api, loader=loader, rehash=True,
                        dedup=True, input_key='text')
    # Split the stream into sentences
    if not unsegmented:
        stream = split_sentences(nlp, stream)
    # Tokenize the stream
    stream = add_tokens(nlp, stream)

    def make_tasks(nlp, stream):
        """Add a 'spans' key to each example, with predicted entities."""
        texts = ((eg['text'], eg) for eg in stream)
        for doc, eg in nlp.pipe(texts, as_tuples=True):
            task = copy.deepcopy(eg)
            spans = []
            matches = matcher(doc)
            pattern_matches = tuple(Span(doc, start, end, label) for label, start, end in matches)
            for ent in doc.ents + pattern_matches:
                if labels and ent.label_ not in labels:
                    continue
                spans.append({
                    'token_start': ent.start,
                    'token_end': ent.end - 1,
                    'start': ent.start_char,
                    'end': ent.end_char,
                    'text': ent.text,
                    'label': ent.label_,
                    'source': spacy_model,
                    'input_hash': eg[INPUT_HASH_ATTR]
                })
            task['spans'] = spans
            task = set_hashes(task)
            yield task

    return {
        'view_id': 'ner_manual',
        'dataset': dataset,
        'stream': make_tasks(nlp, stream),
        'exclude': exclude,
        'update': None,
        'config': {'lang': nlp.lang, 'labels': labels}
    }

After that I see: if a term from the pattern file is also in the example, that term is shown twice in the UI :palms_up_together: