Question about EntityRecognizer

I'm building a custom recipe following ner.teach

What is the relation between PatternMatcher and EntityRecognizer? (the one from prodigy.models.ner) EntityRecognizer isn't documented anywhere :confused:
If i add the same patterns to both by doing something like

PatternMatcher.add_patterns(my_patterns)
EntityRecognizer(....patterns=my_patterns)

and then combine the models

predict, update = combine_models(model, matcher)

what is happening in combine_models?
Is defining patterns redundant for EntityRecognizer? Should i just use EntityRecognizer and not a PatternMatcher? OR just a PatternMatcher and not an EntityRecognizer. (I want the resulting model to update as i'm annotating like in ner.teach)

Also I'm using a custom tokenizer on my spacy model (nlp.tokenizer = nlp.tokenizer.tokens_from_list) and passing a list of tokens to ["text"] instead of a string. I have managed to get everything else working, but model.update(answers), where answers["text"] is the list of tokens, is giving me this error:

cython_src/prodigy/models/ner.pyx in prodigy.models.ner.EntityRecognizer.update(
cython_src/prodigy/models/ner.pyx in prodigy.models.ner.merge_spans()
TypeError: unhashable type: 'list'

pattern_matcher.update(answers) is working, though I'm not sure what it is actually updating, since it is just a pattern matcher?

The EntityRecognizer (which should have probably been named something else) is the annotation model for the active learning-powered entity annotation. Just like the PatternMatcher, which follows the same API, it exposes two relevant methods:

  • __call__: Process a stream of examples and yield (score, example) tuples.
  • update: Take a list of answers and update the model.

Its job is to produce a scored stream with NER predictions – it doesn't deal with any patterns or anything else. The PatternMatcher follows the same API, except that it adds pattern matches to the examples and yields those as (score, example) tuples, and it updates the probabilities for the patterns when it receives answers. The combine_models helper combines both models, interleaves their results and makes sure both models are updated.

Here are the docs for the PatternMatcher API: https://prodi.gy/docs/api-components#patternmatcher
And here's the combine_models helper: https://prodi.gy/docs/api-components#combine_matches
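
To make the shared API a bit more concrete, here is a minimal, self-contained sketch. ToyModel and toy_combine are hypothetical names made up for illustration, and this is not Prodigy's actual implementation of combine_models. It only shows the idea: both objects are callable on a stream and yield (score, example) tuples, both have an update method, and the combined predict function alternates between their output while the combined update function passes the answers to both.

```python
from itertools import zip_longest


class ToyModel:
    """Hypothetical stand-in that follows the two-method API described above."""

    def __init__(self, name, score):
        self.name = name
        self.score = score
        self.n_updates = 0

    def __call__(self, stream):
        # Process a stream of examples and yield (score, example) tuples.
        for eg in stream:
            yield self.score, {**eg, "source": self.name}

    def update(self, answers):
        # Take a list of answers and update the internal state.
        self.n_updates += len(answers)
        return 0.0  # placeholder for a loss value


def toy_combine(one, two):
    """Illustrative only: interleave both models' output and update both."""

    def predict(stream):
        stream = list(stream)  # both models need to see every example
        for a, b in zip_longest(one(stream), two(stream)):
            if a is not None:
                yield a
            if b is not None:
                yield b

    def update(answers):
        return one.update(answers) + two.update(answers)

    return predict, update


model = ToyModel("ner", 0.5)
matcher = ToyModel("patterns", 0.9)
predict, update = toy_combine(model, matcher)
scored = list(predict([{"text": "first"}, {"text": "second"}]))
update([{"text": "first", "answer": "accept"}])
```

In this sketch, scored contains suggestions from both sources in alternating order, and a single call to update reaches both objects.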

The problem here is likely that "text" is not a string – Prodigy's data format expects the "text" to be a string. That's also what the model will be updated with. So if your nlp object expects a list as input, that's really non-standard and won't work with the built-in implementations in Prodigy, which all assume that you're working with raw text and that the incoming text matches the processed doc.text.
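
If your data is already tokenized, one possible workaround is to turn each token list into a task with a string "text" and keep the tokens alongside it. The sketch below is only an illustration: tokens_to_task is a hypothetical helper, and it assumes your tokens can be joined with single spaces and that the tokenizer in your nlp object splits on those same spaces, so the resulting doc.text matches the "text".

```python
def tokens_to_task(tokens):
    """Build a task dict with a string "text" from pre-tokenized input.

    Assumption: the tokens can be joined with single spaces.
    """
    text = " ".join(tokens)
    token_dicts = []
    offset = 0
    for i, tok in enumerate(tokens):
        # Character offsets point into the joined string.
        token_dicts.append(
            {"text": tok, "start": offset, "end": offset + len(tok), "id": i}
        )
        offset += len(tok) + 1  # skip the joining space
    return {"text": text, "tokens": token_dicts}


task = tokens_to_task(["New", "York", "is", "big"])
```

The "text" value is now a plain string, and every token's start and end offsets slice back to the original token.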

I have managed to fix the model.update error by changing my tokenizer logic (made a custom tokenizer class) :slight_smile: but now I have another problem/question :thinking:
If I start with no patterns (nothing even in the nlp model), can PatternMatcher add patterns through the update logic while annotating multiple labels at a time with combine_matches=True?

PatternMatcher(
    vanilla_nlp.nlp, filter_labels=vanilla_nlp.labels, task_hash_keys=("text",), all_examples=True, combine_matches=True
)

The documentation says the spans must have a "pattern" key to be updatable, but the answer spans I see returned don't set that automatically.
(Edit: the "pattern" fields disappear on the answers' spans! opened another issue for it-> [Front End BUG] Pattern Ids disappear from spans upon modification)

Expects the examples to have an "answer" key ( "accept" , "reject" or "ignore" ) and will use all "spans" that have a "pattern" key, which is the ID of the pattern assigned by PatternMatcher.__call__.

Would adding a pattern to the PatternMatcher manually before the update then setting the pattern id for the span work?

The ner.teach example recipe teaches one label at a time so I'm not sure if it is possible for multiple labels?

So my main question shortly is: can PatternMatcher.update add patterns or only modify the existing patterns' scores?

I've replied about the pattern IDs in the spans on the other thread – I think part of what makes this particular scenario difficult is the fact that the UI accepts manual edits. This introduces a lot of additional questions about how to score the matches and what the scores mean, which you otherwise wouldn't have if you just gave binary feedback.

The pattern ID is based on the pattern itself and the label, so even if you used the same pattern for different labels, those would receive different IDs. There's actually very little magic going on in the PatternMatcher and it's mostly just a wrapper around spaCy's Matcher and PhraseMatcher. So you could even implement your own version of this and customise how the patterns are scored:

  • Generate a unique ID for each pattern.
  • When a pattern is matched, add meta information to each task containing the pattern IDs.
  • When an answer comes back, check whether the match was "good" (it was accepted) or not (it was rejected or changed).
    • If the match was "good", increment the correct counter for the given pattern ID.
    • If it wasn't, increment the incorrect counter for the given pattern ID.
  • Whenever a pattern is matched in your stream, check the correct and incorrect scores of the pattern ID, and calculate a pattern score (e.g. ratio of correct to incorrect counts), which reflects how likely the pattern is to be helpful, based on previous feedback.
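
The steps above could be sketched roughly like this. Everything here is an assumption for illustration, not how the built-in PatternMatcher works internally: pattern_id and PatternScores are hypothetical names, the ID is a hash of the label and the pattern, and the score is a smoothed fraction of accepted matches rather than a raw ratio, so that unseen patterns start at 0.5. Detecting whether a span was manually changed is left out; the sketch only looks at the "answer" key.

```python
import hashlib
import json
from collections import defaultdict


def pattern_id(label, pattern):
    """Hypothetical ID scheme: a short hash of the label plus the pattern."""
    raw = json.dumps({"label": label, "pattern": pattern}, sort_keys=True)
    return hashlib.sha256(raw.encode("utf8")).hexdigest()[:8]


class PatternScores:
    def __init__(self):
        self.correct = defaultdict(int)
        self.incorrect = defaultdict(int)

    def update(self, answers):
        for eg in answers:
            if eg.get("answer") == "ignore":
                continue  # ignored examples carry no signal
            good = eg.get("answer") == "accept"
            for span in eg.get("spans", []):
                pid = span.get("pattern")
                if pid is None:
                    continue  # span didn't come from a pattern
                if good:
                    self.correct[pid] += 1
                else:
                    self.incorrect[pid] += 1

    def score(self, pid):
        # Smoothed fraction of accepted matches; unseen patterns score 0.5.
        right, wrong = self.correct[pid], self.incorrect[pid]
        return (right + 1) / (right + wrong + 2)


pid_org = pattern_id("ORG", [{"lower": "apple"}])
pid_fruit = pattern_id("FRUIT", [{"lower": "apple"}])
scores = PatternScores()
scores.update(
    [
        {"answer": "accept", "spans": [{"pattern": pid_org}]},
        {"answer": "accept", "spans": [{"pattern": pid_org}]},
        {"answer": "reject", "spans": [{"pattern": pid_fruit}]},
    ]
)
```

Because the label is part of the hashed value, the same token pattern gets a different ID for each label, and the two IDs collect feedback separately.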

Sooo I shouldn't have bothered with the PatternMatcher and EntityRecognizer in prodigy and just trained a spacy EntityRecognizer all along... well that is 3 days down the drain :sob: + possibly more

Implementing a custom PatternMatcher is not the road I wanna go down (especially for a closed source class I have little chance of debugging)

I had assumed 'active learning' feature would be available for the multi-label tagging (I hardly wanna go over the same examples again and again... tagging one label at a time). Honestly with the pattern Ids disappearing upon modification I'm not sure how it even works for the single label case.

Sorry to hear, but I don't think your efforts are all wasted? The thing is, every problem is different and implementing custom ideas for active learning takes some experimentation and there are not always easy answers.

The active learning workflow of presenting binary suggestions using beam search, interleaved with binary pattern matches and uncertainty sampling, is something we've tested and that worked well for many use cases, so that's what we're shipping as a built-in recipe in ner.teach. Something similar might work if you're manually editing the spans, but it will likely require a different approach to scoring all spans and matches and updating the model with the annotations. It's something that could be built and it's definitely possible to do in a Prodigy recipe, but it's difficult to predict what strategy is going to work best.

The PatternMatcher is a simple built-in wrapper for the most common matching use cases – but you definitely don't have to use it. As I described above, the underlying logic is pretty straightforward, so you could also replace it with regular expressions or anything else you want to use to pre-select examples for annotation.
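
As a rough illustration of the regular expression route, here is a small sketch that pre-selects examples for several labels at once. The labels and expressions in LABEL_PATTERNS and the preselect function are made up for this example, and no scoring or updating is involved; it only filters a stream and attaches character-offset spans.

```python
import re

# Hypothetical label-to-regex mapping, for illustration only.
LABEL_PATTERNS = {
    "MONEY": re.compile(r"\$\d+(?:\.\d+)?"),
    "YEAR": re.compile(r"\b(?:19|20)\d{2}\b"),
}


def preselect(stream):
    """Yield only examples with at least one regex match, with spans attached."""
    for eg in stream:
        spans = [
            {"start": m.start(), "end": m.end(), "label": label}
            for label, regex in LABEL_PATTERNS.items()
            for m in regex.finditer(eg["text"])
        ]
        if spans:
            yield {**eg, "spans": sorted(spans, key=lambda s: s["start"])}


examples = [
    {"text": "It cost $25 in 2019."},
    {"text": "Nothing to see here."},
]
selected = list(preselect(examples))
```

Only the first example is selected here, with one span per label, so all labels can be reviewed in a single pass over the text.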