Return pattern ID instead of pattern number -- ner.manual with pattern jsonl

Hi!

I'm using ner.manual to train a model using patterns first. The patterns' jsonl looks like this:

{"id": "300386874", "label": "Objects", "pattern": [{"lower": "porcelain"}, {"lower": "ware"}]}

The interface currently shows me the following:
image
However, I would like the interface to display the pattern ID instead of the pattern number to go over the matches more easily.

I am not too sure how to do that. Any help is really appreciated!

Thanks!

Hi! This is definitely a good suggestion and I'll put it on my list of enhancements :+1: I think what makes this a bit tricky at the moment is that we're currently not tracking whether a pattern ID comes from the original file or whether it was auto-generated (in which case, displaying the ID is less useful). But we can probably find a solution for that.

In the meantime, a hacky(ish) solution would be to do something like this: read in your patterns, build a mapping of line number to ID, look up the line numbers from each example's meta and replace them with the IDs. Untested, but something like this should work:

import srsly

def add_pattern_ids(stream, patterns):
    patterns_data = srsly.read_jsonl(patterns)
    ids_by_line = {i: pattern.get("id") for i, pattern in enumerate(patterns_data)}
    for eg in stream:
        meta = eg.get("meta", {})
        if "pattern" in meta:
            line_nos = meta["pattern"].split(", ")
            pattern_ids = [ids_by_line.get(line_no) for line_no in line_nos]
            eg["meta"]["pattern"] = ", ".join(pattern_ids)
        # Yield every example, not only the ones with a pattern match
        yield eg

# At the end of the recipe
stream = add_pattern_ids(stream, patterns)

Thank you so much for your help! I made a slight change to the code because it was returning None values. The line numbers split out of the meta were strings, while the keys of ids_by_line are integers, so the lookup couldn't find anything.

import srsly

def add_pattern_ids(stream, patterns):
    # Map each pattern's line number (0-based) to its "id" field, if any
    patterns_data = srsly.read_jsonl(patterns)
    ids_by_line = {i: pattern.get("id") for i, pattern in enumerate(patterns_data)}
    for eg in stream:
        meta = eg.get("meta", {})
        if "pattern" in meta:
            # The meta stores line numbers as a string, so convert them to int
            line_nos = [int(n) for n in meta["pattern"].split(", ") if n]
            pattern_ids = [str(ids_by_line.get(line_no)) for line_no in line_nos]
            eg["meta"]["pattern"] = ", ".join(pattern_ids)
        # Yield every example, not only the ones with a pattern match
        yield eg
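
For anyone who wants to try the lookup without running Prodigy, here is a minimal standalone sketch of the same idea. It assumes the meta stores the matched line numbers as a comma-separated string, as described in this thread. The function name replace_line_numbers_with_ids, the in-memory patterns_data list and the sample stream are made up for illustration, and falling back to the line number when a pattern has no "id" is a design choice of this sketch, not something Prodigy does.

```python
def replace_line_numbers_with_ids(stream, patterns_data):
    """Swap pattern line numbers in eg["meta"]["pattern"] for pattern IDs.

    patterns_data is a list of pattern dicts (a hypothetical in-memory
    stand-in for the contents of the patterns JSONL file).
    """
    # Keys are integers because enumerate() yields int positions
    ids_by_line = {i: p.get("id") for i, p in enumerate(patterns_data)}
    for eg in stream:
        meta = eg.get("meta", {})
        if "pattern" in meta:
            # Assumption: line numbers arrive as a string such as "0, 1"
            line_nos = [int(n) for n in str(meta["pattern"]).split(", ") if n]
            # Fall back to the line number when a pattern has no "id"
            ids = [str(ids_by_line.get(n) or n) for n in line_nos]
            meta["pattern"] = ", ".join(ids)
        # Pass through every example, matched or not
        yield eg


patterns_data = [
    {"id": "300386874", "label": "Objects",
     "pattern": [{"lower": "porcelain"}, {"lower": "ware"}]},
    {"label": "Objects", "pattern": [{"lower": "vase"}]},  # no "id" field
]
stream = [
    {"text": "A porcelain ware vase", "meta": {"pattern": "0, 1"}},
    {"text": "No match here"},
]
result = list(replace_line_numbers_with_ids(stream, patterns_data))
```

Looking up the string "0" in ids_by_line finds nothing, while the integer 0 finds the ID, which is the mismatch the int conversion works around.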

Thanks again. This was super helpful!

Thanks for the update, glad it worked! (And good point, I hadn't considered that :sweat_smile:)