No metadata when using `ner.silver-to-gold` recipe

I’m using the `ner.silver-to-gold` recipe to combine binary annotations produced with the `ner.teach` recipe. When I view the UI, I notice that the original metadata is missing. On closer inspection, it looks like the `"meta"` key is lost somewhere while running the `ner.silver-to-gold` recipe.

I looked over the recipe’s code in the explosion/prodigy-recipes repo and tried to debug it, but found that the code isn’t runnable with my current version of Prodigy (v1.18.2). I changed course and looked at Prodigy’s source code in my environment, and found that it is significantly different from the code in the repo.

I have two questions from all this:

  1. How do I get the metadata in my “silver” data to show up in the ner.silver-to-gold UI?
  2. Where do I find current source code for a Prodigy recipe seeing that the public repo is not actively maintained?

Welcome to the forum @it176131! :waving_hand:

> Where do I find current source code for a Prodigy recipe seeing that the public repo is not actively maintained?

You're absolutely right, and we appreciate you pointing it out. Our public recipes repo does lag behind our current releases, and that's something we need to improve. We're a small team stretched across core development and user support, so keeping the open source recipes repo perfectly in sync is challenging, which is why we really appreciate posts like this one.

In the meantime, the source code installed in your environment is the best place to look for the current implementation. This is actually why we stopped shipping Cython-compiled source from v1.16 onwards: to make the code more transparent and easier to debug and understand directly.

The source code for the `ner.silver-to-gold` recipe lives in your Prodigy installation path, under `recipes/ner.py`. If you need to double-check where Prodigy is installed on your machine, run `prodigy stats`, which prints the path to stdout under the `Location` key.
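If you'd rather find the path programmatically, the standard library can locate any installed package. This is a generic sketch (not a Prodigy API); it's demonstrated with a stdlib package so it runs anywhere, but for Prodigy you would call it with `"prodigy"` and then look under `<location>/recipes/ner.py`:

```python
import importlib.util
from pathlib import Path

def package_location(name):
    """Return the directory a package is installed in, or None if absent."""
    spec = importlib.util.find_spec(name)
    if spec is None or spec.origin is None:
        return None
    return Path(spec.origin).parent

# e.g. package_location("prodigy") on a machine with Prodigy installed;
# shown here with the stdlib "json" package:
print(package_location("json"))
```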

> How do I get the metadata in my “silver” data to show up in the ner.silver-to-gold UI?

The recipe's main job is to take all the different "silver" annotations you've made for the same piece of text and merge them into one new, combined example. When it creates this new merged example, the code was designed to only build the most essential parts: the text and the new, merged spans. It doesn't have a built-in mechanism to know which of the potentially conflicting meta fields from your different silver annotations to keep, so it simply doesn't copy any of them over to the new example. It's really hard to know upfront what logic to apply.
That said, since you have access to the source code, you can definitely persist this information by modifying the function that builds the new example, i.e. the `make_best` method called on line 390 of the recipe:

```python
stream = model.make_best(data)
```

This method is defined in `models/ner.py`. Here's an example modification that takes the `meta` from the first example in the group, but you can of course customize it if you need to merge the values of the `meta` field differently (make sure `copy` is imported at the top of the file):

```python
def make_best(self, examples: Iterable[TaskType]) -> Iterable[TaskType]:
    """Add spans to a dataset for the best predictions, using the model and
    previous annotation decisions.
    """
    log("MODEL: Get best predictions for examples")
    golds = merge_spans(examples)
    for batch in partition_all(32, golds):
        batch = list(batch)
        batch_texts = [eg["text"] for eg in batch]
        batch_annots = [eg["spans"] for eg in batch]
        beam = _BatchBeam(self.nlp, batch_texts, w=16, b=NER_DEFAULT_BEAM_DENSITY)
        for i, parse in enumerate(
            beam.predict_best(batch_annots, max_wrong=None, min_right=None)
        ):
            original_eg = batch[i]
            # copy the original example, which preserves "meta" and other keys
            eg = copy.deepcopy(original_eg)
            # overwrite spans rather than create the example from scratch
            eg["spans"] = parse
            eg[BINARY_ATTR] = False
            eg = set_hashes(eg)
            yield eg
```

Alternatively, you can do something similar directly at the recipe level by storing the `meta` values by input hash and then re-adding them to the stream:

```python
# Store meta information keyed by input hash, so we can add it back later
metas_by_hash = {}
for eg in data:
    if "meta" in eg and INPUT_HASH_ATTR in eg:
        metas_by_hash[eg[INPUT_HASH_ATTR]] = eg["meta"]

def add_meta_back(stream: StreamType) -> StreamType:
    for eg in stream:
        eg_copy = copy.deepcopy(eg)
        if INPUT_HASH_ATTR in eg_copy and eg_copy[INPUT_HASH_ATTR] in metas_by_hash:
            eg_copy["meta"] = metas_by_hash[eg_copy[INPUT_HASH_ATTR]]
        yield eg_copy

stream = Stream(GeneratorSource(iter(stream)), loader=load_noop, wrappers=[])
stream.apply(add_meta_back)
stream.apply(filter_stream)
```
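The cache-and-restore pattern can also be exercised in isolation, outside of a recipe. This is a minimal, self-contained sketch using plain dicts and a generator; it assumes Prodigy's input hash is stored under the `"_input_hash"` key, and the example data is invented:

```python
import copy

INPUT_HASH_ATTR = "_input_hash"  # key Prodigy uses for the input hash

def restore_meta(stream, metas_by_hash):
    """Re-attach cached "meta" values to examples, matched by input hash."""
    for eg in stream:
        eg_copy = copy.deepcopy(eg)
        h = eg_copy.get(INPUT_HASH_ATTR)
        if h in metas_by_hash:
            eg_copy["meta"] = metas_by_hash[h]
        yield eg_copy

# Hypothetical data: the silver example has meta, the merged one lost it
silver = [{"text": "hi", "_input_hash": 1, "meta": {"source": "a"}}]
merged = [{"text": "hi", "_input_hash": 1}]
cache = {eg[INPUT_HASH_ATTR]: eg["meta"] for eg in silver if "meta" in eg}
out = list(restore_meta(merged, cache))
print(out[0]["meta"])  # → {'source': 'a'}
```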

Thank you for the reply! I was looking at modifying the `make_best` function to keep the metadata, but I think your alternative approach may be less invasive.

For the metadata, is there a reason to keep the “score” key-value pair created by the ner.teach recipe?