How do I run ner.print-stream on synthetic training data?

Original Question

I am generating synthetic data to train an NER model. It consists of text and labeled character spans. It’s basically the “Training an additional entity type” task as outlined in the spaCy documentation. I am currently writing my data to a JSONL file whose lines look like this:

["Horses are too tall and they pretend to care about your feelings", {"entities": [[0, 6, "ANIMAL"]]}]

I’d like to use something like ner.print-stream to visualize this data. What’s the easiest way to do this?

I don’t think there’s anything that does this right now in Prodigy. I think my best bet is to write a custom version of the ner.print-stream recipe that reads the spans directly out of the JSONL instead of generating them with a spaCy model. This looks pretty straightforward except that I don’t know how to integrate with the streams and tasks interfaces.

  • Can I modify the format of my JSONL output so that get_stream can read it directly?
  • Alternatively, I suppose I could omit get_stream and create my own iterator over task objects directly from the JSONL. What is the format for those task objects? I don’t see that specified in either PRODIGY_README.html or the sample recipe code.

In general, what is the best way to enhance this workflow with Prodigy?

Figured Out Some Parts on My Own

My output wasn’t in the JSONL format Prodigy expects because the lines weren’t dictionaries. Changing the format to the following makes ner.print-stream work:

{"text": "Horses are too tall and they pretend to care about your feelings", "entities": [[0, 6, "ANIMAL"]]}
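For anyone with files in the old list-per-line layout, the change can be scripted. This is only a minimal sketch using the standard library; `to_dict_record` is a name made up for this example and is not part of Prodigy or spaCy.

```python
import json

def to_dict_record(line):
    """Turn '["text", {"entities": [...]}]' into a dict-style JSONL line."""
    text, annotations = json.loads(line)
    record = {"text": text, "entities": annotations.get("entities", [])}
    return json.dumps(record)

old_line = (
    '["Horses are too tall and they pretend to care about your feelings", '
    '{"entities": [[0, 6, "ANIMAL"]]}]'
)
new_line = to_dict_record(old_line)
print(new_line)
```

Applying this to every line of the old file and writing the results out one per line gives a file in the dictionary layout shown above.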

However, it highlights the standard NER entities predicted by the model instead of my custom ones.

I also overlooked the documentation for the prodigy.components.loaders.JSONL file loader in PRODIGY_README.html. I suppose I could use that to get my custom entities displayed.
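One likely reason the custom entities are ignored is the key name. As far as I understand it, Prodigy's NER tasks carry highlighted entities in a "spans" list of dictionaries with "start", "end" and "label" keys, rather than an "entities" list of triples. Treat that as an assumption to check against PRODIGY_README.html. Under that assumption, a conversion could look like the sketch below (`entities_to_spans` is a hypothetical helper, not a Prodigy function):

```python
def entities_to_spans(record):
    """Return a copy of the record with 'entities' triples rewritten as 'spans' dicts."""
    spans = [
        {"start": start, "end": end, "label": label}
        for start, end, label in record.get("entities", [])
    ]
    # Keep every other key (e.g. "text") and drop the original "entities" list.
    task = {key: value for key, value in record.items() if key != "entities"}
    task["spans"] = spans
    return task

record = {
    "text": "Horses are too tall and they pretend to care about your feelings",
    "entities": [[0, 6, "ANIMAL"]],
}
task = entities_to_spans(record)
print(task)
```

The offsets are character positions with an exclusive end, so `record["text"][0:6]` is "Horses".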

Any general advice on using Prodigy for this workflow still gratefully accepted. :smile:

Yes, the solution you came up with is correct and pretty much what I would have recommended :smiley:

You can probably do this even more simply in a custom recipe: ner.print-stream uses the model to set the entities, but in your case you already have them, so there’s no need to run the model over your texts. Here’s a simplified example based on the built-in ner.print-stream:

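from prodigy.components.printers import pretty_print_ner

def print_stream():
    # generate_synthetic_examples is a placeholder for your own generator of examples
    stream = generate_synthetic_examples()  # create your data
    pretty_print_ner(stream)  # print the examples with their entities highlighted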

Thanks. I ended up writing this.

import prodigy
from prodigy.components.printers import pretty_print_ner
from prodigy.core import recipe, recipe_args
from prodigy.components.loaders import get_stream

def print_stream(source=recipe_args['source']):
    stream = get_stream(source, rehash=True, input_key='text')
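The recipe above is cut off in this thread. For a quick check of the character offsets that doesn't go through Prodigy at all, a plain-Python sketch like the following may be enough. `mark_spans` is a hypothetical helper, not part of Prodigy, and it assumes the spans are dictionaries with "start", "end" and "label" keys that don't overlap.

```python
def mark_spans(text, spans):
    """Wrap each labeled span in [brackets] followed by its label.

    Assumes the spans do not overlap; they are processed in order of start offset.
    """
    pieces = []
    cursor = 0
    for span in sorted(spans, key=lambda s: s["start"]):
        pieces.append(text[cursor:span["start"]])  # unlabeled text before the span
        pieces.append("[" + text[span["start"]:span["end"]] + " " + span["label"] + "]")
        cursor = span["end"]
    pieces.append(text[cursor:])  # trailing unlabeled text
    return "".join(pieces)

marked = mark_spans(
    "Horses are too tall",
    [{"start": 0, "end": 6, "label": "ANIMAL"}],
)
print(marked)  # [Horses ANIMAL] are too tall
```

If the bracketed text isn't the entity you meant to label, the offsets in the synthetic data are off.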