How do I ner.print-stream on synthetic training data?

wpm · January 15, 2018, 4:55pm

Original Question

I am generating synthetic data to train an NER model. It consists of text and labeled character spans. It’s basically the “Training an additional entity type” task as outlined in the spaCy documentation. I am currently writing my data to a JSONL file whose lines look like this:

["Horses are too tall and they pretend to care about your feelings", {"entities": [[0, 6, "ANIMAL"]]}]

I’d like to use something like ner.print-stream to visualize this data. What’s the easiest way to do this?

I don’t think there’s anything that does this right now in Prodigy. I think my best bet is to write a custom version of the ner.print-stream recipe that reads the spans directly out of the JSONL instead of generating them with a spaCy model. This looks pretty straightforward except that I don’t know how to integrate with the streams and tasks interfaces.

Can I modify the format of my JSONL output so that get_stream can read it directly?
Alternatively, I suppose I could omit get_stream and create my own iterator over task objects directly from the JSONL. What is the format for those task objects? I don’t see that specified in either PRODIGY_README.html or the sample recipe code.

In general, what is the best way to enhance this workflow with Prodigy?

Figured Out Some Parts on My Own

My output format wasn’t proper JSONL because the lines weren’t dictionaries. Changing the format to the following makes ner.print-stream work:

{"text": "Horses are too tall and they pretend to care about your feelings", "entities": [[0, 6, "ANIMAL"]]}

However, it highlights the standard NER entities instead of my custom ones.
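For anyone with files in the original list-per-line layout, here is a minimal sketch of the format change described above, using only the standard library. The function name list_line_to_dict_line is hypothetical, and the sketch assumes every line is a two-element JSON list of text plus a dictionary with an "entities" key:

```python
import json


def list_line_to_dict_line(line):
    """Rewrite one '[text, {"entities": [...]}]' line as a JSON object line.

    Hypothetical helper: assumes the two-element list layout from the
    original question.
    """
    text, annotations = json.loads(line)
    record = {"text": text, "entities": annotations["entities"]}
    return json.dumps(record)


old_line = '["Horses are too tall", {"entities": [[0, 6, "ANIMAL"]]}]'
new_line = list_line_to_dict_line(old_line)
```

Each converted line is then a dictionary with a "text" key, which is what made ner.print-stream accept the file.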

I also overlooked the documentation for the prodigy.components.loaders.JSONL file loader in PRODIGY_README.html. I suppose I could use that to get my custom entities displayed.

Any general advice on using Prodigy for this workflow is still gratefully accepted.

ines · January 15, 2018, 8:59pm

Yes, the solution you came up with is correct and pretty much what I would have recommended.

You can probably do this even more simply in a custom recipe. ner.print-stream uses the model to set the entities, but in your case you already have them, so there’s no need to run the model over your texts. Here’s a simplified example based on the built-in ner.print-stream:

import prodigy
from prodigy.components.printers import pretty_print_ner

@prodigy.recipe('ner.print-stream')
def print_stream():
    # generate_synthetic_examples is a placeholder for your own generator
    stream = generate_synthetic_examples()  # create your data
    pretty_print_ner(stream)
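As a rough sketch of what such a generator could yield: this assumes that the printer reads pre-set entities from a "spans" list of dictionaries with "start", "end" and "label" keys next to "text", which the thread itself does not spell out. The names to_task and tasks_from_lines are hypothetical, and the input is the list-per-line layout from the original question:

```python
import json


def to_task(text, annotations):
    """Build a task dict from text plus {"entities": [[start, end, label], ...]}.

    Assumption: tasks carry character offsets in a "spans" list.
    """
    spans = [
        {"start": start, "end": end, "label": label}
        for start, end, label in annotations.get("entities", [])
    ]
    return {"text": text, "spans": spans}


def tasks_from_lines(lines):
    """Yield one task per non-empty line of the form [text, {"entities": [...]}]."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        text, annotations = json.loads(line)
        yield to_task(text, annotations)


example = '["Horses are too tall", {"entities": [[0, 6, "ANIMAL"]]}]'
tasks = list(tasks_from_lines([example, ""]))
```

A generator like this could stand in for generate_synthetic_examples() in the recipe, with an open file handle passed as lines.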

wpm · January 16, 2018, 5:50pm

Thanks. I ended up writing this.

import prodigy
from prodigy.components.printers import pretty_print_ner
from prodigy.core import recipe_args
from prodigy.components.loaders import get_stream

@prodigy.recipe('ner.print-stream')
def print_stream(source=recipe_args['source']):
    stream = get_stream(source, rehash=True, input_key='text')
    pretty_print_ner(stream)
