Original Question
I am generating synthetic data to train an NER model. It consists of text and labeled character spans. It’s basically the “Training an additional entity type” task as outlined in the spaCy documentation. I am currently writing my data to a JSONL file whose lines look like this:
["Horses are too tall and they pretend to care about your feelings", {"entities": [[0, 6, "ANIMAL"]]}]
I’d like to use something like ner.print-stream to visualize this data. What’s the easiest way to do this?
I don’t think there’s anything that does this right now in Prodigy. I think my best bet is to write a custom version of the ner.print-stream recipe that reads the spans directly out of the JSONL instead of generating them with a spaCy model. This looks pretty straightforward except that I don’t know how to integrate with the streams and tasks interfaces.
- Can I modify the format of my JSONL output so that get_stream can read it directly?
- Alternately, I suppose I could omit get_stream and create my own iterator over task objects directly from the JSONL. What is the format for those task objects? I don’t see that specified in either PRODIGY_README.html or the sample recipe code.
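To make the second option concrete, here is a minimal sketch of an iterator over task-like dictionaries built straight from the JSONL lines. It assumes (not confirmed anywhere in this thread) that a task is a plain dictionary with a "text" key and a "spans" list whose items have "start", "end" and "label" keys. The names to_task and iter_tasks are hypothetical helpers, not part of Prodigy.

```python
import json


def to_task(line):
    """Convert one line of the original list format into a task-like dict.

    Assumed target layout: {"text": ..., "spans": [{"start", "end", "label"}]}.
    """
    text, annotations = json.loads(line)
    spans = [
        {"start": start, "end": end, "label": label}
        for start, end, label in annotations["entities"]
    ]
    return {"text": text, "spans": spans}


def iter_tasks(lines):
    """Yield one task dict per non-empty input line."""
    for line in lines:
        if line.strip():
            yield to_task(line)


sample = (
    '["Horses are too tall and they pretend to care about your feelings", '
    '{"entities": [[0, 6, "ANIMAL"]]}]'
)
tasks = list(iter_tasks([sample]))
```

With a real file, the same generator could be fed an open file handle, since iterating over a file yields its lines.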
In general, what is the best way to enhance this workflow with Prodigy?
Figured Out Some Parts on My Own
My output format wasn’t usable as Prodigy JSONL input because the lines were lists rather than dictionaries. Changing the format to the following makes ner.print-stream work:
{"text": "Horses are too tall and they pretend to care about your feelings", "entities": [[0, 6, "ANIMAL"]]}
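For anyone with existing files in the old list format, a small conversion sketch follows. It uses only the standard json module, and convert_line is a hypothetical helper name chosen for illustration.

```python
import json


def convert_line(line):
    """Rewrite a [text, {"entities": [...]}] line as a single JSON object."""
    text, annotations = json.loads(line)
    return json.dumps({"text": text, "entities": annotations["entities"]})


old = (
    '["Horses are too tall and they pretend to care about your feelings", '
    '{"entities": [[0, 6, "ANIMAL"]]}]'
)
new = convert_line(old)
```

Applying convert_line to every line of the old file and writing the results out, one per line, gives a file where each line is a dictionary.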
However, it highlights the standard NER entities predicted by the spaCy model instead of my custom ones.
I also overlooked the documentation for the prodigy.components.loaders.JSONL file loader in PRODIGY_README.html. I suppose I could use that to get my custom entities displayed.
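In the meantime, a plain-Python stand-in is enough to eyeball the custom entities without going through Prodigy at all. This sketch reads a line in the dictionary format above and marks each span inline. It assumes the spans do not overlap, and mark_entities is a hypothetical helper, not a Prodigy function.

```python
import json


def mark_entities(text, entities):
    """Return text with each [start, end, label] span shown as [span](LABEL).

    Assumes spans are character offsets and do not overlap.
    """
    pieces = []
    cursor = 0
    for start, end, label in sorted(entities):
        pieces.append(text[cursor:start])  # untouched text before the span
        pieces.append("[" + text[start:end] + "](" + label + ")")
        cursor = end
    pieces.append(text[cursor:])  # trailing text after the last span
    return "".join(pieces)


line = (
    '{"text": "Horses are too tall and they pretend to care about your feelings", '
    '"entities": [[0, 6, "ANIMAL"]]}'
)
record = json.loads(line)
marked = mark_entities(record["text"], record["entities"])
print(marked)
```

This also doubles as a quick check that the generated character offsets line up with the intended words.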
Any general advice on using Prodigy for this workflow still gratefully accepted.