Original Question
I am generating synthetic data to train an NER model. It consists of text and labeled character spans. It’s basically the “Training an additional entity type” task as outlined in the spaCy documentation. I am currently writing my data to a JSONL file whose lines look like this:
["Horses are too tall and they pretend to care about your feelings", {"entities": [[0, 6, "ANIMAL"]]}]
I’d like to use something like ner.print-stream to visualize this data. What’s the easiest way to do this?
I don’t think there’s anything that does this right now in Prodigy. I think my best bet is to write a custom version of the ner.print-stream recipe that reads the spans directly out of the JSONL instead of generating them with a spaCy model. This looks pretty straightforward except that I don’t know how to integrate with the streams and tasks interfaces.
- Can I modify the format of my JSONL output so that get_stream can read it directly?
- Alternately, I suppose I could omit get_stream and create my own iterator over task objects directly from the JSONL. What is the format for those task objects? I don’t see that specified in either PRODIGY_README.html or the sample recipe code.
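For the second option, such an iterator might look like the sketch below. It assumes that a task object is a plain dictionary with a "text" key and a "spans" list of {"start", "end", "label"} dictionaries; that format is an assumption here, not something confirmed in this thread, and read_tasks is a hypothetical name rather than a Prodigy function.

```python
import json
from typing import Dict, Iterable, Iterator


def read_tasks(lines: Iterable[str]) -> Iterator[Dict]:
    """Yield one task dict per non-empty JSONL line.

    Each input line is expected to look like:
    [text, {"entities": [[start, end, label], ...]}]
    The output shape ("text" plus "spans") is an assumed task format.
    """
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        text, annotations = json.loads(line)
        spans = [
            {"start": start, "end": end, "label": label}
            for start, end, label in annotations.get("entities", [])
        ]
        yield {"text": text, "spans": spans}


# Example using the line from the question.
sample = [
    '["Horses are too tall and they pretend to care about your feelings", '
    '{"entities": [[0, 6, "ANIMAL"]]}]'
]
tasks = list(read_tasks(sample))
```

With a real file, the same function could be called as read_tasks(open(path, encoding="utf8")), since a file object iterates over its lines.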
In general, what is the best way to enhance this workflow with Prodigy?
Figured Out Some Parts on My Own
My output format didn’t work because the lines weren’t dictionaries: each line was a JSON array rather than a JSON object with a "text" key. Changing the format to the following makes ner.print-stream work:
{"text": "Horses are too tall and they pretend to care about your feelings", "entities": [[0, 6, "ANIMAL"]]}
However, the output highlights the standard entities predicted by the spaCy model instead of my custom ones.
I also overlooked the documentation for the prodigy.components.loaders.JSONL file loader in PRODIGY_README.html. I suppose I could use that to get my custom entities displayed.
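As a quick check that doesn’t depend on Prodigy at all, the custom spans can also be printed inline with plain Python. This is only a sketch: render_entities is a hypothetical helper, not part of Prodigy or spaCy, and it assumes the spans in each record do not overlap.

```python
import json


def render_entities(line: str) -> str:
    """Return the record's text with each labeled span wrapped as [span LABEL].

    Expects one JSONL line of the form:
    {"text": ..., "entities": [[start, end, label], ...]}
    Assumes spans are character offsets and do not overlap.
    """
    record = json.loads(line)
    text = record["text"]
    pieces = []
    cursor = 0
    # Process spans left to right so the untouched text between them is kept.
    for start, end, label in sorted(record.get("entities", [])):
        pieces.append(text[cursor:start])
        pieces.append(f"[{text[start:end]} {label}]")
        cursor = end
    pieces.append(text[cursor:])
    return "".join(pieces)


sample = (
    '{"text": "Horses are too tall and they pretend to care about your feelings", '
    '"entities": [[0, 6, "ANIMAL"]]}'
)
rendered = render_entities(sample)
print(rendered)
```

Applied to every line of the file, this gives a rough view of whether the character offsets actually line up with the intended words before the data goes anywhere near training.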
Any general advice on using Prodigy for this workflow still gratefully accepted. 
