I have a file called 1.jsonl with the following contents:
{"text": "SECTION 2. KENTUCKY LAW TO APPLY \n\nThis Agreement shall be construed and the provisions thereof interpreted under and in accordance with laws of Kentucky.", "entities": [[8, 9, "SECTION_NUMBER_ENTITY"], [11, 19, "JURISDICTION_ENTITY"], [145, 153, "JURISDICTION_ENTITY"]]}
Thanks for the report! I just did some experiments and it looks like it’s related to this fix. Prodigy now checks whether the terminal supports ANSI colors before adding the escape sequences (to prevent unreadable markup).
But for some reason, this check always returns False when the output is piped to less. I think the approach we’re using mistakenly excludes this case. Anyway, sorry about this – I think I already know how to fix it. Edit: Fixed!
If your JSONL file isn’t very large, you could just not pipe the output to less and the colours should display fine.
I’m running in Terminal on a Mac. The terminal properties are set to display ANSI colors. If instead I run the built-in recipe that runs NER
prodigy ner.print-stream en 1.jsonl
it does highlight with colors. So there is something wrong with my custom recipe. (Though I swear I’ve got it to work before.)
Do I have to do something special because I’m using custom entity names?

Edit: The issue isn’t custom entity names. I see the same behavior if I change the names to CARDINAL and GPE.
However, since your recipe uses get_stream (which will then use the JSONL loader, since your file is a .jsonl file), the entities need to be in Prodigy’s expected format, i.e. a list of "spans", containing one dictionary per span. For example:
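A small sketch of what that conversion might look like, using the file from the first post. The field names (`start`, `end`, `label`) follow the span-dictionary format described above; the helper name `entities_to_spans` is just made up for illustration:

```python
import json

# Hypothetical converter: turns [start, end, label] triples stored under
# "entities" into a list of Prodigy-style span dictionaries under "spans".
def entities_to_spans(task):
    task["spans"] = [
        {"start": start, "end": end, "label": label}
        for start, end, label in task.pop("entities", [])
    ]
    return task

line = '{"text": "SECTION 2. KENTUCKY LAW TO APPLY", "entities": [[8, 9, "SECTION_NUMBER_ENTITY"], [11, 19, "JURISDICTION_ENTITY"]]}'
print(json.dumps(entities_to_spans(json.loads(line))))
```

Running this over each line of 1.jsonl (and writing the results back out) should give you a file the JSONL loader can highlight.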
I got confused here because there are multiple serialization formats for this markup information. Prodigy uses span dictionaries. spaCy uses entity lists. The spaCy command line tool uses something else again. I’m trying to settle on one format, but I’m not sure which one to choose.
Yeah, I definitely know what you mean – sorry if this is confusing. (We're currently working on improving the formats used by spaCy to make them more internally consistent.)
The reason we decided to use a different format for Prodigy is that we needed a format that was more open and flexible – and less specific to the individual tasks. In spaCy, entities are entities – in Prodigy, "spans" can be used for entities, but in the end, they're whatever you want them to be. They're merely describing a span of text with an optional label and optional other parameters. What you do with them is entirely up to you. JSON objects are also generally nicer to work with if you need the data to flow through the back-end, front-end and REST API.
In case you haven't seen it already, you might find the ner.gold-to-spacy recipe useful. It takes a dataset of NER annotations, and converts them to spaCy's formats (either offsets or BILUO tags).
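In case it helps make the two formats concrete, the offsets direction of that conversion can be sketched roughly like this (this is not the actual recipe source, just a minimal illustration with a made-up helper name):

```python
# Rough sketch: convert a Prodigy task with "spans" dictionaries into
# spaCy's (text, annotations) offset format, where each entity is a
# (start, end, label) tuple.
def spans_to_offsets(task):
    entities = [
        (span["start"], span["end"], span["label"])
        for span in task.get("spans", [])
    ]
    return (task["text"], {"entities": entities})
```

The recipe also supports BILUO tags as an alternative output, which encodes the same information per token instead of per character offset.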
Your explanation of the difference between entities and spans makes sense. I suspect that standardizing the data structures between spaCy and Prodigy (choosing either dictionaries or lists of tuples) while keeping the “entities” vs. “spans” nomenclature would clear this up. If the only thing differing between two data structures were the keyword, it would be apparent that the distinction between them is semantic.
ner.gold-to-spacy is helpful. I hadn’t noticed that before.
The one last data format question that is confusing to me is how the spaCy command line application fits in. Right now I’m writing my own programs that implement training loops. I’m thinking I’ll steal the spacy command’s idea of setting hyperparameters via environment variables, because that seems like a good way to handle that whole mess. But I have the feeling that I’m just reimplementing stuff that already exists inside the spacy command. I’d rather just use spacy instead, but it takes yet another input format.
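For what it’s worth, the environment-variable approach is easy to reproduce in a custom training loop. A minimal sketch (the variable names and defaults here are invented, not what spacy itself reads):

```python
import os

# Hypothetical helper: read a hyperparameter from an environment variable,
# coercing it to the type of the default and falling back to the default
# when the variable is unset.
def hyper(name, default):
    value = os.environ.get(name)
    if value is None:
        return default
    return type(default)(value)

# Example usage in a custom training loop:
dropout = hyper("DROPOUT", 0.2)       # e.g. DROPOUT=0.5 overrides this
learn_rate = hyper("LEARN_RATE", 0.001)
n_iter = hyper("N_ITER", 10)
```

This keeps the loop’s defaults in code while letting you sweep settings from the shell without touching the script.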