hi @zparcheta!
Thanks for your question and welcome to the Prodigy community 
Yes, I think you're right: I ran into the same problem. I went to the source code and found that we don't even use the delimiter as an argument.
Prodigy pro tip: you can find the built-in recipe code by looking at the Location: path of prodigy stats. This one is in the file components/loaders.py. You could even modify this now so that your built-in recipe works correctly.
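If you'd rather look up that location from Python, here is a minimal sketch. The helper name package_dir is made up for illustration, and it assumes prodigy is importable in the environment you run it from:

```python
import importlib.util
from pathlib import Path


def package_dir(name):
    """Return the install directory of an importable package, or None.

    Hypothetical helper for illustration, not part of Prodigy.
    """
    spec = importlib.util.find_spec(name)
    if spec is None or spec.origin is None:
        # Package is not installed, or has no single file location
        return None
    return Path(spec.origin).parent


if __name__ == "__main__":
    # Assumes Prodigy is installed; prints None otherwise
    print(package_dir("prodigy"))
```

The printed directory should correspond to the Location: path mentioned above, so components/loaders.py would be looked up relative to it.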
The good news is this looks like a minor fix that I can put in a ticket. It likely won't get released until our next patch.
In the meantime, I've created a custom Python script that can convert your .csv to a .jsonl (Prodigy's preferred format):
# csv_to_jsonl.py
import csv
import json
import typer


def csv_to_jsonl(csv_file, jsonl_file, delimiter):
    # Read every row of the delimited file into a list of dicts
    with open(csv_file, "r", newline="") as file:
        reader = csv.DictReader(file, delimiter=delimiter)
        rows = list(reader)
    jsonl_data = []
    for row in rows:
        # "text" becomes the task text; all remaining columns go into "meta"
        json_data = {"text": row.pop("text"), "meta": row}
        jsonl_data.append(json.dumps(json_data))
    # Write one JSON object per line
    with open(jsonl_file, "w") as file:
        file.write("\n".join(jsonl_data))
    typer.echo(f"Conversion complete. JSONL file created: {jsonl_file}")


def main(csv_file: str, jsonl_file: str, delimiter: str):
    csv_to_jsonl(csv_file, jsonl_file, delimiter)


if __name__ == "__main__":
    typer.run(main)
Let's assume you have this input.csv (semicolon delimited):
text;column1;column2
"This is a sentence";"Some meta info";"Some other meta info"
The additional columns are optional and you should only include them if you want them to be included in the interface as metadata.
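Since only the text column is required, a quick header check can catch a mismatched delimiter before you convert anything. This is just a sketch: check_header is a made-up helper, and it assumes the column really is named "text" as in the example above. With the wrong delimiter, the whole header line is parsed as a single column, so the check fails with a readable message instead of a bare KeyError later on.

```python
import csv
import io


def check_header(csv_text, delimiter):
    """Return the parsed column names, or raise if there is no "text" column.

    Hypothetical helper for illustration.
    """
    reader = csv.DictReader(io.StringIO(csv_text), delimiter=delimiter)
    fieldnames = reader.fieldnames or []
    if "text" not in fieldnames:
        raise ValueError(
            f'No "text" column found with delimiter {delimiter!r}; '
            f"parsed columns: {fieldnames}"
        )
    return fieldnames


header = "text;column1;column2\n"
print(check_header(header, ";"))  # ['text', 'column1', 'column2']
# check_header(header, ",") would raise ValueError, because the header
# is then read as the single column "text;column1;column2"
```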
You can convert it to a .jsonl by running:
python csv_to_jsonl.py input.csv output.jsonl ";"
# output.jsonl
{"text": "This is a sentence", "meta": {"column1": "Some meta info", "column2": "Some other meta info"}}
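If you want to try the row-to-task mapping without touching any files, here is a small in-memory sketch of the same idea. The helper name rows_to_tasks is made up for this example; it only uses the standard library and the sample data from above.

```python
import csv
import io
import json


def rows_to_tasks(csv_text, delimiter=";"):
    """Yield one task dict per CSV row (hypothetical helper for illustration)."""
    reader = csv.DictReader(io.StringIO(csv_text), delimiter=delimiter)
    for row in reader:
        # "text" becomes the task text; every remaining column goes into "meta"
        text = row.pop("text")
        yield {"text": text, "meta": row}


sample = (
    "text;column1;column2\n"
    '"This is a sentence";"Some meta info";"Some other meta info"\n'
)
for task in rows_to_tasks(sample):
    print(json.dumps(task))
```

Running it prints the same line as the output.jsonl shown above.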
As a second option, you can use it as a custom loader by turning the script into a Prodigy recipe and piping its output into another recipe:
# csv_to_jsonl_prodigy.py
import csv
import json
import typer
import prodigy


@prodigy.recipe("csv_to_jsonl")
def csv_to_jsonl(csv_file, delimiter):
    with open(csv_file, "r", newline="") as file:
        reader = csv.DictReader(file, delimiter=delimiter)
        rows = list(reader)
    for row in rows:
        # Print one JSON task per line so the output can be piped into another recipe
        json_data = {"text": row.pop("text"), "meta": row}
        print(json.dumps(json_data))


def main(csv_file: str, delimiter: str):
    csv_to_jsonl(csv_file, delimiter)


if __name__ == "__main__":
    typer.run(main)
Notice that the only differences are adding the Prodigy decorator (@prodigy.recipe("csv_to_jsonl")), importing prodigy, and printing each JSON line instead of writing it to a .jsonl file. You can then pipe its output straight into another recipe:
python -m prodigy csv_to_jsonl input.csv ";" -F csv_to_jsonl_prodigy.py | python -m prodigy ner.manual ner_data blank:en --label ORG,PERSON -
Sorry again for the hassle but thank you a lot for bringing this to our attention!