Load dataset from recipe

Hi,

I was wondering if there was a method provided to load raw data into a dataset after it has been created?

I see that there is a method provided to create the dataset
db = connect()
db.add_dataset(datasetName)
assert datasetName in db

But I don’t see a method to load data from a jsonl file in after I create the dataset.

The jsonl file I’m trying to load contains just the text of interest and some basic metadata.

Thanks!

Yes, you can use the db-in command, which takes data in any format readable by Prodigy:

prodigy db-in new_set /tmp/news_headlines.jsonl
✨ Imported 1550 annotations to 'new_set'.

The --answer argument lets you set an answer – accept (default), reject or ignore – on the imported examples. This is useful if you’re importing annotations created by other tools. You can also use the --dry argument to perform a dry run and see what would happen, so you can check that everything works as expected.

If you want to do this from Python, you can use the db.add_examples() method, which takes a list of examples and a list of dataset IDs:

db = connect()
db.add_examples(your_loaded_examples, datasets=['your_dataset'])

Thanks Ines! I’m always amazed at how quickly you respond. I don’t think I did a good job of explaining what I want to accomplish. I’m trying to load the data directly into the dataset without having to call db-in from the command line.

For the example, say the jsonl file I create looks like this

{"text": "The vehicle is a red Toyota", "meta": {"source": "magazine", "frequency": 107}}
{"text": "The vehicle is a blue Subaru", "meta": {"source": "book", "frequency": 93}}

So I want to load this file into a dataset:

db = connect()
db.add_dataset(datasetName)
assert datasetName in db
datasets = [datasetName]
filename = datasetName + '.jsonl'
with open(filename, 'w') as f:
    for item in dataForImport:
        f.write(json.dumps(item) + "\n")
filepath = os.path.join(cwd, filename)
jsonData = JSONL(filepath)
db.add_examples(jsonData, datasets=datasets)

My hope is to do this part automatically so I can load the data through a script and let my annotator use ner.manual on it later. Does that make sense?

Thanks for the clarification. I’m still not sure I fully understand what you’re trying to do in your script, but I think the solution might be easier than you think :smiley: If I read your code correctly, you’re writing all examples to a file and then loading that file back in? Couldn’t you just load the examples and write them to the dataset directly?

db.add_examples(dataForImport, datasets=['some_dataset'])

Also, a quick note on JSONL: JSONL is newline-delimited JSON, so one JSON object per line. The easiest way to read it in is to read in every line and then call json.loads on the line. The jsonlines library also has some tools for that if you prefer. Since Prodigy reads and writes JSONL a lot, you can also use the internal helper function read_jsonl:
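For instance, reading a JSONL file with just the standard library might look like this (the function name and path are only placeholders for illustration):

```python
import json

def load_jsonl(path):
    # JSONL is one JSON object per line, so we call json.loads once per
    # non-empty line and collect the resulting dicts
    examples = []
    with open(path, encoding="utf8") as f:
        for line in f:
            line = line.strip()
            if line:
                examples.append(json.loads(line))
    return examples
```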

from prodigy.util import read_jsonl

loaded_examples = list(read_jsonl('/path/to/data.jsonl'))
db.add_examples(loaded_examples, datasets=['some_dataset'])

Hi Ines,

You are correct in what I am trying to do! I think the problem might actually be my input file. When I try to load it into a dataset via the command line, it works as I expect, but when I attempt to load it using the add_examples function (and the code from above), I get a key error on the input hash:

> File "C:\Python36\lib\site-packages\prodigy\components\db.py", line 289, in add_examples
>     eg = Example.create(input_hash=eg[INPUT_HASH_ATTR],
> KeyError: '_input_hash'

Do I need to generate that input hash and have it as an attribute stored with each entry in my input file? Right now I just have a text attribute and a meta attribute.

> {"text": "The vehicle is a red Toyota", "meta": {"source": "magazine", "frequency": 107}}
> {"text": "The vehicle is a blue Subaru", "meta": {"source": "book", "frequency": 93}}

Also, if I’ve gone beyond the scope of this support forum, I completely understand if you need to leave me be to figure this out.

Ah yes, if the data isn't coming from another Prodigy recipe, you do need to set the hashes manually. You can do this using the set_hashes helper, which takes a single annotation task dictionary and adds the hashes:

from prodigy import set_hashes

examples = [set_hashes(eg) for eg in examples]
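If it helps to see what this is conceptually doing, here is a purely illustrative stand-in – this is not Prodigy's actual hashing scheme, just a sketch showing that the hashes are deterministic digests of the example's content:

```python
import hashlib
import json

def set_hashes_sketch(eg):
    # Illustrative only: derive a stable "input hash" from the raw input
    # ("text") and a "task hash" from the full example. Prodigy's real
    # set_hashes uses its own scheme and configurable attributes.
    def digest(obj):
        raw = json.dumps(obj, sort_keys=True).encode("utf8")
        return int(hashlib.md5(raw).hexdigest()[:8], 16)
    return {**eg,
            "_input_hash": digest({"text": eg.get("text")}),
            "_task_hash": digest(eg)}
```

The key point is determinism: the same input text always produces the same input hash, which is how duplicate inputs can be detected.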

Yes! It worked. Thanks Ines.
