Load dataset from recipe

theoldhat · October 12, 2018, 4:26pm

Hi,

I was wondering if there was a method provided to load raw data into a dataset after it has been created?

I see that there is a method provided to create the dataset
db = connect()
db.add_dataset(datasetName)
assert datasetName in db

But I don’t see a method to load data from a jsonl file in after I create the dataset.

The jsonl file I’m trying to load contains just the text of interest and some basic metadata.

Thanks!

ines · October 12, 2018, 4:30pm

Yes, you can use the db-in command, which takes data in any format readable by Prodigy:

prodigy db-in new_set /tmp/news_headlines.jsonl
✨ Imported 1550 annotations to 'new_set'.

The --answer argument lets you set an answer – accept (default), reject or ignore – on the imported examples. This is useful if you’re importing annotations created by other tools. You can also use the --dry argument to perform a dry run and see what would happen, so you can check that everything works as expected.

If you want to do this from Python, you can use the db.add_examples() method, which takes a list of examples and a list of dataset IDs:

db = connect()
db.add_examples(your_loaded_examples, datasets=['your_dataset'])

theoldhat · October 12, 2018, 6:01pm

Thanks Ines! I’m always amazed at how quickly you respond. I don’t think I did a good job of explaining what I want to accomplish. I’m trying to load the data directly into the dataset without having to call db-in from the command line.

For the example, say the jsonl file I create looks like this

{"text": "The vehicle is a red Toyota", "meta": {"source": "magazine", "frequency": 107}}
{"text": "The vehicle is a blue Subaru", "meta": {"source": "book", "frequency": 93}}

So I want to load this file into a dataset:

db = connect()
db.add_dataset(datasetName)
assert datasetName in db
datasets = []
datasets.append(datasetName)
filename = datasetName + '.jsonl'
with open(filename, 'w') as f:
    for item in dataForImport:
        f.write("%s\n" % item)
filepath = cwd + "\\" + filename
jsonData = JSONL(filepath)
db.add_examples(jsonData,datasets[0])

My hope is to do this part automatically, so I can load the data through a script and let my annotator use ner.manual on it later. Does that make sense?

ines · October 13, 2018, 10:27am

Thanks for the clarification. I’m still not sure I fully understand what you’re trying to do in your script, but I think the solution might be easier than you think. If I read your code correctly, you’re writing all examples to a file and then loading that file back in? Couldn’t you just load the examples and write them to the dataset directly?

db.add_examples(dataForImport, datasets=['some_dataset'])

Also, a quick note on JSONL: JSONL is newline-delimited JSON, so one JSON object per line. The easiest way to read it in is to read in every line and then call json.loads on the line. The jsonlines library also has some tools for that if you prefer. Since Prodigy reads and writes JSONL a lot, you can also use the internal helper function read_jsonl:

from prodigy.util import read_jsonl

loaded_examples = list(read_jsonl('/path/to/data.jsonl'))
db.add_examples(loaded_examples, datasets=['some_dataset'])
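
For reference, the "read every line and call json.loads" approach can be sketched with only the standard library. This is a minimal illustration, not Prodigy's own implementation; the function name load_jsonl is made up for this example, and it assumes a UTF-8 file with one JSON object per line.

```python
import json


def load_jsonl(path):
    """Read newline-delimited JSON: one JSON object per non-empty line.

    `load_jsonl` is a hypothetical helper name, not part of Prodigy.
    """
    examples = []
    with open(path, encoding="utf8") as f:
        for line in f:
            line = line.strip()
            if line:  # skip blank lines, e.g. a trailing newline at the end
                examples.append(json.loads(line))
    return examples
```

The result is a plain list of dictionaries, so it can be passed on wherever a list of examples is expected.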

theoldhat · October 15, 2018, 1:59pm

Hi Ines,

You are correct in what I am trying to do! I think the problem might actually be my input file. When I load it into a dataset via the command line, it works as I expect, but when I attempt to load it using the add_examples function (and the code from above), I get a KeyError on the input hash:

> File "C:\Python36\lib\site-packages\prodigy\components\db.py", line 289, in add_examples
>     eg = Example.create(input_hash=eg[INPUT_HASH_ATTR],
> KeyError: '_input_hash'

Do I need to generate that input hash and have it as an attribute stored with each entry in my input file? Right now I just have a text attribute and a meta attribute.

> {"text": "The vehicle is a red Toyota", "meta": {"source": "magazine", "frequency": 107}}
> {"text": "The vehicle is a blue Subaru", "meta": {"source": "book", "frequency": 93}}

Also, if I’ve gone beyond the scope of this support forum, I completely understand if you need to leave me be to figure this out.

ines · October 15, 2018, 2:10pm

Ah yes, if the data isn't coming from another Prodigy recipe, you do need to set the hashes manually. You can do this using the set_hashes helper, which takes a single annotation task dictionary and adds the hashes:

from prodigy import set_hashes

examples = [set_hashes(eg) for eg in examples]
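
Putting the pieces from this thread together, a script could look roughly like the sketch below. It only uses the calls mentioned above (connect, add_dataset, set_hashes, add_examples); the wrapper name add_to_dataset and the import path prodigy.components.db for connect are assumptions for this example, so check them against your installed version.

```python
def add_to_dataset(examples, dataset):
    """Hash raw examples and store them in a Prodigy dataset.

    `add_to_dataset` is a hypothetical wrapper, not a Prodigy function.
    Returns the number of examples that were added.
    """
    # Imported here so the script only needs Prodigy when the import runs.
    # The module path for `connect` is an assumption.
    from prodigy import set_hashes
    from prodigy.components.db import connect

    db = connect()
    if dataset not in db:  # create the dataset on first use
        db.add_dataset(dataset)
    # Raw examples have no "_input_hash" yet, which is what caused
    # the KeyError above, so add the hashes before saving.
    hashed = [set_hashes(eg) for eg in examples]
    db.add_examples(hashed, datasets=[dataset])
    return len(hashed)
```

You would call it with the examples you already have in memory, for instance add_to_dataset(dataForImport, 'some_dataset'), as long as each item is a dictionary and not a string.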

theoldhat · October 15, 2018, 5:04pm

Yes! It worked. Thanks Ines.
