corrupted dataset?

The dataset that I’ve started building became corrupted during a recent textcat.teach session. When I attempt to dump the contents of the set, I get this error:

prodigy db-out ooo_seed

07:47:14 - DB: Initialising database MySQL
07:47:14 - DB: Connecting to database MySQL
07:47:14 - DB: Loading dataset 'ooo_seed' (282 examples)
Traceback (most recent call last):
  File "/opt/rh/rh-python36/root/usr/lib64/python3.6/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/opt/rh/rh-python36/root/usr/lib64/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home/bruce/ml/ml_venv/lib64/python3.6/site-packages/prodigy/__main__.py", line 323, in <module>
    plac.call(commands[command], arglist=args, eager=False)
  File "/home/bruce/ml/ml_venv/lib64/python3.6/site-packages/plac_core.py", line 328, in call
    cmd, result = parser.consume(arglist)
  File "/home/bruce/ml/ml_venv/lib64/python3.6/site-packages/plac_core.py", line 207, in consume
    return cmd, self.func(*(args + varargs + extraopts), **kwargs)
  File "/home/bruce/ml/ml_venv/lib64/python3.6/site-packages/prodigy/__main__.py", line 263, in db_out
    examples = DB.get_dataset(set_id)
  File "/home/bruce/ml/ml_venv/lib64/python3.6/site-packages/prodigy/components/db.py", line 286, in get_dataset
    return [eg.load() for eg in examples]
  File "/home/bruce/ml/ml_venv/lib64/python3.6/site-packages/prodigy/components/db.py", line 286, in <listcomp>
    return [eg.load() for eg in examples]
  File "/home/bruce/ml/ml_venv/lib64/python3.6/site-packages/prodigy/components/db.py", line 98, in load
    return ujson.loads(content)
ValueError: Unmatched '"' when decoding 'string'

Is there some way to correct the dataset in MySQL, or do I need to drop it and start over?

Thanks!

Hi! Do you have any idea what could have caused the corruption? What’s your data and workflow like? The example contents are dumped as JSON, so it’s pretty confusing that you ended up with corrupted data here :thinking:

The source of db.py is shipped with Prodigy, so maybe you could add a print statement that prints content just before the line it fails on? It’d be interesting to see what’s in it. You can use the following command to find the location of your Prodigy installation, btw:

python -c "import prodigy; print(prodigy.__file__)"

Once you’ve found the problematic example, it shouldn’t be a problem to fix it manually in the DB.
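
If you’d rather not touch the installed source, a standalone script along these lines should do the same job: it pulls the raw content for your dataset and reports every row that won’t parse. The table and column names (example, link, dataset, content) are what Prodigy’s MySQL schema uses as far as I know, and the connection details are placeholders, so adjust both for your setup:

import ujson
import mysql.connector  # pip install mysql-connector-python

conn = mysql.connector.connect(host="localhost", user="prodigy",
                               password="xxx", database="prodigy")
cur = conn.cursor()
# Fetch the raw JSON blobs for the dataset via the link table
cur.execute("""
    SELECT example.id, example.content FROM example
    JOIN link ON link.example_id = example.id
    JOIN dataset ON dataset.id = link.dataset_id
    WHERE dataset.name = %s""", ("ooo_seed",))
for example_id, content in cur:
    try:
        ujson.loads(content)
    except ValueError as err:
        print(example_id, err)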

Hi Ines - Thanks for the lightning-fast response!

It turns out that the input text was too long and the JSON was being truncated at 64K bytes when it was added to the example table.
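
For reference, MySQL’s TEXT and BLOB column types max out at 65,535 bytes, so the clipped rows all sit exactly at that boundary. A quick check like this confirmed it for me (connection details are placeholders again):

import mysql.connector

conn = mysql.connector.connect(host="localhost", user="prodigy",
                               password="xxx", database="prodigy")
cur = conn.cursor()
# Truncated rows are pinned at the 65,535-byte column limit
cur.execute("SELECT id, LENGTH(content) FROM example "
            "WHERE LENGTH(content) >= 65535")
for example_id, n_bytes in cur:
    print(example_id, n_bytes)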

I’ve updated my process to ensure that the text objects in my JSONL input files are shorter - MUCH shorter.

You could consider adding checks to Prodigy to ensure the JSON source added to the example table isn’t so long that the DB will truncate it. The exact limit is probably DB-specific; I’m using MySQL.
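
In the meantime, a pre-flight check on the input file is easy to roll yourself. This is only a sketch (the 65,535 figure is MySQL’s TEXT/BLOB limit, other backends will differ, and the filename is made up):

import sys
import ujson

MAX_BYTES = 65535  # MySQL TEXT/BLOB column limit

with open("my_input.jsonl", encoding="utf8") as f:
    for i, line in enumerate(f, 1):
        ujson.loads(line)  # also catches malformed JSON early
        n_bytes = len(line.encode("utf8"))
        # Annotation adds fields on top of the input, so leave headroom
        if n_bytes > MAX_BYTES:
            print(f"line {i}: {n_bytes} bytes, would be truncated",
                  file=sys.stderr)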

Once I figured out the DB layout, all I needed to do was delete the rows from the link table that pointed to the too-large example entries.
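
For anyone needing the same fix, it boiled down to one statement like the following. Back up the database first, and check the table names against your own schema before running anything destructive:

import mysql.connector

conn = mysql.connector.connect(host="localhost", user="prodigy",
                               password="xxx", database="prodigy")
cur = conn.cursor()
# Unlink every example whose content was clipped at the 64K limit
cur.execute("""
    DELETE FROM link WHERE example_id IN
        (SELECT id FROM example WHERE LENGTH(content) >= 65535)""")
conn.commit()
print(cur.rowcount, "link rows deleted")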

Yay, glad to hear it worked! :+1:

And thanks for the info, that’s good to know. It might make sense to just have a more generic warning in the stream if the incoming examples are surprisingly large like this. Realistically, if a user is passing around 64KB examples, they could end up having all kinds of other problems down the line and might not even realise it. So having at least a warning when you start the stream could probably be really useful here.
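
Roughly what I have in mind is a thin wrapper around the incoming stream, something like the sketch below. To be clear, this isn’t what Prodigy does today, just an illustration of the idea, and you could drop the same thing into a custom recipe yourself:

import sys
import ujson

WARN_BYTES = 64 * 1024  # warn well before typical DB column limits

def warn_on_large_tasks(stream):
    # Pass tasks through unchanged, but flag suspiciously large ones
    for eg in stream:
        n_bytes = len(ujson.dumps(eg).encode("utf8"))
        if n_bytes >= WARN_BYTES:
            print(f"Warning: incoming example is {n_bytes} bytes and "
                  f"may be truncated by your database", file=sys.stderr)
        yield eg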