corrupted dataset?

The dataset that I’ve started building became corrupted during a recent textcat.teach session. When I attempt to dump the contents of the set, I get this error:

prodigy db-out ooo_seed

07:47:14 - DB: Initialising database MySQL
07:47:14 - DB: Connecting to database MySQL
07:47:14 - DB: Loading dataset 'ooo_seed' (282 examples)
Traceback (most recent call last):
  File "/opt/rh/rh-python36/root/usr/lib64/python3.6/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/opt/rh/rh-python36/root/usr/lib64/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home/bruce/ml/ml_venv/lib64/python3.6/site-packages/prodigy/__main__.py", line 323, in <module>
    plac.call(commands[command], arglist=args, eager=False)
  File "/home/bruce/ml/ml_venv/lib64/python3.6/site-packages/plac_core.py", line 328, in call
    cmd, result = parser.consume(arglist)
  File "/home/bruce/ml/ml_venv/lib64/python3.6/site-packages/plac_core.py", line 207, in consume
    return cmd, self.func(*(args + varargs + extraopts), **kwargs)
  File "/home/bruce/ml/ml_venv/lib64/python3.6/site-packages/prodigy/__main__.py", line 263, in db_out
    examples = DB.get_dataset(set_id)
  File "/home/bruce/ml/ml_venv/lib64/python3.6/site-packages/prodigy/components/db.py", line 286, in get_dataset
    return [eg.load() for eg in examples]
  File "/home/bruce/ml/ml_venv/lib64/python3.6/site-packages/prodigy/components/db.py", line 286, in <listcomp>
    return [eg.load() for eg in examples]
  File "/home/bruce/ml/ml_venv/lib64/python3.6/site-packages/prodigy/components/db.py", line 98, in load
    return ujson.loads(content)
ValueError: Unmatched '"' when decoding 'string'

Is there some way to correct the dataset in MySQL, or do I need to drop it and start over?

Thanks!

Hi! Do you have any idea what could have caused the corruption? What’s your data and workflow like? The example contents are dumped as JSON, so it’s pretty confusing that you ended up with corrupted data here :thinking:

The source of db.py is shipped with Prodigy, so maybe you could add a print statement that prints content just before the line it fails on? It’d be interesting to see what’s in it. You can use the following command to find the location of your Prodigy installation, btw:

python -c "import prodigy; print(prodigy.__file__)"

Once you’ve found the problematic example, it shouldn’t be a problem to fix it manually in the DB.
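
If you’d rather not touch the installed source, a standalone script along these lines should do the same job: it pulls the raw content for your dataset and reports every row that won’t parse. The table and column names (example, link, dataset, content) are what Prodigy’s MySQL schema uses as far as I know, and the connection details are placeholders, so adjust both for your setup:

import ujson
import mysql.connector  # pip install mysql-connector-python

conn = mysql.connector.connect(host="localhost", user="prodigy",
                               password="xxx", database="prodigy")
cur = conn.cursor()
# Fetch the raw JSON blobs for the dataset via the link table
cur.execute("""
    SELECT example.id, example.content FROM example
    JOIN link ON link.example_id = example.id
    JOIN dataset ON dataset.id = link.dataset_id
    WHERE dataset.name = %s""", ("ooo_seed",))
for example_id, content in cur:
    try:
        ujson.loads(content)
    except ValueError as err:
        print(example_id, err)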

Hi Ines - Thanks for the lightning-fast response!

It turns out that the input text was too long and the JSON was being truncated at 64K bytes when it was added to the example table.
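
For reference, MySQL’s TEXT and BLOB column types max out at 65,535 bytes, so the clipped rows all sit exactly at that boundary. A quick check like this confirmed it for me (connection details are placeholders again):

import mysql.connector

conn = mysql.connector.connect(host="localhost", user="prodigy",
                               password="xxx", database="prodigy")
cur = conn.cursor()
# Truncated rows are pinned at the 65,535-byte column limit
cur.execute("SELECT id, LENGTH(content) FROM example "
            "WHERE LENGTH(content) >= 65535")
for example_id, n_bytes in cur:
    print(example_id, n_bytes)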

I’ve updated my process to ensure that the text objects in my JSONL input files are shorter - MUCH shorter.

You could consider adding checks to Prodigy to ensure the JSON source added to the example table isn’t so long that the DB will truncate it. The exact limit is probably DB-specific; I’m using MySQL.
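
In the meantime, a pre-flight check on the input file is easy to roll yourself. This is only a sketch (the 65,535 figure is MySQL’s TEXT/BLOB limit, other backends will differ, and the filename is made up):

import sys
import ujson

MAX_BYTES = 65535  # MySQL TEXT/BLOB column limit

with open("my_input.jsonl", encoding="utf8") as f:
    for i, line in enumerate(f, 1):
        ujson.loads(line)  # also catches malformed JSON early
        n_bytes = len(line.encode("utf8"))
        # Annotation adds fields on top of the input, so leave headroom
        if n_bytes > MAX_BYTES:
            print(f"line {i}: {n_bytes} bytes, would be truncated",
                  file=sys.stderr)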

Once I figured out the DB layout, all I needed to do was delete the rows from the link table that pointed to the too-large example entries.
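
For anyone needing the same fix, it boiled down to one statement like the following. Back up the database first, and check the table names against your own schema before running anything destructive:

import mysql.connector

conn = mysql.connector.connect(host="localhost", user="prodigy",
                               password="xxx", database="prodigy")
cur = conn.cursor()
# Unlink every example whose content was clipped at the 64K limit
cur.execute("""
    DELETE FROM link WHERE example_id IN
        (SELECT id FROM example WHERE LENGTH(content) >= 65535)""")
conn.commit()
print(cur.rowcount, "link rows deleted")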

Yay, glad to hear it worked! :+1:

And thanks for the info, that’s good to know. It might make sense to just have a more generic warning in the stream if the incoming examples are surprisingly large like this. Realistically, if a user is passing around 64KB examples, they could end up having all kinds of other problems down the line and might not even realise it. So having at least a warning when you start the stream could probably be really useful here.
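
Roughly what I have in mind is a thin wrapper around the incoming stream, something like the sketch below. To be clear, this isn’t what Prodigy does today, just an illustration of the idea, and you could drop the same thing into a custom recipe yourself:

import sys
import ujson

WARN_BYTES = 64 * 1024  # warn well before typical DB column limits

def warn_on_large_tasks(stream):
    # Pass tasks through unchanged, but flag suspiciously large ones
    for eg in stream:
        n_bytes = len(ujson.dumps(eg).encode("utf8"))
        if n_bytes >= WARN_BYTES:
            print(f"Warning: incoming example is {n_bytes} bytes and "
                  f"may be truncated by your database", file=sys.stderr)
        yield eg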