db-merge errors

I’ve annotated a number of examples (using a mix of ner.teach and ner.manual) in different datasets, and now need to merge them all together to create one dataset that I can train a model on.

When running pgy db-merge dataset1,dataset2 out_dataset, I get the following error:

Traceback (most recent call last):
  File "/usr/local/Cellar/python/3.7.4/Frameworks/Python.framework/Versions/3.7/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/usr/local/Cellar/python/3.7.4/Frameworks/Python.framework/Versions/3.7/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/Users/alex/.local/share/virtualenvs/prodigy-MEGM_AZG/lib/python3.7/site-packages/prodigy/__main__.py", line 372, in <module>
    plac.call(commands[command], arglist=args, eager=False)
  File "/Users/alex/.local/share/virtualenvs/prodigy-MEGM_AZG/lib/python3.7/site-packages/plac_core.py", line 328, in call
    cmd, result = parser.consume(arglist)
  File "/Users/alex/.local/share/virtualenvs/prodigy-MEGM_AZG/lib/python3.7/site-packages/plac_core.py", line 207, in consume
    return cmd, self.func(*(args + varargs + extraopts), **kwargs)
  File "/Users/alex/.local/share/virtualenvs/prodigy-MEGM_AZG/lib/python3.7/site-packages/prodigy/__main__.py", line 317, in db_merge
    examples = DB.get_dataset(set_id)
  File "/Users/alex/.local/share/virtualenvs/prodigy-MEGM_AZG/lib/python3.7/site-packages/prodigy/components/db.py", line 296, in get_dataset
    return [eg.load() for eg in examples]
  File "/Users/alex/.local/share/virtualenvs/prodigy-MEGM_AZG/lib/python3.7/site-packages/prodigy/components/db.py", line 296, in <listcomp>
    return [eg.load() for eg in examples]
  File "/Users/alex/.local/share/virtualenvs/prodigy-MEGM_AZG/lib/python3.7/site-packages/prodigy/components/db.py", line 99, in load
    return srsly.json_loads(content)
  File "/Users/alex/.local/share/virtualenvs/prodigy-MEGM_AZG/lib/python3.7/site-packages/srsly/_json_api.py", line 37, in json_loads
    return ujson.loads(data)
ValueError: Unmatched ''"' when when decoding 'string'

Running prodigy v1.8.3.

Thanks in advance :slight_smile:

Is it possible that something in your database got corrupted? The db-merge recipe does very little magic and ultimately, it just calls into db.get_dataset. And loading one example here seems to fail because something is corrupted :thinking:

If you just run the following, you should see the same error:

from prodigy.components.db import connect
db = connect()
for dataset in ["dataset1", "dataset2"]:
    examples = db.get_dataset(dataset)

If it turns out that there’s a problem with your SQLite database, it shouldn’t be too tricky to recover it, though. You can use the same code Prodigy does in db.py to find the problematic example and then adjust it using an SQLite browser.
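A minimal sketch of that search, in case it helps. It goes through the standard library's sqlite3 and json modules rather than Prodigy's own loader, and it assumes the examples sit in a table called example with an id column and a JSON content column. Those column names are an assumption, so check them against your own schema first. The in-memory database at the bottom is made-up demo data, only there to show the function running:

```python
import json
import sqlite3


def find_broken_rows(conn, table="example", column="content"):
    """Return (row id, error message) for every row whose content is not valid JSON.

    The table and column names are assumptions about the schema, not a
    documented Prodigy API. Adjust them to match your database.
    """
    broken = []
    cursor = conn.execute(f"SELECT id, {column} FROM {table}")
    for row_id, content in cursor:
        # BLOB columns come back as bytes, so decode before parsing.
        if isinstance(content, bytes):
            content = content.decode("utf8", errors="replace")
        try:
            json.loads(content)
        except (TypeError, ValueError) as err:
            # JSONDecodeError is a ValueError; TypeError covers NULL content.
            broken.append((row_id, str(err)))
    return broken


# Demo on a throwaway in-memory database (hypothetical data).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE example (id INTEGER PRIMARY KEY, content BLOB)")
conn.execute("INSERT INTO example (content) VALUES (?)", (b'{"text": "ok"}',))
conn.execute("INSERT INTO example (content) VALUES (?)", (b'{"text": "cut o',))

broken = find_broken_rows(conn)
for row_id, message in broken:
    print(f"row {row_id}: {message}")
```

For a real database you would pass a connection to your own file instead of ":memory:". The row ids it reports are the ones to look up in the SQLite browser.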

It’d also be very interesting to investigate how this happened. The example data is dumped as JSON, then stored and then loaded back as JSON. So it’s unlikely that the serialization itself went wrong, because dumps/loads is consistent. So I wonder if maybe the connection was lost while an example was being written?

Thanks for getting back - just this minute solved it. Some of the items in the example table were truncated because of their length, which resulted in invalid JSON being stored (and later parsed). Should be able to avoid the problem in the future by changing the field type from BLOB to LONGBLOB.
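For anyone who hits the same thing, here is a rough sketch of why the truncation breaks parsing and how to spot oversized examples before they are stored. It assumes a MySQL backend, where a BLOB column holds at most 65,535 bytes. The helper names and the sample example are hypothetical, not part of Prodigy:

```python
import json

# MySQL's BLOB type stores at most 2**16 - 1 bytes.
MYSQL_BLOB_MAX_BYTES = 65_535


def serialized_size(example):
    """Size in bytes of the example once dumped to JSON and UTF-8 encoded."""
    return len(json.dumps(example).encode("utf8"))


def fits_in_blob(example, limit=MYSQL_BLOB_MAX_BYTES):
    """True if the serialized example fits within the column limit."""
    return serialized_size(example) <= limit


# A made-up example whose text alone is longer than the BLOB limit.
long_example = {"text": "a" * 70_000}
payload = json.dumps(long_example).encode("utf8")

# Simulate what is left after the column cuts the value off at the limit.
truncated = payload[:MYSQL_BLOB_MAX_BYTES]

try:
    json.loads(truncated)
    truncated_is_valid = True
except ValueError as err:
    truncated_is_valid = False
    print("Truncated payload no longer parses:", err)

print("Fits in BLOB:", fits_in_blob(long_example))
```

Running a check like fits_in_blob over the source data before annotating would flag the examples that need the larger column type.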


Thanks for updating and glad you solved it! We’re slightly reluctant to make changes to the DB schema that introduce backwards incompatibilities (at least for Prodigy v1.x). But if there’s a way to implement the BLOB to LONGBLOB change without it, we’d definitely be interested in changing that.