Cannot load data with db-in on Prodigy 1.8.3 using annotations created with 1.6

We have a Prodigy database in PostgreSQL holding annotations that we created with Prodigy 1.6.1. We’re in the process of upgrading to Prodigy 1.8.3, and as part of that process we’ve shifted to SQLite.

We’ve exported the annotations from this database to a JSONL file just fine using db-out (with Prodigy 1.8.3).

Now, to move those annotations into a SQLite database, we’re running the following commands:

$ prodigy dataset my_dataset_name "My English Dataset"
$ prodigy db-in my_dataset_name my_dataset_name.jsonl

The last command errors with the following trace:

17:19:20 - APP: Using Hug endpoints (deprecated)
17:19:21 - DB: Initialising database SQLite
17:19:21 - DB: Connecting to database SQLite
17:19:21 - LOADER: Using file extension 'jsonl' to find loader
17:19:21 - LOADER: Loading stream from jsonl
Traceback (most recent call last):
  File "cython_src/prodigy/components/loaders.pyx", line 145, in prodigy.components.loaders.JSONL
  File "/usr/local/lib/python3.7/site-packages/srsly/_json_api.py", line 37, in json_loads
    return ujson.loads(data)
ValueError: Trailing data

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/usr/local/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/usr/local/lib/python3.7/site-packages/prodigy/__main__.py", line 372, in <module>
    plac.call(commands[command], arglist=args, eager=False)
  File "/usr/local/lib/python3.7/site-packages/plac_core.py", line 328, in call
    cmd, result = parser.consume(arglist)
  File "/usr/local/lib/python3.7/site-packages/plac_core.py", line 207, in consume
    return cmd, self.func(*(args + varargs + extraopts), **kwargs)
  File "/usr/local/lib/python3.7/site-packages/prodigy/__main__.py", line 225, in db_in
    annotations = [set_hashes(eg) for eg in annotations]
  File "/usr/local/lib/python3.7/site-packages/prodigy/__main__.py", line 225, in <listcomp>
    annotations = [set_hashes(eg) for eg in annotations]
  File "cython_src/prodigy/components/loaders.pyx", line 152, in JSONL
ValueError: Failed to load task (invalid JSON).

{"text":"\n3.","_input_hash":1308918942,"_task_has  ...  end":492,"label":"ENTITY_TYPE"}],"answer":"accept"}

The following dependencies are installed:

blis==0.2.4
cachetools==3.1.1
certifi==2019.3.9
chardet==3.0.4
cymem==2.0.2
falcon==1.4.1
hug==2.4.8
idna==2.8
jsonschema==2.6.0
murmurhash==1.0.2
numpy==1.16.4
peewee==2.10.2
plac==0.9.6
preshed==2.0.1
prodigy==1.8.3
psycopg2-binary==2.8.2
PyJWT==1.7.1
python-mimeparse==1.6.0
requests==2.22.0
six==1.12.0
spacy==2.1.4
srsly==0.0.6
thinc==7.0.4
toolz==0.9.0
tqdm==4.32.1
urllib3==1.25.3
waitress==1.2.1
wasabi==0.2.2

Any assistance is greatly appreciated!

Hi! Did you edit the file in a text editor between exporting and importing? If you see an “Invalid JSON” error, it usually means exactly that – somewhere between json.dumps (during export) and json.loads (during import), a line got corrupted. One of the most common sources of the problem is if the file was edited manually, or saved again in an editor that messed up the line endings or inserted trailing newlines. But you might also want to check that the format is indeed JSONL (one object per line).

If you want to inspect the file more closely and find the line that can’t be loaded, you could use a script like this to print the line it fails on:

from pathlib import Path
import json

with Path("my_dataset_name.jsonl").open("r", encoding="utf8") as f:
    for line in f:
        try:
            json.loads(line.strip())
        except ValueError:  # this line isn't valid JSON
            print(line)

Hi @ines, thanks for the quick response.

No modifications were made to the files.

As a test, I ran these commands in immediate sequence:

$ prodigy stats
17:58:49 - APP: Using Hug endpoints (deprecated)
17:58:49 - DB: Initialising database PostgreSQL
17:58:50 - DB: Connecting to database PostgreSQL

  ✨  Prodigy stats

Version          1.8.3
Location         /usr/local/lib/python3.7/site-packages/prodigy
Prodigy Home     /prodigy
Platform         Linux-4.9.125-linuxkit-x86_64-with-debian-9.9
Python Version   3.7.3
Database Name    PostgreSQL
Database Id      postgresql
Total Datasets   2
Total Sessions   44

$ prodigy db-out my_dataset_name .
17:56:19 - APP: Using Hug endpoints (deprecated)
17:56:19 - DB: Initialising database PostgreSQL
17:56:19 - DB: Connecting to database PostgreSQL
17:56:28 - DB: Loading dataset 'my_dataset_name' (9566 examples)

  ✨  Exported 9566 annotations for 'my_dataset_name' from database PostgreSQL
  /prodigy/my_dataset_name.jsonl

$ python find_bad_lines.py
Found bad line 9565, saving to bad/line_9565.out

Contents of find_bad_lines.py:

from pathlib import Path
import json

with Path("my_dataset_name.jsonl").open("r", encoding="utf8") as f:
    for num, line in enumerate(f):
        try:
            json.loads(line.strip())
        except:
            out_path = Path(f'bad/line_{num}.txt') 
            print(f'Found bad line {num}, saving to {str(out_path)}')
            with open(out_path, 'w', encoding='utf8') as out: 
                out.write(line)

Looking at the output file, it seems the issue is that two JSON objects in the JSONL file are on the same line. Could it be that when my training sessions alternated between ner.teach and ner.make-gold, the annotation data was not properly separated on export?

Thanks for running the tests and glad you found the "offending" line. This is very mysterious, though, and I wonder how this could have happened :thinking:

Hmm, this is pretty unlikely, because the examples are only really turned into the JSONL representation when you export them. When you call db-out, Prodigy retrieves the examples as a list first, then it pretty much calls json.dumps on each dict in the list, joins the records with a newline and writes the result to a file. If something was broken in the examples, the list of example dicts would have been broken as well and nothing would have worked. So there's very little magic here.
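
In other words, the export step is roughly equivalent to something like this (a simplified sketch for illustration, not the actual Prodigy source):

import json

def export_dataset(examples, path):
    # Serialize each example dict to one JSON string per line
    lines = [json.dumps(eg) for eg in examples]
    # Join the records with newlines and write everything out in one go
    with open(path, "w", encoding="utf8") as f:
        f.write("\n".join(lines))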

Can you reproduce this problem btw? Like, when you export the file again in v1.6, does it produce the same corrupted line?

Just checked - the file is not corrupted when exported with 1.6.1 or 1.7.1, and files exported with these two versions are identical.

Also worth noting: files exported with 1.8.3 are considerably larger, almost double in size. my_dataset_name-1.7.1.jsonl is 38 MB, while my_dataset_name-1.8.3.jsonl is 76 MB.

Thanks for the report, this is quite eerie. I’m properly perplexed. Oops, never mind! Okay, there’s a bug in the serialization library we’ve been using in v1.8 that we weren’t using in v1.6.

The bug is that the srsly.write_jsonl function sets the file mode to "a" by default: https://github.com/explosion/srsly/blob/master/srsly/_json_api.py#L89 . We need to change this to "w" by default and have a flag to tell it to append.
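
To illustrate, the effect is roughly this (just a minimal sketch of the behaviour, not the actual srsly code):

import json

def write_jsonl(path, lines, mode="a"):   # "a" as the default is the bug
    # Serialize one record per line and append to whatever is already in the file
    text = "\n".join(json.dumps(line) for line in lines)
    with open(path, mode, encoding="utf8") as f:
        f.write(text)

records = [{"text": "hello"}, {"text": "world"}]
write_jsonl("out.jsonl", records)
write_jsonl("out.jsonl", records)
# The file is now twice as large, and the last record of the first write and the
# first record of the second write end up on the same line - i.e. invalid JSONL.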

The good news is that the workaround in the meantime is simple: if you delete the file you’ve been writing to, you’ll find it writes a new file cleanly. The append mode explains both the concatenated invalid line and the mysteriously increasing file size…
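
So in your case, something like this should give you a clean file to import (the same commands as before, just with the old export deleted first):

$ rm my_dataset_name.jsonl
$ prodigy db-out my_dataset_name .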

PR on srsly:

Great, thank you! Any idea of when you intend to release v0.0.7 on pypi?

Edit: I see it now - this solves the issue. Thanks again for the help!
