Long annotation task is not saved properly

I use the text categorization recipe, and saving an annotation task that contains a long text (more than 64 KB) causes a problem.
The annotation is truncated when it is saved, and the following warning is logged:

/home/user_name/.virtualenvs/my_virtual_env/lib/python3.6/site-packages/pymysql/cursors.py:170: Warning: (1265, "Data truncated for column 'content' at row 1")

The truncation breaks the JSON of the annotation task.
As far as I understand, the truncation happens because the content column has the BLOB type, which is limited to 64 KB.
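
Here is a minimal sketch of the effect. It uses only the standard library's json module rather than Prodigy's own loader, the task dictionary is made up for illustration, and 65,535 bytes is the maximum size of a MySQL BLOB column:

```python
import json

BLOB_LIMIT = 65535  # maximum size in bytes of a MySQL BLOB column

# Hypothetical task whose text is longer than the column can hold.
task = {"text": "a" * 70000, "label": "EXAMPLE"}
content = json.dumps(task).encode("utf8")

# What remains in the database after the "Data truncated" warning.
stored = content[:BLOB_LIMIT]

try:
    json.loads(stored.decode("utf8"))
    is_valid = True
except ValueError:
    # The closing quote and brace were cut off, so parsing fails.
    is_valid = False

print(len(content) > BLOB_LIMIT, is_valid)
```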

I have a problem in the following two cases:

Case 1:
When I call the get_dataset() function of the Database class defined in the db.py module, I cannot get the annotated tasks.

Case 2:
When I restart Prodigy with the same parameters as in the previous run, it cannot be restarted.

In these two cases I get the following exception:

  File "/home/user_name/.virtualenvs/my_virtual_env/lib/python3.6/site-packages/prodigy/components/db.py", line 297, in get_dataset
    return [eg.load() for eg in examples]
  File "/home/user_name/.virtualenvs/my_virtual_env/lib/python3.6/site-packages/prodigy/components/db.py", line 297, in <listcomp>
    return [eg.load() for eg in examples]
  File "/home/user_name/.virtualenvs/my_virtual_env/lib/python3.6/site-packages/prodigy/components/db.py", line 99, in load
    return srsly.json_loads(content)
  File "/home/user_name/.virtualenvs/my_virtual_env/lib/python3.6/site-packages/srsly/_json_api.py", line 38, in json_loads
    return ujson.loads(data)
ValueError: Unmatched ''"' when when decoding 'string'

How can it be solved? Thank you!

Hi! You can probably find and clean up that corrupted entry by accessing your database directly, for example with DB Browser for SQLite: https://sqlitebrowser.org The database file prodigy.db is in your Prodigy home directory. You can find the path if you run prodigy stats. You probably also want to back up that file before you edit it, just in case.
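
If you'd rather locate the broken rows from a script, something along these lines might work. This is only a rough sketch: the column name content comes from your warning message, but the table name example and the id column are assumptions you should check against your own schema, and the demo runs on an in-memory SQLite database rather than a real Prodigy database. With MySQL you would run the same query through a pymysql cursor instead of conn.execute.

```python
import json
import sqlite3


def find_corrupted_rows(conn, table="example"):
    """Return the ids of rows whose 'content' column is not valid JSON."""
    bad_ids = []
    # The table name is interpolated into the query, so only pass trusted values.
    for row_id, content in conn.execute(f"SELECT id, content FROM {table}"):
        if isinstance(content, bytes):
            content = content.decode("utf8", errors="replace")
        try:
            json.loads(content)
        except (ValueError, TypeError):
            bad_ids.append(row_id)
    return bad_ids


# Self-contained demo on an in-memory database (not a real Prodigy database).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE example (id INTEGER PRIMARY KEY, content BLOB)")
conn.execute("INSERT INTO example (content) VALUES (?)", ('{"text": "ok"}',))
conn.execute("INSERT INTO example (content) VALUES (?)", ('{"text": "cut off',))
corrupted = find_corrupted_rows(conn)
print(corrupted)
```

Once you know the ids, you can inspect those rows and delete or repair them, ideally on a copy of the database first.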

As a workaround to prevent this problem in the future, you could add a condition to your stream that checks the length of the examples before you send them out. Of course, it always depends on what you're training, but if you're doing text classification, there's usually no need to annotate and store whole documents as one, because you're probably averaging over sentences or smaller chunks anyway to compute the document score.
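
A length check like that could look roughly like this. It's a sketch, not a built-in Prodigy feature: filter_long_examples and MAX_BYTES are hypothetical names, and the 60,000-byte threshold is an assumed safety margin below the 64 KB limit, since the saved task contains more than just the incoming text (for example the answer).

```python
import json

MAX_BYTES = 60000  # assumed safety margin below the 64 KB column limit


def filter_long_examples(stream, max_bytes=MAX_BYTES):
    """Yield only the examples whose serialized JSON fits within max_bytes."""
    for eg in stream:
        size = len(json.dumps(eg).encode("utf8"))
        if size <= max_bytes:
            yield eg


# Demo with a made-up stream: one short example and one that is too long.
stream = [{"text": "short text"}, {"text": "x" * 70000}]
kept = list(filter_long_examples(stream))
print(len(kept))
```

In a custom recipe you would wrap your stream in this generator before returning it. Instead of skipping the long examples, you could also split them into sentences or paragraphs and send those out as separate tasks.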