Long annotation task is not saved properly

Yuri · November 30, 2019, 4:02pm

Hi,
I'm using the text categorization recipe, and saving an annotation task that contains long text (more than 64 KB) causes a problem.
The saved annotation is truncated and the following warning is logged:

/home/user_name/.virtualenvs/my_virtual_env/lib/python3.6/site-packages/pymysql/cursors.py:170: Warning: (1265, "Data truncated for column 'content' at row 1")

The truncation corrupts the JSON of the annotation task.
As I understand it, the truncation happens because the column content has the BLOB type, which is limited to 64 KB.

I run into the problem in the following two cases:

Case 1:
When I call the get_dataset() function of the Database class defined in the db.py module. As a result, I cannot get the annotated tasks.

Case 2:
When I restart Prodigy with the same parameters as in the previous run. As a result, Prodigy cannot be restarted.

In both cases I get the following exception:

  File "/home/user_name/.virtualenvs/my_virtual_env/lib/python3.6/site-packages/prodigy/components/db.py", line 297, in get_dataset
    return [eg.load() for eg in examples]
  File "/home/user_name/.virtualenvs/my_virtual_env/lib/python3.6/site-packages/prodigy/components/db.py", line 297, in <listcomp>
    return [eg.load() for eg in examples]
  File "/home/user_name/.virtualenvs/my_virtual_env/lib/python3.6/site-packages/prodigy/components/db.py", line 99, in load
    return srsly.json_loads(content)
  File "/home/user_name/.virtualenvs/my_virtual_env/lib/python3.6/site-packages/srsly/_json_api.py", line 38, in json_loads
    return ujson.loads(data)
ValueError: Unmatched ''"' when when decoding 'string'

How can it be solved? Thank you!

ines · December 1, 2019, 6:49pm

Hi! You can probably find and clean up that corrupted entry by accessing your database directly, for example with this DB browser for SQLite: https://sqlitebrowser.org. The database file prodigy.db is in your Prodigy home directory. You can find the path if you run prodigy stats. You probably also want to back up that file before you edit it, just in case.
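
Whichever tool you use to get at the rows, the broken entries should be the ones whose content no longer parses as JSON. Here's a minimal sketch of that check, assuming you've already fetched (id, content) pairs from the table that holds the examples. The function name find_corrupted and the row shape are hypothetical and not part of Prodigy's API:

```python
import json


def find_corrupted(rows):
    """Return the ids of rows whose content is not valid JSON.

    `rows` is assumed to be an iterable of (row_id, content) pairs,
    where content is the stored task as bytes or str.
    """
    bad_ids = []
    for row_id, content in rows:
        if isinstance(content, bytes):
            # A truncated blob may end in the middle of a multi-byte
            # character, so decode leniently instead of raising here.
            content = content.decode("utf-8", errors="replace")
        try:
            json.loads(content)
        except ValueError:
            # json.JSONDecodeError is a subclass of ValueError
            bad_ids.append(row_id)
    return bad_ids
```

Once you have the ids, you can inspect or delete those rows by hand (after backing up the database).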

As a workaround to prevent this problem in the future, you could add a condition to your stream that checks the length of the examples before you send them out. Of course, it always depends on what you're training, but if you're doing text classification, there's usually no need to annotate and store whole documents as one, because you're probably averaging over sentences or smaller chunks anyway to compute the document score.
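
A length check like that could look roughly like the sketch below. It assumes the stream is an iterable of task dicts. The names filter_long_examples, task_size and MAX_BYTES are made up for illustration, and the 60,000-byte threshold is an assumed safety margin under the 64 KB column limit mentioned above, since the saved annotation will contain a few extra keys on top of the incoming task:

```python
import json

# Assumed threshold: a bit below the 64 KB limit, to leave room for
# the extra fields that get added when the annotation is saved.
MAX_BYTES = 60_000


def task_size(eg):
    """Approximate size in bytes of a task once serialized as JSON."""
    return len(json.dumps(eg).encode("utf-8"))


def filter_long_examples(stream, max_bytes=MAX_BYTES):
    """Yield only the examples that serialize to at most max_bytes."""
    for eg in stream:
        if task_size(eg) <= max_bytes:
            yield eg
```

In a custom recipe you'd wrap your stream with it, e.g. stream = filter_long_examples(stream), before returning it. Instead of dropping long examples, you could also split them into smaller chunks at that point.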
