Some postgres data fields stored as hex (bytea hex format)

I was having trouble reading the backend postgres data using psql directly due to the data being encoded. Notably, the following fields appear to be stored in hex bytea format by default when switching to a postgres backend: dataset.meta and example.content. It looks like the print-dataset recipe handles the decoding, but I find querying the raw data still useful.

When querying directly, this query handles the decoding (CONVERT_FROM expects the first argumenst to be a string bytea, and you can specify the “to” type):

prodigy=#  
   SELECT CONVERT_FROM(content, 'utf8')
     FROM example;

You can find more information about the hex format used by postgres here: https://www.postgresql.org/docs/9.1/static/datatype-binary.html#AEN5296
And more information about type conversions here: https://www.postgresql.org/docs/9.1/static/functions-string.html

Describe dataset

prodigy=# \d+ dataset
 Column  |          Type          |                      Modifiers                       | Storage  | Stats target | Description
---------+------------------------+------------------------------------------------------+----------+--------------+------------
 id      | integer                | not null default nextval('dataset_id_seq'::regclass) | plain    |              | 
 name    | character varying(255) | not null                                             | extended |              | 
 created | integer                | not null                                             | plain    |              | 
 meta    | bytea                  | not null                                             | extended |              | 
 session | boolean                | not null                                             | plain    |              | 

Describe example

prodigy=# \d+ example
   Column   |  Type   |                      Modifiers                       | Storage  | Stats target | Description 
------------+---------+------------------------------------------------------+----------+--------------+-------------
 id         | integer | not null default nextval('example_id_seq'::regclass) | plain    |              | 
 input_hash | bigint  | not null                                             | plain    |              | 
 task_hash  | bigint  | not null                                             | plain    |              | 
 content    | bytea   | not null                                             | extended |              | 

Hmm! This is news to me actually. We use an ORM, peewee, to manage the database connectivity. Perhaps this is something it’s doing? Good to know, anyway!

Any news in this topic? It’s really annoying when you work with postgresql because you can’t use JSON operators in queries. peewee supports postgresql but it needs additional import for postgresql features. You can read more about it in peewee documentation.

@sdarmofal We’re open to suggestions, but it’s not immediately obvious to me what we would need to do. Prodigy does provide the source for the database driver – we made sure to keep this in pure Python, as we figured some people might want to edit it directly. You should be able to find it in prodigy/components/db.py. So, you can probably make the edits you require directly.

If there’s a change we can make that improves things without causing backwards incompatibilities, we’d consider that. But if we have to change a value type in the table, I think that would force users to migrate their data, right? We’d be reluctant to do that.

1 Like