Inconsistent database query order (in db-out, print-dataset)

Came across an inconsistency in the behavior of db-out and print-dataset across the different database backends:

  • sqlite annotations are displayed in ascending order
  • postgres ones in descending order

Details follow.

Data & Recipe

I have a text file (sentences.txt) with the following contents:

This is the first sentence.
Second sentence.
Third.
The fourth.
Fifth sentence.

I annotate this data using both the SQLite and Postgress backends with this recipe:

pgy ner.manual eg_dataset en_core_web_sm sentences.txt --label A,B

Results ā€” SQLite

Command

pgy ner-print-dataset eg_dataset

Results

This is the first sentence.
Second sentence.
Third.
The fourth.
Fifth sentence.

Command 2

pgy db-out eg_dataset

Results 2

{"text":"This is the first sentence.","_input_hash":-12818154,"_task_hash":-822717874,"tokens":[{"text":"This","start":0,"end":4,"id":0},{"text":"is","start":5,"end":7,"id":1},{"text":"the","start":8,"end":11,"id":2},{"text":"first","start":12,"end":17,"id":3},{"text":"sentence","start":18,"end":26,"id":4},{"text":".","start":26,"end":27,"id":5}],"answer":"accept"}
{"text":"Second sentence.","_input_hash":-1429988537,"_task_hash":1832064099,"tokens":[{"text":"Second","start":0,"end":6,"id":0},{"text":"sentence","start":7,"end":15,"id":1},{"text":".","start":15,"end":16,"id":2}],"answer":"reject"}
{"text":"Third.","_input_hash":-1944134528,"_task_hash":-1709782840,"tokens":[{"text":"Third","start":0,"end":5,"id":0},{"text":".","start":5,"end":6,"id":1}],"answer":"ignore"}
{"text":"The fourth.","_input_hash":-1792504082,"_task_hash":-1572011286,"tokens":[{"text":"The","start":0,"end":3,"id":0},{"text":"fourth","start":4,"end":10,"id":1},{"text":".","start":10,"end":11,"id":2}],"answer":"accept"}
{"text":"Fifth sentence.","_input_hash":697961341,"_task_hash":73728894,"tokens":[{"text":"Fifth","start":0,"end":5,"id":0},{"text":"sentence","start":6,"end":14,"id":1},{"text":".","start":14,"end":15,"id":2}],"answer":"reject"}

Results ā€” Postgres

Command

pgy ner-print-dataset eg_dataset

Results

Fifth  sentence.
The fourth.
Third.
Second sentence.
This is the first sentence.

Command 2

pgy db-out eg_dataset

Results 2

{"text":"Fifth sentence.","_input_hash":697961341,"_task_hash":73728894,"tokens":[{"text":"Fifth","start":0,"end":5,"id":0},{"text":"sentence","start":6,"end":14,"id":1},{"text":".","start":14,"end":15,"id":2}],"spans":[{"start":6,"end":14,"token_start":1,"token_end":1,"label":"A"}],"answer":"accept"}
{"text":"The fourth.","_input_hash":-1792504082,"_task_hash":-1572011286,"tokens":[{"text":"The","start":0,"end":3,"id":0},{"text":"fourth","start":4,"end":10,"id":1},{"text":".","start":10,"end":11,"id":2}],"answer":"accept"}
{"text":"Third.","_input_hash":-1944134528,"_task_hash":-1709782840,"tokens":[{"text":"Third","start":0,"end":5,"id":0},{"text":".","start":5,"end":6,"id":1}],"answer":"ignore"}
{"text":"Second sentence.","_input_hash":-1429988537,"_task_hash":1832064099,"tokens":[{"text":"Second","start":0,"end":6,"id":0},{"text":"sentence","start":7,"end":15,"id":1},{"text":".","start":15,"end":16,"id":2}],"answer":"accept"}
{"text":"This is the first sentence.","_input_hash":-12818154,"_task_hash":-822717874,"tokens":[{"text":"This","start":0,"end":4,"id":0},{"text":"is","start":5,"end":7,"id":1},{"text":"the","start":8,"end":11,"id":2},{"text":"first","start":12,"end":17,"id":3},{"text":"sentence","start":18,"end":26,"id":4},{"text":".","start":26,"end":27,"id":5}],"answer":"accept"}

Problem

As you can see above, the results are in the opposite order:

  • SQLite version lists the first annotation first
  • Postgres version lists the last annotation first

Solution (Temporary)

Iā€™m guessing this is due to Prodigy not taking a stance on the order of the fetched results: leaving the different database connectors to return the results in their own default order. It would be great to hear what you think of this problem.

Not a blocker though: I have written custom sql code to fetch the annotations in a specific order:

SELECT L.id, E.id, E.content
FROM link L
INNER JOIN example E ON E.id = L.example_id
WHERE L.dataset_id = {dataset_id}
ORDER BY L.id ASC;

This yields the results in the expected order:

54%20PM

Thanks for the super detailed report and analysis! This is interesting and I had no idea there was such a significant difference ā€“ I just assumed things would be consistent, although, it's a bad thing to assume because they almost never are :wink:

Yes, this definitely makes sense. For Prodigy, we use peewee to manage the database integration ā€“ I just checked and while we do use order_by explicity when fetching the datasets, I don't think we currently impose any ordering when we get the examples. I'll have a look and see how to best solve this! :+1: (By the way, the source of components/db.py is shipped with Prodigy, so you can probably also edit something there in the meantime.)

1 Like