Came across an inconsistency in the behavior of db-out
and print-dataset
across the different database backends:
- sqlite annotations are displayed in ascending order
- postgres ones in descending order
Details follow.
Data & Recipe
I have a text file (sentences.txt) with the following contents:
This is the first sentence.
Second sentence.
Third.
The fourth.
Fifth sentence.
I annotate this data using both the SQLite and Postgress backends with this recipe:
pgy ner.manual eg_dataset en_core_web_sm sentences.txt --label A,B
Results ā SQLite
Command
pgy ner-print-dataset eg_dataset
Results
This is the first sentence.
Second sentence.
Third.
The fourth.
Fifth sentence.
Command 2
pgy db-out eg_dataset
Results 2
{"text":"This is the first sentence.","_input_hash":-12818154,"_task_hash":-822717874,"tokens":[{"text":"This","start":0,"end":4,"id":0},{"text":"is","start":5,"end":7,"id":1},{"text":"the","start":8,"end":11,"id":2},{"text":"first","start":12,"end":17,"id":3},{"text":"sentence","start":18,"end":26,"id":4},{"text":".","start":26,"end":27,"id":5}],"answer":"accept"}
{"text":"Second sentence.","_input_hash":-1429988537,"_task_hash":1832064099,"tokens":[{"text":"Second","start":0,"end":6,"id":0},{"text":"sentence","start":7,"end":15,"id":1},{"text":".","start":15,"end":16,"id":2}],"answer":"reject"}
{"text":"Third.","_input_hash":-1944134528,"_task_hash":-1709782840,"tokens":[{"text":"Third","start":0,"end":5,"id":0},{"text":".","start":5,"end":6,"id":1}],"answer":"ignore"}
{"text":"The fourth.","_input_hash":-1792504082,"_task_hash":-1572011286,"tokens":[{"text":"The","start":0,"end":3,"id":0},{"text":"fourth","start":4,"end":10,"id":1},{"text":".","start":10,"end":11,"id":2}],"answer":"accept"}
{"text":"Fifth sentence.","_input_hash":697961341,"_task_hash":73728894,"tokens":[{"text":"Fifth","start":0,"end":5,"id":0},{"text":"sentence","start":6,"end":14,"id":1},{"text":".","start":14,"end":15,"id":2}],"answer":"reject"}
Results ā Postgres
Command
pgy ner-print-dataset eg_dataset
Results
Fifth sentence.
The fourth.
Third.
Second sentence.
This is the first sentence.
Command 2
pgy db-out eg_dataset
Results 2
{"text":"Fifth sentence.","_input_hash":697961341,"_task_hash":73728894,"tokens":[{"text":"Fifth","start":0,"end":5,"id":0},{"text":"sentence","start":6,"end":14,"id":1},{"text":".","start":14,"end":15,"id":2}],"spans":[{"start":6,"end":14,"token_start":1,"token_end":1,"label":"A"}],"answer":"accept"}
{"text":"The fourth.","_input_hash":-1792504082,"_task_hash":-1572011286,"tokens":[{"text":"The","start":0,"end":3,"id":0},{"text":"fourth","start":4,"end":10,"id":1},{"text":".","start":10,"end":11,"id":2}],"answer":"accept"}
{"text":"Third.","_input_hash":-1944134528,"_task_hash":-1709782840,"tokens":[{"text":"Third","start":0,"end":5,"id":0},{"text":".","start":5,"end":6,"id":1}],"answer":"ignore"}
{"text":"Second sentence.","_input_hash":-1429988537,"_task_hash":1832064099,"tokens":[{"text":"Second","start":0,"end":6,"id":0},{"text":"sentence","start":7,"end":15,"id":1},{"text":".","start":15,"end":16,"id":2}],"answer":"accept"}
{"text":"This is the first sentence.","_input_hash":-12818154,"_task_hash":-822717874,"tokens":[{"text":"This","start":0,"end":4,"id":0},{"text":"is","start":5,"end":7,"id":1},{"text":"the","start":8,"end":11,"id":2},{"text":"first","start":12,"end":17,"id":3},{"text":"sentence","start":18,"end":26,"id":4},{"text":".","start":26,"end":27,"id":5}],"answer":"accept"}
Problem
As you can see above, the results are in the opposite order:
- SQLite version lists the first annotation first
- Postgres version lists the last annotation first
Solution (Temporary)
Iām guessing this is due to Prodigy not taking a stance on the order of the fetched results: leaving the different database connectors to return the results in their own default order. It would be great to hear what you think of this problem.
Not a blocker though: I have written custom sql code to fetch the annotations in a specific order:
SELECT L.id, E.id, E.content
FROM link L
INNER JOIN example E ON E.id = L.example_id
WHERE L.dataset_id = {dataset_id}
ORDER BY L.id ASC;
This yields the results in the expected order: