Retain metadata in output

I often have additional metadata in my .jsonl files when I load them into Prodigy. When I run db-out, some of that metadata (for example, unique_id) is often missing. Is there a way to ensure that all metadata in the input file is retained in the output file?

Hey @cheyanneb !

I just played around with a dummy dataset that included metadata and was unable to reproduce the problem.

I used the following input data to label:

{"text": "stroopwafels are great", "meta": {"unique_id": 1, "date": "12-03"}}
{"text": "apples are healthy", "meta": {"unique_id": 2, "date": "12-03"}}
{"text": "I love medialunas", "meta": {"unique_id": 3, "date": "12-03"}}

I quickly labelled it with:

prodigy textcat.manual test_db ./data/raw/examples.jsonl --label POS,NEG

and then exported the annotations with:

prodigy db-out test_db > ./data/labelled/labelled_examples.jsonl

When I inspected the output, the metadata associated with each task was indeed present in the .jsonl file:

{"text":"stroopwafels are great","meta":{"unique_id":1,"date":"12-03"},"_input_hash":506862616,"_task_hash":-871522844,"options":[{"id":"POS","text":"POS"},{"id":"NEG","text":"NEG"}],"_view_id":"choice","config":{"choice_style":"multiple"},"accept":["POS"],"answer":"accept","_timestamp":1710242248,"_annotator_id":"2024-03-12_08-17-20","_session_id":"2024-03-12_08-17-20"}
{"text":"apples are healthy","meta":{"unique_id":2,"date":"12-03"},"_input_hash":111541500,"_task_hash":1919756499,"options":[{"id":"POS","text":"POS"},{"id":"NEG","text":"NEG"}],"_view_id":"choice","accept":["POS"],"config":{"choice_style":"multiple"},"answer":"accept","_timestamp":1710242251,"_annotator_id":"2024-03-12_08-17-20","_session_id":"2024-03-12_08-17-20"}
{"text":"I love medialunas","meta":{"unique_id":3,"date":"12-03"},"_input_hash":1251118931,"_task_hash":599886080,"options":[{"id":"POS","text":"POS"},{"id":"NEG","text":"NEG"}],"_view_id":"choice","accept":["POS"],"config":{"choice_style":"multiple"},"answer":"accept","_timestamp":1710242254,"_annotator_id":"2024-03-12_08-17-20","_session_id":"2024-03-12_08-17-20"}

Are your labelling pipelines highly custom? For example, do you have any custom post-processing steps for the labelled data that might be dropping the metadata?
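If it helps to narrow down where the metadata goes missing, here is a minimal sketch of a check you could run on your input file and your db-out export. It is plain Python and not part of Prodigy: the helper names read_jsonl_lines and find_missing_meta are made up for illustration, and it assumes every task has a unique "text" value to match input and output on.

```python
import json


def read_jsonl_lines(lines):
    """Parse an iterable of JSONL lines into a list of dicts, skipping blank lines."""
    return [json.loads(line) for line in lines if line.strip()]


def find_missing_meta(input_tasks, output_tasks):
    """Return (text, missing_keys) for each annotated task that lost "meta" keys.

    Assumption: tasks are matched on their "text" field, so texts must be unique.
    """
    output_meta = {task["text"]: task.get("meta", {}) for task in output_tasks}
    problems = []
    for task in input_tasks:
        found = output_meta.get(task["text"])
        if found is None:
            continue  # not in the export, e.g. not annotated yet
        missing = sorted(set(task.get("meta", {})) - set(found))
        if missing:
            problems.append((task["text"], missing))
    return problems


# Tiny inline demo; the second "output" line has deliberately lost its unique_id.
sample_input = read_jsonl_lines([
    '{"text": "stroopwafels are great", "meta": {"unique_id": 1, "date": "12-03"}}',
    '{"text": "apples are healthy", "meta": {"unique_id": 2, "date": "12-03"}}',
])
sample_output = read_jsonl_lines([
    '{"text": "stroopwafels are great", "meta": {"unique_id": 1, "date": "12-03"}, "answer": "accept"}',
    '{"text": "apples are healthy", "meta": {"date": "12-03"}, "answer": "accept"}',
])
print(find_missing_meta(sample_input, sample_output))
```

To run it on real files, open each one and pass the file object in, e.g. `with open("./data/raw/examples.jsonl", encoding="utf-8") as f: tasks = read_jsonl_lines(f)`. If the check comes back empty on the raw db-out export but not on your final file, the metadata is most likely being dropped by a later step.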

Hopefully we can get to the bottom of this!