I often have additional metadata in my .jsonl files when I upload them to Prodigy. When I run db-out, the metadata is often missing. Examples: unique_id, etc. Is there a way to ensure that all metadata in the input file is retained in the output file?
Hey @cheyanneb!
I just played around with a dummy dataset that included metadata and was unable to replicate your error.
I used the following input data to label:
{"text": "stroopwafels are great", "meta": {"unique_id": 1, "date": "12-03"}}
{"text": "apples are healthy", "meta": {"unique_id": 2, "date": "12-03"}}
{"text": "I love medialunas", "meta": {"unique_id": 3, "date": "12-03"}}
I quickly labelled it with:
prodigy textcat.manual test_db ./data/raw/examples.jsonl --label POS,NEG
and then exported the annotations:
prodigy db-out test_db > ./data/labelled/labelled_examples.jsonl
When I inspected the output, the metadata associated with each task was indeed present in the .jsonl file:
{"text":"stroopwafels are great","meta":{"unique_id":1,"date":"12-03"},"_input_hash":506862616,"_task_hash":-871522844,"options":[{"id":"POS","text":"POS"},{"id":"NEG","text":"NEG"}],"_view_id":"choice","config":{"choice_style":"multiple"},"accept":["POS"],"answer":"accept","_timestamp":1710242248,"_annotator_id":"2024-03-12_08-17-20","_session_id":"2024-03-12_08-17-20"}
{"text":"apples are healthy","meta":{"unique_id":2,"date":"12-03"},"_input_hash":111541500,"_task_hash":1919756499,"options":[{"id":"POS","text":"POS"},{"id":"NEG","text":"NEG"}],"_view_id":"choice","accept":["POS"],"config":{"choice_style":"multiple"},"answer":"accept","_timestamp":1710242251,"_annotator_id":"2024-03-12_08-17-20","_session_id":"2024-03-12_08-17-20"}
{"text":"I love medialunas","meta":{"unique_id":3,"date":"12-03"},"_input_hash":1251118931,"_task_hash":599886080,"options":[{"id":"POS","text":"POS"},{"id":"NEG","text":"NEG"}],"_view_id":"choice","accept":["POS"],"config":{"choice_style":"multiple"},"answer":"accept","_timestamp":1710242254,"_annotator_id":"2024-03-12_08-17-20","_session_id":"2024-03-12_08-17-20"}
Are your labelling pipelines highly customised? That is, do you have any custom post-processing steps for the labelled data that might be dropping the metadata?
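If it helps narrow things down, here is a minimal sketch of a check you could run on your own files. It compares the meta values in the input against those in the db-out export and lists any that went missing. The helper names (read_jsonl, missing_meta_ids) are hypothetical and not part of Prodigy, and it assumes your identifiers live under "meta" as in my example above; adjust the key if yours are stored elsewhere.

```python
import json


def read_jsonl(lines):
    """Parse an iterable of JSONL lines, skipping blank ones."""
    return [json.loads(line) for line in lines if line.strip()]


def missing_meta_ids(input_tasks, output_tasks, key="unique_id"):
    """Return meta[key] values present in the input but absent from the output."""
    # Collect every value of meta[key] found in the exported annotations.
    seen = {task.get("meta", {}).get(key) for task in output_tasks}
    return [
        task["meta"][key]
        for task in input_tasks
        if key in task.get("meta", {}) and task["meta"][key] not in seen
    ]


# Inline sample data (assumed for illustration): the second exported
# task has lost its "meta" field.
raw_lines = [
    '{"text": "stroopwafels are great", "meta": {"unique_id": 1}}',
    '{"text": "apples are healthy", "meta": {"unique_id": 2}}',
]
labelled_lines = [
    '{"text": "stroopwafels are great", "meta": {"unique_id": 1}, "answer": "accept"}',
    '{"text": "apples are healthy", "answer": "accept"}',
]

missing = missing_meta_ids(read_jsonl(raw_lines), read_jsonl(labelled_lines))
print(missing)  # [2]
```

To run it on real files, pass open file handles instead of the inline lists, e.g. read_jsonl(open("./data/raw/examples.jsonl", encoding="utf-8")). An empty result means every identifier made it through the export, so any loss would be happening in a later step.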
Hopefully we can get to the bottom of this!