Retain metadata in output

cheyanneb · March 11, 2024, 6:37pm

I often have additional metadata in my .jsonl files when I upload them to Prodigy. When I run db-out, the metadata is often missing. Examples: unique_id, etc. Is there a way to ensure that all metadata in the input file is retained in the output file?

india-kerle · March 12, 2024, 11:24am

Hey @cheyanneb !

I just played around with a dummy dataset that included metadata and was unable to replicate your error.

I used the following input data to label:

{"text": "stroopwafels are great", "meta": {"unique_id": 1, "date": "12-03"}}
{"text": "apples are healthy", "meta": {"unique_id": 2, "date": "12-03"}}
{"text": "I love medialunas", "meta": {"unique_id": 3, "date": "12-03"}}

quickly labelled it with:

prodigy textcat.manual test_db ./data/raw/examples.jsonl --label POS,NEG

and exported the annotations:

prodigy db-out test_db > ./data/labelled/labelled_examples.jsonl

When I inspected the output, the metadata associated with each task was indeed present in the .jsonl file:

{"text":"stroopwafels are great","meta":{"unique_id":1,"date":"12-03"},"_input_hash":506862616,"_task_hash":-871522844,"options":[{"id":"POS","text":"POS"},{"id":"NEG","text":"NEG"}],"_view_id":"choice","config":{"choice_style":"multiple"},"accept":["POS"],"answer":"accept","_timestamp":1710242248,"_annotator_id":"2024-03-12_08-17-20","_session_id":"2024-03-12_08-17-20"}
{"text":"apples are healthy","meta":{"unique_id":2,"date":"12-03"},"_input_hash":111541500,"_task_hash":1919756499,"options":[{"id":"POS","text":"POS"},{"id":"NEG","text":"NEG"}],"_view_id":"choice","accept":["POS"],"config":{"choice_style":"multiple"},"answer":"accept","_timestamp":1710242251,"_annotator_id":"2024-03-12_08-17-20","_session_id":"2024-03-12_08-17-20"}
{"text":"I love medialunas","meta":{"unique_id":3,"date":"12-03"},"_input_hash":1251118931,"_task_hash":599886080,"options":[{"id":"POS","text":"POS"},{"id":"NEG","text":"NEG"}],"_view_id":"choice","accept":["POS"],"config":{"choice_style":"multiple"},"answer":"accept","_timestamp":1710242254,"_annotator_id":"2024-03-12_08-17-20","_session_id":"2024-03-12_08-17-20"}

Are your labelling pipelines highly custom? i.e. do you have any custom post-processing steps for the labelled data that might be dropping the metadata?
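
To illustrate what I mean: a post-processing step that builds a new record from only a few fields will silently lose "meta" unless it is copied over explicitly. Below is a minimal sketch of a step that keeps it, assuming the choice-style output shown above (with an "accept" list). The function name to_training_row and the "label" key are hypothetical, not something Prodigy provides.

```python
def to_training_row(record):
    """Reduce an exported annotation to the fields needed downstream,
    carrying the original metadata along."""
    return {
        "text": record["text"],
        # Selected option ids from the choice interface, e.g. ["POS"].
        "label": list(record.get("accept", [])),
        # Copy meta explicitly; leaving this line out is an easy way
        # to end up with output that has no unique_id.
        "meta": dict(record.get("meta", {})),
    }
```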

Hopefully we can get to the bottom of this!
