Dataset output format

I would like to know whether it is possible to customise the output of the prodigy db-out recipe for an NER annotation dataset. The default output contains a lot of information, including hashes, the answer, timestamps, etc. Take the following example:
{"text":"hi good afternoon this is lily from a b c travel agency how can i help you"}
Here, afternoon is a TIME entity. The prodigy db-out output is:

{"text":"hi good afternoon this is lily from a b c travel agency how can i help you","_input_hash":162856980,"_task_hash":1109836302,"_is_binary":false,"tokens":[{"text":"hi","start":0,"end":2,"id":0,"ws":true},{"text":"good","start":3,"end":7,"id":1,"ws":true},{"text":"afternoon","start":8,"end":17,"id":2,"ws":true},{"text":"this","start":18,"end":22,"id":3,"ws":true},{"text":"is","start":23,"end":25,"id":4,"ws":true},{"text":"lily","start":26,"end":30,"id":5,"ws":true},{"text":"from","start":31,"end":35,"id":6,"ws":true},{"text":"a","start":36,"end":37,"id":7,"ws":true},{"text":"b","start":38,"end":39,"id":8,"ws":true},{"text":"c","start":40,"end":41,"id":9,"ws":true},{"text":"travel","start":42,"end":48,"id":10,"ws":true},{"text":"agency","start":49,"end":55,"id":11,"ws":true},{"text":"how","start":56,"end":59,"id":12,"ws":true},{"text":"can","start":60,"end":63,"id":13,"ws":true},{"text":"i","start":64,"end":65,"id":14,"ws":true},{"text":"help","start":66,"end":70,"id":15,"ws":true},{"text":"you","start":71,"end":74,"id":16,"ws":false}],"_view_id":"ner_manual","spans":[{"start":8,"end":17,"token_start":2,"token_end":2,"label":"TIME"},{"start":26,"end":30,"token_start":5,"token_end":5,"label":"PERSON"}],"answer":"accept","_timestamp":1715756520,"_annotator_id":"2024-05-15_15-01-26","_session_id":"2024-05-15_15-01-26"}

1.a) Is it possible to remove most of this info and keep only the minimum: the target label name plus the starting and ending span of the label, and maybe some basic info as well?

For example:

{"text":"hi good afternoon this is lily from a b c travel agency how can i help you","entities": [{"text": "afternoon", "label": "TIME"}]}

1.b) Can it be done via a Python script or other means?

Thanks for the great tool Prodigy and your support! Any thoughts would be highly useful. :slight_smile:

Cheers!
e101sg

Hi @e101sg,

The easiest way to customize the db-out output would be to have a Python postprocessing script that takes as input your jsonl dataset (the output of the built-in db-out command) and modifies it as needed.
The following script would transform the dataset according to the examples you provided:

import srsly
from typing import Any, List, Dict

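def get_text(start: int, end: int, tokens: List[Dict[str, Any]]) -> str:
    # token_end is inclusive, hence end + 1; "ws" marks a trailing space after a token.
    # rstrip() drops the space that the last token of the span would otherwise leave behind.
    return ''.join(token["text"] + (" " if token["ws"] else "") for token in tokens[start:end + 1]).rstrip()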

def main():
    data = srsly.read_jsonl("input.jsonl")
    new_lines = []

    for eg in data:
        if eg["answer"] == "accept":
            spans = eg.get("spans", [])
            entities = [
                {
                    "text": get_text(span["token_start"], span["token_end"], eg["tokens"]),
                    "label": span["label"]
                }
                for span in spans
            ]
            new_lines.append({
                "text": eg["text"],
                "entities": entities
            })

    srsly.write_jsonl("input_modified.jsonl", new_lines)

if __name__ == "__main__":
    main()

I'm not familiar with your project, of course, but just a heads-up that the span offset information and the token information are required for training a spaCy pipeline, if that's what the data will be used for. The task hashes and annotator information are also useful for filtering out duplicates and for computing inter-annotator agreement, should you wish to do so in the future.
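Since the original question also asked for the start and end of each label, here is a minimal sketch of a variant that keeps the character offsets next to the entity text. It is only an illustration, not a built-in Prodigy feature: the function names simplify, convert and convert_file are made up for this example, it uses the standard json module instead of srsly so it runs without extra dependencies, and it assumes each record has the "text", "spans" and "answer" fields shown in the db-out output above.

```python
import json
from typing import Any, Dict, Iterable, Iterator


def simplify(eg: Dict[str, Any]) -> Dict[str, Any]:
    """Reduce one db-out record to its text plus minimal entity info.

    Hypothetical helper: the entity text is sliced straight from the
    character offsets, so the token list is not needed here.
    """
    text = eg["text"]
    entities = [
        {
            "text": text[span["start"]:span["end"]],
            "start": span["start"],
            "end": span["end"],
            "label": span["label"],
        }
        for span in eg.get("spans", [])
    ]
    return {"text": text, "entities": entities}


def convert(lines: Iterable[str]) -> Iterator[str]:
    """Yield one simplified JSON line per accepted example."""
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        eg = json.loads(line)
        if eg.get("answer") == "accept":
            yield json.dumps(simplify(eg))


def convert_file(src_path: str, dst_path: str) -> None:
    """Read a db-out JSONL file and write the simplified version."""
    with open(src_path, encoding="utf8") as src, open(dst_path, "w", encoding="utf8") as dst:
        for out_line in convert(src):
            dst.write(out_line + "\n")
```

Calling convert_file("input.jsonl", "input_modified.jsonl") would mirror the file names used in the script above. Keeping "start" and "end" means the reduced file can still be mapped back to character spans later, although the token information and hashes are gone.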


Hi Magda!

Great! The script works correctly. Thanks a lot. Please close this thread as solved, if that is possible. :grinning:
I wonder how to create a "Solved" tag for my questions. Please advise. It would be useful to others as well.
Thanks.
Cheers!
e101sg

Thanks @e101sg!
I just added the solved tag to this question. I think you should be able to choose it from the tag options (just like you added ner and usage).

When I hit the reply button, I do not see any option to add new tags besides the original tags that were added when the question was first asked.

I am not sure whether it is only me or others as well. I hope this helps to create awareness about the "solved" tag! :slight_smile:

Actually, I wish to see more past questions with solved tags!

Cheers!
e101sg

Thanks for pointing this out @e101sg! You're right, users can set the initial tags, but after that only moderators can edit them. I guess it's our job then to tag issues as solved :sweat_smile: