spans.manual merge tokens using db-out

Is there a way to alter the output when using db-out with spans.manual so that each span also shows the string it covers ("I live in idaho" = SUPERFLUOUS_INFO, "openh" = TYPO, "chicken" = STT_ERROR), in addition to the tokenized output?

Example:

PRODIGY_ALLOWED_SESSIONS=cheyanne PRODIGY_LOGGING=verbose prodigy spans.manual dataset_name blank:en /path/spans_test2.jsonl --label SUPERFLUOUS_INFO,TYPO,STT_ERROR

UI screenshot:

db-out

prodigy db-out dataset_name-cheyanne > my_output.jsonl

Tokenized span output:

{"text":"I live in idaho and I want to openh a chicken account","_input_hash":-1763618278,"_task_hash":-1693263404,"tokens":[{"text":"I","start":0,"end":1,"id":0,"ws":true},{"text":"live","start":2,"end":6,"id":1,"ws":true},{"text":"in","start":7,"end":9,"id":2,"ws":true},{"text":"idaho","start":10,"end":15,"id":3,"ws":true},{"text":"and","start":16,"end":19,"id":4,"ws":true},{"text":"I","start":20,"end":21,"id":5,"ws":true},{"text":"want","start":22,"end":26,"id":6,"ws":true},{"text":"to","start":27,"end":29,"id":7,"ws":true},{"text":"openh","start":30,"end":35,"id":8,"ws":true},{"text":"a","start":36,"end":37,"id":9,"ws":true},{"text":"chicken","start":38,"end":45,"id":10,"ws":true},{"text":"account","start":46,"end":53,"id":11,"ws":false}],"_view_id":"spans_manual","spans":[{"start":0,"end":15,"token_start":0,"token_end":3,"label":"SUPERFLUOUS_INFO"},{"start":30,"end":35,"token_start":8,"token_end":8,"label":"TYPO"},{"start":38,"end":45,"token_start":10,"token_end":10,"label":"STT_ERROR"}],"answer":"accept","_annotator_id":"dataset_name-cheyanne","_session_id":"dataset_name-cheyanne"}

Thank you,
Cheyanne

Hi @cheyanneb!

So you're just looking to add the raw span text to each span dict?

I think you could just add this to db-out:

for eg in examples:
    # "spans" can be missing if nothing was highlighted, so default to an empty list
    for span in eg.get("spans", []):
        span["text"] = eg["text"][span["start"]:span["end"]]
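To make it concrete, here is a minimal sketch of that loop run on the example record from the question, outside of Prodigy. The helper name add_span_text is just something I made up for illustration, and the record is trimmed down to the "text" and "spans" keys.

```python
from typing import Any, Dict, List


def add_span_text(examples: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Add a "text" key to every span by slicing the example text.

    Hypothetical helper for illustration only; it is not part of Prodigy.
    """
    for eg in examples:
        # Examples without any annotated spans are left untouched.
        for span in eg.get("spans", []):
            span["text"] = eg["text"][span["start"]:span["end"]]
    return examples


# Trimmed copy of the exported record from the question.
example = {
    "text": "I live in idaho and I want to openh a chicken account",
    "spans": [
        {"start": 0, "end": 15, "label": "SUPERFLUOUS_INFO"},
        {"start": 30, "end": 35, "label": "TYPO"},
        {"start": 38, "end": 45, "label": "STT_ERROR"},
    ],
}

for span in add_span_text([example])[0]["spans"]:
    print(span["label"], "->", repr(span["text"]))
# SUPERFLUOUS_INFO -> 'I live in idaho'
# TYPO -> 'openh'
# STT_ERROR -> 'chicken'
```

The character offsets are end-exclusive, so a plain Python slice is all that is needed.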

If you create a new flag argument (add_span_text) to turn this on or off (off by default), you could run this:

from pathlib import Path
from typing import Optional, Union

import srsly
from prodigy.components.db import connect
from prodigy.util import msg

def db_out(
    set_id: str,
    out_dir: Optional[Union[str, Path]] = None,
    answer: Optional[str] = None,
    flagged_only: bool = False,
    dry: bool = False,
    add_span_text: bool = False,
) -> None:
    """
    Export annotations from the database. Files will be exported in
    Prodigy's JSONL format.
    """
    DB = connect()
    if set_id not in DB:
        msg.fail(f"Can't find '{set_id}' in database {DB.db_name}", exits=1)
    examples = DB.get_dataset_examples(set_id)
    if flagged_only:
        examples = [eg for eg in examples if eg.get("flagged")]
    if answer:
        examples = [eg for eg in examples if eg.get("answer") == answer]

    # Optionally add the raw text covered by each span
    if add_span_text:
        for eg in examples:
            # "spans" can be missing if nothing was highlighted
            for span in eg.get("spans", []):
                span["text"] = eg["text"][span["start"]:span["end"]]

    if out_dir is None:
        for eg in examples:
            print(srsly.json_dumps(eg))
    else:
        out_dir = Path(out_dir)
        if not out_dir.exists():
            out_dir.mkdir()
        out_file = out_dir / f"{set_id}.jsonl"
        if not dry:
            srsly.write_jsonl(out_file, examples)
        msg.good(
            f"Exported {len(examples)} annotations from '{set_id}' in database {DB.db_name}",
            out_file.resolve(),
        )

Does this work?


Thanks @ryanwesslen! Just a clarification: when you said "add this to db-out", is this in commands.py?

Yes! That's where db-out is. You can modify the built-in db-out or run it as a local script (e.g., adding -F my_dbout_script.py). It's your choice.
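If you would rather not touch Prodigy's code at all, a third option is to post-process the file that db-out already gave you. This is only a sketch using the standard library: the function name add_span_text_to_jsonl and the output file name are hypothetical, and it assumes one JSON record per line, each with "text" and (optionally) "spans" as in the export above.

```python
import json
from pathlib import Path
from typing import Union


def add_span_text_to_jsonl(
    in_path: Union[str, Path], out_path: Union[str, Path]
) -> int:
    """Copy a JSONL export, adding a "text" key to every span.

    Hypothetical helper, not part of Prodigy. Returns the number of
    records written.
    """
    count = 0
    with Path(in_path).open(encoding="utf8") as src, Path(out_path).open(
        "w", encoding="utf8"
    ) as dst:
        for line in src:
            if not line.strip():
                continue  # skip blank lines
            eg = json.loads(line)
            text = eg.get("text", "")
            # Records without spans are written back unchanged.
            for span in eg.get("spans", []):
                span["text"] = text[span["start"]:span["end"]]
            dst.write(json.dumps(eg, ensure_ascii=False) + "\n")
            count += 1
    return count
```

With the export from the question, the call would be something like add_span_text_to_jsonl("my_output.jsonl", "my_output_with_text.jsonl"). The database itself is left unchanged; only the copy gets the extra key.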


This worked. Thank you!