spans.manual merge tokens using db-out

cheyanneb · November 15, 2022, 5:23pm

Is there a way to alter the db-out output for spans.manual annotations so that each span also shows the string it covers ("I live in idaho" = SUPERFLUOUS_INFO, "openh" = TYPO, "chicken" = STT_ERROR), alongside the tokenized output?

Example:

PRODIGY_ALLOWED_SESSIONS=cheyanne PRODIGY_LOGGING=verbose prodigy spans.manual dataset_name blank:en /path/spans_test2.jsonl --label SUPERFLUOUS_INFO,TYPO,STT_ERROR

db-out

prodigy db-out dataset_name-cheyanne > my_output.jsonl

Tokenized span output:

{"text":"I live in idaho and I want to openh a chicken account","_input_hash":-1763618278,"_task_hash":-1693263404,"tokens":[{"text":"I","start":0,"end":1,"id":0,"ws":true},{"text":"live","start":2,"end":6,"id":1,"ws":true},{"text":"in","start":7,"end":9,"id":2,"ws":true},{"text":"idaho","start":10,"end":15,"id":3,"ws":true},{"text":"and","start":16,"end":19,"id":4,"ws":true},{"text":"I","start":20,"end":21,"id":5,"ws":true},{"text":"want","start":22,"end":26,"id":6,"ws":true},{"text":"to","start":27,"end":29,"id":7,"ws":true},{"text":"openh","start":30,"end":35,"id":8,"ws":true},{"text":"a","start":36,"end":37,"id":9,"ws":true},{"text":"chicken","start":38,"end":45,"id":10,"ws":true},{"text":"account","start":46,"end":53,"id":11,"ws":false}],"_view_id":"spans_manual","spans":[{"start":0,"end":15,"token_start":0,"token_end":3,"label":"SUPERFLUOUS_INFO"},{"start":30,"end":35,"token_start":8,"token_end":8,"label":"TYPO"},{"start":38,"end":45,"token_start":10,"token_end":10,"label":"STT_ERROR"}],"answer":"accept","_annotator_id":"dataset_name-cheyanne","_session_id":"dataset_name-cheyanne"}

Thank you,
Cheyanne

ryanwesslen · November 16, 2022, 4:17pm

Hi @cheyanneb!

So you're just looking to add the raw span text to each span dict?

Thinking you could just add this to db-out:

for eg in examples:
    for span in eg["spans"]:
        span["text"] = eg["text"][span["start"]:span["end"]]
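To see what that slice produces, here is a minimal sketch run against a trimmed copy of the record from the question (tokens omitted). The helper name attach_span_text is only for illustration and is not part of Prodigy; it also assumes that start and end are character offsets into eg["text"], which matches the exported example.

```python
# Trimmed copy of the exported record from the question (tokens omitted).
eg = {
    "text": "I live in idaho and I want to openh a chicken account",
    "spans": [
        {"start": 0, "end": 15, "label": "SUPERFLUOUS_INFO"},
        {"start": 30, "end": 35, "label": "TYPO"},
        {"start": 38, "end": 45, "label": "STT_ERROR"},
    ],
}


def attach_span_text(example):
    """Add the covered substring to every span dict (hypothetical helper)."""
    # .get() with a default keeps this safe for examples without spans.
    for span in example.get("spans", []):
        span["text"] = example["text"][span["start"]:span["end"]]
    return example


attach_span_text(eg)
for span in eg["spans"]:
    print(span["label"], repr(span["text"]))
```

Each span dict keeps its original keys and gains a "text" key holding the substring it covers.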

If you create a new flag argument (add_span_text) to turn this on or off (off by default), you could run this:

from pathlib import Path
from typing import Optional, Union

import srsly
from prodigy.components.db import connect
from prodigy.util import msg

def db_out(
    set_id: str,
    out_dir: Optional[Union[str, Path]] = None,
    answer: Optional[str] = None,
    flagged_only: bool = False,
    dry: bool = False,
    add_span_text: bool = False,
) -> None:
    """
    Export annotations from the database. Files will be exported in
    Prodigy's JSONL format.
    """
    DB = connect()
    if set_id not in DB:
        msg.fail(f"Can't find '{set_id}' in database {DB.db_name}", exits=1)
    examples = DB.get_dataset_examples(set_id)
    if flagged_only:
        examples = [eg for eg in examples if eg.get("flagged")]
    if answer:
        examples = [eg for eg in examples if eg.get("answer") == answer]

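    # Add the covered substring to each span. Using .get() avoids a KeyError
    # for examples that have no "spans" key.
    if add_span_text:
        for eg in examples:
            for span in eg.get("spans", []):
                span["text"] = eg["text"][span["start"]:span["end"]]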

    if out_dir is None:
        for eg in examples:
            print(srsly.json_dumps(eg))
    else:
        out_dir = Path(out_dir)
        # Create the output directory (and any missing parents) if needed.
        out_dir.mkdir(parents=True, exist_ok=True)
        out_file = out_dir / f"{set_id}.jsonl"
        if not dry:
            srsly.write_jsonl(out_file, examples)
        msg.good(
            f"Exported {len(examples)} annotations from '{set_id}' in database {DB.db_name}",
            out_file.resolve(),
        )

Does this work?

cheyanneb · November 16, 2022, 11:52pm

Thanks @ryanwesslen! Just a clarification: when you said "add this to db-out", is this in commands.py?

ryanwesslen · November 17, 2022, 1:38pm

Yes! That's where db-out is. You can modify the built-in db-out or run it as a local script (e.g., adding -F my_dbout_script.py). It's your choice.
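A third option, if you would rather not touch Prodigy at all, is to post-process the file that db-out already writes. This is only a sketch using the standard library: the function names with_span_text and convert_file are made up for illustration, and it assumes each line is one JSON record with character-offset spans, as in the export above.

```python
import json


def with_span_text(lines):
    """Yield JSONL strings with a "text" key added to every span."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        eg = json.loads(line)
        # "spans" may be missing or empty; in that case the record
        # is passed through unchanged.
        for span in eg.get("spans") or []:
            span["text"] = eg["text"][span["start"]:span["end"]]
        yield json.dumps(eg, ensure_ascii=False)


def convert_file(in_path, out_path):
    """Read a db-out JSONL export and write a copy with span text added."""
    with open(in_path, encoding="utf8") as f_in, \
            open(out_path, "w", encoding="utf8") as f_out:
        for out_line in with_span_text(f_in):
            f_out.write(out_line + "\n")
```

For the export in the question, that would be something like convert_file("my_output.jsonl", "my_output_with_text.jsonl"), where the second filename is just a placeholder. The dataset in the database stays untouched, so this is easy to rerun or drop.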

cheyanneb · November 18, 2022, 3:25am

This worked. Thank you!
