empty spans and spans with no 'text' attribute

Hi. I'm having two issues. The output of db-out when I save annotations produces empty spans and spans with no 'text' attribute, while when I used db-out before, spans were an empty list ([ ]) or a list of dictionaries with a 'text' attribute (key). I need the 'text' attribute to know what the span refers to in the input text. Here is a minimal working example where I try to recreate the problem.

First, here is the command I use to start annotations:

python3 -m prodigy ner.manual identify_dosage_non_dosage_validate_data_SB2 en_core_web_lg validate_data_dosage_annotations_SB2.jsonl --label non_dosage,dosage

which calls the file:
validate_data_dosage_annotations_SB2.jsonl (52.3 KB)

Next, I perform the annotations, annotating text as a dosage or non-dosage.
Then I save the annotations with

python3 -m prodigy db-out identify_dosage_non_dosage_validate_data_SB2 > validate_data_dosage_non_dosage_annotations_SB2.jsonl

The annotations are saved here:
validate_data_dosage_non_dosage_annotations_SB2.jsonl (53.1 KB)

Now, in order to visualize the annotations in a spreadsheet, I use this script in Python:

import pandas as pd

df_jsonl_annotations = pd.read_json('validate_data_dosage_non_dosage_annotations_SB2.jsonl', lines=True)

df_jsonl_annotations.to_csv('validate_data_dosage_non_dosage_annotations_SB2.csv', index=False)

The results can be seen here (originally a CSV file) (which I have truncated for readability):

text	_input_hash	_task_hash	_is_binary	tokens	_view_id	answer	_timestamp	spans
January 9 - 241 6 - 375mg split into 3 doses. 96m deadlift/back/shoulder session. 30m cardio. 7,872 steps. 1,640 calories at 17g (7g net) carbs, 93g fat, 128g protein. 1 5g water January 10 - 241 4 - 375mg	1201376478	-478339982	FALSE	[{'text': 'January', 'start': 0, 'end': 7, 'id': 0, 'ws': True}, {'text': '9', 'start': 8, 'end': 9, 'id': 1, 'ws': True}, {'text': '-', 'start': 10, 'end': 11, 'id': 2, 'ws': True}	ner_manual	accept	1673316381	[{'start': 18, 'end': 25, 'token_start': 5, 'token_end': 7, 'label': 'non_dosage'}, {'start': 125, 'end': 143, 'token_start': 32, 'token_end': 39, 'label': 'non_dosage'}
 Originally Posted by itismethebeeFirst off, I turned 18 this year.	-681172102	1046073490	FALSE	[{'text': ' ', 'start': 0, 'end': 1, 'id': 0, 'ws': False}, {'text': 'Originally', 'start': 1, 'end': 11, 'id': 1, 'ws': True}, {'text': 'Posted', 'start': 12, 'end': 18, 'id': 2, 'ws': True}				
d': 484	 'ws': True}	 {'text': 'this'	 'start': 2191	 'end': 2195	 'id': 485	 'ws': True}	 {'text': 'post'	

As you can see, the first line contains spans with no 'text' attribute, while the second contains empty spans, which I don't think should be the case.
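One reason the spreadsheet is hard to read is that tokens and spans are nested lists, which pandas writes to CSV as stringified Python objects. A minimal sketch of an alternative, using only the standard library, that writes one row per span instead. The function names, the column names, and the sample record are illustrative assumptions, not part of Prodigy:

```python
import csv
import io
import json

FIELDS = ["input_hash", "answer", "label", "start", "end", "span_text"]


def spans_to_rows(jsonl_lines):
    """Yield one flat dict per span found in the exported records."""
    for line in jsonl_lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        eg = json.loads(line)
        # "spans" may be absent (or null) for records without annotations
        for span in eg.get("spans") or []:
            yield {
                "input_hash": eg.get("_input_hash"),
                "answer": eg.get("answer"),
                "label": span.get("label"),
                "start": span["start"],
                "end": span["end"],
                # recover the span text from the character offsets
                "span_text": eg["text"][span["start"]:span["end"]],
            }


def write_csv(rows, handle):
    """Write the flat rows to an open text file handle."""
    writer = csv.DictWriter(handle, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(rows)


# Small in-memory demonstration with a made-up record
sample = json.dumps({
    "text": "Take 375mg daily.",
    "_input_hash": 1,
    "answer": "accept",
    "spans": [{"start": 5, "end": 10, "label": "dosage"}],
})
buffer = io.StringIO()
write_csv(spans_to_rows([sample]), buffer)
print(buffer.getvalue())
```

For a real export, the same two functions could be given an open file for the db-out JSONL and a second file opened with newline="" for the CSV.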

hi @stefan.bartell!

Thanks for your question and welcome to the Prodigy community :wave:

This is a bit odd. Your annotated file has "answer":"accept" for each record, indicating the annotations were accepted (saved), but yes, they should include your spans as a list of dictionaries, one dictionary per span. For ner.manual, saved annotated spans will be under the spans key. See this link for what the data looks like.

When I did the steps below, everything worked out fine:

python3 -m prodigy ner.manual identify_dosage_non_dosage_validate_data_SB2 en_core_web_lg validate_data_dosage_annotations_SB2.jsonl --label non_dosage,dosage

Then I annotated two spans and accepted them by clicking the green "Accept" button.

Then, on the next record, I clicked the Save button at the top.

I can now go back to my terminal and shut down the server by pressing CTRL + C. When you do this, the CLI also confirms whether your annotation was saved:

$ python3 -m prodigy ner.manual dosage_dataset en_core_web_lg data/validate_data_dosage_annotations_SB2.jsonl --label non_dosage,dosage
Using 2 label(s): non_dosage, dosage
Added dataset dosage_dataset to database SQLite.

✨  Starting the web server at http://localhost:8080 ...
Open the app in your browser and start annotating!

^C
✔ Saved 1 annotations to database SQLite
Dataset: dosage_dataset
Session ID: 2023-01-10_14-17-06

Now if I export that dataset with db-out:

python3 -m prodigy db-out dosage_dataset > dosage.jsonl

I get:

{
  "text": "January 9 - 241 6 - 375mg split into 3 doses. 96m deadlift/back/shoulder session. 30m cardio. 7,872 steps. 1,640 calories at 17g (7g net) carbs, 93g fat, 128g protein. 1 5g water January 10 - 241 4 - 375mg split into 3 doses. 74m arms session. 30m cardio. 8,402 steps. 1,640 calories at 26g (15g net) carbs, 127g fat, 106g protein. 1 5g water Expected a weight drop by now so I hope it's just water retention as I've been religious with everything and eating has been on point. My circadian rhythms do usually ebb and flow where I'll get a \"whoosh\" weight drop every once in awhile. I'm gonna keep on, keepin' on.",
  "_input_hash": 1201376478,
  "_task_hash": -478339982,
  "_is_binary": false,
  "tokens": [
    {
      "text": "January",
      "start": 0,
      "end": 7,
      "id": 0,
      "ws": true
    },
    .
    .
    .
    {
      "text": ".",
      "start": 612,
      "end": 613,
      "id": 163,
      "ws": false
    }
  ],
  "_view_id": "ner_manual",
  "answer": "accept",
  "_timestamp": 1673378743,
  "spans": [
    {
      "start": 12,
      "end": 44,
      "token_start": 3,
      "token_end": 11,
      "label": "dosage"
    },
    {
      "start": 192,
      "end": 224,
      "token_start": 56,
      "token_end": 64,
      "label": "dosage"
    }
  ]
}

These are the two spans that were saved.

By default, those spans do not include the raw span text. You can add it by modifying your db-out recipe.
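Because each span stores character offsets into the record's text, the span text can be recovered by slicing. A minimal sketch of that idea on a single record; the function name add_span_text is hypothetical, and the sample uses only the first sentence of the text above:

```python
import json


def add_span_text(example):
    """Return the example with a 'text' key added to each span."""
    text = example["text"]
    # records without annotations may have no "spans" key at all
    for span in example.get("spans") or []:
        span["text"] = text[span["start"]:span["end"]]
    return example


record = {
    "text": "January 9 - 241 6 - 375mg split into 3 doses.",
    "spans": [{"start": 12, "end": 44, "label": "dosage"}],
}
print(json.dumps(add_span_text(record)["spans"]))
```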

Can you double check that you annotated correctly (e.g., highlighting the spans, clicking Accept (green) button, and saving your annotations by clicking the Save Button)?

Thanks @ryanwesslen. It looks like you were able to reproduce the problem. I read your reply to @cheyanneb. The code you suggested to add to db-out:

for eg in examples:
    for span in eg["spans"]:
        span['text'] = eg['text'][span['start']:span['end']]

You indicated that db-out is located in commands.py. What is the path to commands.py? I have multiple files with that name on my machine, but I can't tell if any of them are for Prodigy.

As for the other code you suggested:

from pathlib import Path
from typing import Optional, Union

import srsly
from prodigy.components.db import connect
from prodigy.util import msg

def db_out(
    set_id: str,
    out_dir: Optional[Union[str, Path]] = None,
    answer: str = None,
    flagged_only: bool = False,
    dry: bool = False,
    add_span_text: bool = False,
) -> None:
    """
    Export annotations from the database. Files will be exported in
    Prodigy's JSONL format.
    """
    DB = connect()
    if set_id not in DB:
        msg.fail(f"Can't find '{set_id}' in database {DB.db_name}", exits=1)
    examples = DB.get_dataset_examples(set_id)
    if flagged_only:
        examples = [eg for eg in examples if eg.get("flagged")]
    if answer:
        examples = [eg for eg in examples if eg.get("answer") == answer]

    # add span text
    if add_span_text:
        for eg in examples:
            for span in eg["spans"]:
                span['text'] = eg['text'][span['start']:span['end']]

    if out_dir is None:
        for eg in examples:
            print(srsly.json_dumps(eg))
    else:
        out_dir = Path(out_dir)
        if not out_dir.exists():
            out_dir.mkdir()
        out_file = out_dir / f"{set_id}.jsonl"
        if not dry:
            srsly.write_jsonl(out_file, examples)
        msg.good(
            f"Exported {len(examples)} annotations from '{set_id}' in database {DB.db_name}",
            out_file.resolve(),
        )

I wasn't sure how to run this. Am I adding it to a file or is it in its own file? Is it in my_dbout_script.py, which is run with -F my_dbout_script.py? How do I set add_span_text to True?

This depends on where your Prodigy library is installed.

Run python -m prodigy stats and find the Location: entry. From there, open that path in a file browser and look for the recipes/commands.py script. (FYI, you can find other Prodigy recipes in that recipes/ folder too.)
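The same lookup can be sketched from Python with the standard library's importlib, which reports where a package is installed without importing it. The helper name package_dir is hypothetical, and the recipes/commands.py sub-path is taken from the description above, so it may differ between versions:

```python
import importlib.util
from pathlib import Path
from typing import Optional


def package_dir(name: str) -> Optional[Path]:
    """Return the directory of an installed package, or None if not found."""
    spec = importlib.util.find_spec(name)
    if spec is None or not spec.submodule_search_locations:
        return None
    return Path(list(spec.submodule_search_locations)[0])


location = package_dir("prodigy")
if location is None:
    print("prodigy is not installed in this environment")
else:
    # sub-path as described above; treat it as an assumption for your version
    print(location / "recipes" / "commands.py")
```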

Yes, the easiest way would be to run it as a custom recipe. But to run it from the command line, you will need to import prodigy at the top of the file and put the @prodigy.recipe decorator directly above your function.

@prodigy.recipe(
    "db-out",
    set_id=("Name of dataset to export", "positional", None, str),
    out_dir=("Path to output directory", "positional", None, str),
    answer=("Only export annotations with this answer", "option", "a", str),
    flagged_only=("DEPRECATED: Only export flagged annotations", "flag", "F", bool),
    dry=("Perform a dry run", "flag", "D", bool),
    add_span_text=("Flag to add in the text spans", "flag", None, bool),
)

If you're new to decorators, here's a great tutorial on them.
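As a rough illustration of what a recipe-style decorator does, here is a plain-Python sketch that registers a function under a name. The RECIPES dictionary and the recipe function are hypothetical stand-ins for illustration only, not Prodigy's implementation:

```python
from typing import Callable, Dict

# Hypothetical registry, for illustration only (not Prodigy's implementation)
RECIPES: Dict[str, Callable] = {}


def recipe(name: str, **arg_descriptions):
    """Decorator factory: register the decorated function under `name`."""
    def decorator(func: Callable) -> Callable:
        func.arg_descriptions = arg_descriptions  # keep CLI metadata on the function
        RECIPES[name] = func
        return func
    return decorator


# The decorator goes directly above the def it applies to
@recipe("db-out", set_id=("Name of dataset to export", "positional", None, str))
def db_out(set_id: str) -> str:
    return f"exporting {set_id}"


print(RECIPES["db-out"]("my_dataset"))
```

Note that in Python a decorator must be followed by the function definition itself; putting another statement between the two is a syntax error.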

Then you should be able to run:

python -m prodigy db-out my_dataset --add_span_text -F my_dbout_script.py

Can you show the code snippet for where

for eg in examples:
    for span in eg["spans"]:
        span['text'] = eg['text'][span['start']:span['end']]

is added to commands.py? I tried adding it to @recipe( "db-out" and def db_out( but got syntax errors.

This code snippet isn't included in commands.py by default. That's why I was suggesting that, if you want db-out to behave this way, your best bet is likely to create a custom recipe and run that separately. Let me know if you have any further questions!

Hi @ryanwesslen. I appreciate your suggestions. Here is what I tried: saved the code here as my_dbout_script.py:

import prodigy
from pathlib import Path
from typing import Optional, Union

import srsly
from prodigy.components.db import connect
from prodigy.util import msg

@prodigy.recipe(
    "db-out",
    set_id=("Name of dataset to export", "positional", None, str),
    out_dir=("Path to output directory", "positional", None, str),
    answer=("Only export annotations with this answer", "option", "a", str),
    flagged_only=("DEPRECATED: Only export flagged annotations", "flag", "F", bool),
    dry=("Perform a dry run", "flag", "D", bool),
    add_span_text=("Flag to add in the text spans", "flag", None, bool),
)

def db_out(
    set_id: str,
    out_dir: Optional[Union[str, Path]] = None,
    answer: str = None,
    flagged_only: bool = False,
    dry: bool = False,
    add_span_text: bool = False,
) -> None:
    """
    Export annotations from the database. Files will be exported in
    Prodigy's JSONL format.
    """
    DB = connect()
    if set_id not in DB:
        msg.fail(f"Can't find '{set_id}' in database {DB.db_name}", exits=1)
    examples = DB.get_dataset_examples(set_id)
    if flagged_only:
        examples = [eg for eg in examples if eg.get("flagged")]
    if answer:
        examples = [eg for eg in examples if eg.get("answer") == answer]

    # add span text
    if add_span_text:
        for eg in examples:
            for span in eg["spans"]:
                span['text'] = eg['text'][span['start']:span['end']]

    if out_dir is None:
        for eg in examples:
            print(srsly.json_dumps(eg))
    else:
        out_dir = Path(out_dir)
        if not out_dir.exists():
            out_dir.mkdir()
        out_file = out_dir / f"{set_id}.jsonl"
        if not dry:
            srsly.write_jsonl(out_file, examples)
        msg.good(
            f"Exported {len(examples)} annotations from '{set_id}' in database {DB.db_name}",
            out_file.resolve(),
        )

Then ran the command:

python3 -m prodigy db-out identify_dosage_non_dosage_validate_data_SB2 > validate_data_dosage_non_dosage_annotations_SB2.jsonl -add_span_text -F my_dbout_script.py

But this resulted in a blank validate_data_dosage_non_dosage_annotations_SB2.jsonl file. I also reran the annotations to make sure they weren't empty. Do you have an idea of what I am doing wrong?

Thinking about it more, let's just drop the add_span_text flag and always add the span text by default. If you don't want it included, you can just use the standard db-out.

So use this script:

# my_dbout_script.py
import prodigy
from pathlib import Path
from typing import Optional, Union

import srsly
from prodigy.components.db import connect
from prodigy.util import msg

@prodigy.recipe(
    "db-out",
    set_id=("Name of dataset to export", "positional", None, str),
    out_dir=("Path to output directory", "positional", None, str),
    answer=("Only export annotations with this answer", "option", "a", str),
    flagged_only=("DEPRECATED: Only export flagged annotations", "flag", "F", bool),
    dry=("Perform a dry run", "flag", "D", bool),
)
def db_out(
    set_id: str,
    out_dir: Optional[Union[str, Path]] = None,
    answer: Optional[str] = None,
    flagged_only: bool = False,
    dry: bool = False,
) -> None:
    """
    Export annotations from the database. Files will be exported in
    Prodigy's JSONL format.
    """
    DB = connect()
    if set_id not in DB:
        msg.fail(f"Can't find '{set_id}' in database {DB.db_name}", exits=1)
    examples = DB.get_dataset_examples(set_id)
    if flagged_only:
        examples = [eg for eg in examples if eg.get("flagged")]
    if answer:
        examples = [eg for eg in examples if eg.get("answer") == answer]

    for eg in examples:
        if eg.get('spans') is not None:
            for span in eg.get('spans'):
                span['text'] = eg['text'][span['start']:span['end']]

    if out_dir is None:
        for eg in examples:
            print(srsly.json_dumps(eg))
    else:
        out_dir = Path(out_dir)
        if not out_dir.exists():
            out_dir.mkdir()
        out_file = out_dir / f"{set_id}.jsonl"
        if not dry:
            srsly.write_jsonl(out_file, examples)
        msg.good(
            f"Exported {len(examples)} annotations from '{set_id}' in database {DB.db_name}",
            out_file.resolve(),
        )

Then try this:

python3 -m prodigy db-out identify_dosage_non_dosage_validate_data_SB2 -F my_dbout_script.py > validate_data_dosage_non_dosage_annotations_SB2.jsonl

Then I get the spans:

{
  "text": "January 9 - 241 6 - 375mg split into 3 doses. 96m deadlift/back/shoulder session. 30m cardio. 7,872 steps. 1,640 calories at 17g (7g net) carbs, 93g fat, 128g protein. 1 5g water January 10 - 241 4 - 375mg split into 3 doses. 74m arms session. 30m cardio. 8,402 steps. 1,640 calories at 26g (15g net) carbs, 127g fat, 106g protein. 1 5g water Expected a weight drop by now so I hope it's just water retention as I've been religious with everything and eating has been on point. My circadian rhythms do usually ebb and flow where I'll get a \"whoosh\" weight drop every once in awhile. I'm gonna keep on, keepin' on.",
  "_input_hash": 1201376478,
  "_task_hash": -478339982,
  "_is_binary": false,
  "tokens": [
    {
      "text": "January",
      "start": 0,
      "end": 7,
      "id": 0,
      "ws": true
    },
    {
      "text": "9",
      "start": 8,
      "end": 9,
      "id": 1,
      "ws": true
    },
...
    {
      "text": ".",
      "start": 612,
      "end": 613,
      "id": 163,
      "ws": false
    }
  ],
  "_view_id": "ner_manual",
  "answer": "accept",
  "_timestamp": 1673378743,
  "spans": [
    {
      "start": 12,
      "end": 44,
      "token_start": 3,
      "token_end": 11,
      "label": "dosage",
      "text": "241 6 - 375mg split into 3 doses"
    },
    {
      "start": 192,
      "end": 224,
      "token_start": 56,
      "token_end": 64,
      "label": "dosage",
      "text": "241 4 - 375mg split into 3 doses"
    }
  ]
}

Looks like it's working for you! I guess it is something small that is still causing me trouble. I'm getting File ...line 40, in db_out for span in eg["spans"]: KeyError: 'spans'

Ah, sorry! Completely forgot. You get this error when a record has no spans key.

Change to this (I've also updated the code above):

    for eg in examples:
        if eg.get('spans') is not None:
            for span in eg.get('spans'):
                span['text'] = eg['text'][span['start']:span['end']]

By using eg.get('spans') instead of eg["spans"], you won't get an error when the key is missing.
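The difference is easy to see on plain dictionaries. A small sketch with made-up records; the `or []` variant is an alternative spelling of the None check that also covers a spans value of null:

```python
# Hypothetical records, for illustration only
records = [
    {"text": "no annotations here"},        # "spans" key missing
    {"text": "skipped", "spans": None},     # "spans" present but null
    {"text": "Take 375mg daily.", "spans": [{"start": 5, "end": 10}]},
]

for eg in records:
    # eg["spans"] would raise KeyError on the first record;
    # eg.get("spans") returns None instead, and `or []` turns None into an empty list
    for span in eg.get("spans") or []:
        span["text"] = eg["text"][span["start"]:span["end"]]

print(records[2]["spans"])
```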

Crossing fingers that this should work :crossed_fingers:

Looks like it's working! Thanks for all your help!
