split_string utility doesn't honor quotes

Heather · September 22, 2023, 12:14am

Hi.

I'm using a variant of the choice recipe and passing options as a list of strings on the command line. Many of those strings contain commas, which shouldn't be a problem because they are also in quotes. Unfortunately, the split_string utility called in the recipe doesn't respect the quotes and is turning this:
-o ["foo","bar","Uses 'today,' 'tomorrow' or 'yesterday.'"]

Into this:

For this project, my options are just going to have to be grammatically incorrect. Is there another solution? When I tried to set my options within the recipe like this cat example, I got a react error. Please advise.

Thanks.

Heather · September 22, 2023, 1:35am

For the record, I also tried escaping the commas.

ryanwesslen · September 22, 2023, 1:23pm

Hi @Heather!

Excellent point! Yes, I see what you mean.

The current split_string is very simple:

def split_string(text: str) -> List[str]:
    """Split a string on commas. Mostly as a converter function in CLI argument
    annotation to convert comma-separated lists of labels.

    text (str): The text to split.
    RETURNS (list): The split text or empty list if text is false.
    """
    if not text:
        return []
    return [t.strip() for t in text.split(",")]
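
To see why this breaks on the original input, here is a small stand-alone sketch. It copies the function above and assumes the shell has already stripped the double quotes from the -o value, so `raw` is an illustrative guess at what the recipe receives, not captured output.

```python
from typing import List


def split_string(text: str) -> List[str]:
    # Same logic as the built-in helper quoted above.
    if not text:
        return []
    return [t.strip() for t in text.split(",")]


# Assumed value of the -o argument once the shell has removed the double quotes.
raw = "foo,bar,Uses 'today,' 'tomorrow' or 'yesterday.'"
parts = split_string(raw)
print(parts)
```

The third option is cut in two at the comma after 'today, because str.split has no notion of quoting.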

As a quick workaround, replace the current split_string with this function in your recipe:

import re
from typing import List

def split_string(text: str) -> List[str]:
    """Split a string on commas, ignoring escaped commas (\\,).

    Args:
        text (str): The text to split.

    Returns:
        list: The split text or empty list if text is falsy.
    """
    if not text:
        return []
    # Split the text on commas that are not preceded by a backslash
    parts = re.split(r'(?<!\\),', text)
    # Remove escape characters from the split parts
    result = [part.replace("\\,", ",") for part in parts]
    return [t.strip() for t in result]

I then added this to the choice recipe instead of importing the original split_string:

# choice.py
import prodigy
from prodigy.components.loaders import JSONL
from typing import List
import re

def split_string(text: str) -> List[str]:
    """Split a string on commas, ignoring escaped commas (\\,).

    Args:
        text (str): The text to split.

    Returns:
        list: The split text or empty list if text is falsy.
    """
    if not text:
        return []
    # Split the text on commas that are not preceded by a backslash
    parts = re.split(r'(?<!\\),', text)
    # Remove escape characters from the split parts
    result = [part.replace("\\,", ",") for part in parts]
    return [t.strip() for t in result]


def add_options(stream, options):
    """Helper function to add options to every task in a stream."""
    options = [{"id": option, "text": option} for option in options]
    for task in stream:
        task["options"] = options
        yield task


# Recipe decorator with argument annotations: (description, argument type,
# shortcut, type / converter function called on value before it's passed to
# the function). Descriptions are also shown when typing --help.
@prodigy.recipe(
    "choice",
    dataset=("The dataset to use", "positional", None, str),
    source=("The source data as a JSONL file", "positional", None, str),
    options=("One or more comma-separated options", "option", "o", split_string),
    multiple=("Allow multiple choice", "flag", "M", bool),
)
def choice(dataset: str, source: str, options: List[str], multiple: bool = False):
    """
    Annotate data with multiple-choice options. The annotated examples will
    have an additional property `"accept": []` mapping to the ID(s) of the
    selected option(s).
    """
    # Load the stream from a JSONL file and return a generator that yields a
    # dictionary for each example in the data.
    stream = JSONL(source)

    # Add the options to all examples in the stream
    stream = add_options(stream, options)

    return {
        "view_id": "choice",  # Annotation interface to use
        "dataset": dataset,  # Name of dataset to save annotations
        "stream": stream,  # Incoming stream of examples
        "config": {  # Additional config settings
            # Allow multiple choice if flag is set
            "choice_style": "multiple" if multiple else "single",
            # Automatically accept and "lock in" selected answers if only
            # single choice is allowed
            "choice_auto_accept": False if multiple else True,
        },
    }

Running this:

python -m prodigy choice choice-data data/sms.jsonl -o foo,bar,"Uses 'today'\\, 'tomorrow' or 'yesterday.'" -F choice.py

And it worked!
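
As a rough check of what the recipe receives from that command, the sketch below runs the same escaped-comma logic on the string a POSIX shell would most likely pass through. The value of `raw` is an assumption: inside double quotes the shell collapses `\\` to a single backslash, and other shells (for example Windows cmd) may behave differently.

```python
import re
from typing import List


def split_string(text: str) -> List[str]:
    if not text:
        return []
    # Split only on commas that are not preceded by a backslash.
    parts = re.split(r"(?<!\\),", text)
    # Drop the escape character and surrounding whitespace from each part.
    return [part.replace("\\,", ",").strip() for part in parts]


# Assumed argv value: one literal backslash before the escaped comma.
raw = "foo,bar,Uses 'today'\\, 'tomorrow' or 'yesterday.'"
options = split_string(raw)
print(options)
```

This yields three options, and the escaped comma is kept inside the third one.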

But I'll put in a ticket to see if we can build this expanded split_string into Prodigy so you don't have to use this workaround in the future. Thanks for the feedback!
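
If you would rather keep the quotes than escape each comma, a quote-aware split is another possibility. This is only a sketch: `split_quoted` is a hypothetical name, not part of Prodigy, and it relies on Python's standard csv module. It also assumes the double quotes actually reach Python, which usually means protecting them from the shell.

```python
import csv
from typing import List


def split_quoted(text: str) -> List[str]:
    """Hypothetical helper: split on commas, keeping commas inside double quotes."""
    if not text:
        return []
    # csv.reader accepts any iterable of lines; a one-item list parses one row.
    row = next(csv.reader([text], skipinitialspace=True))
    return [field.strip() for field in row]


# The double quotes are still present here, unlike in a typical shell argument.
raw = 'foo,bar,"Uses \'today,\' \'tomorrow\' or \'yesterday.\'"'
print(split_quoted(raw))
```

The escaped-comma version above is simpler to get through a shell, since options that contain single quotes are awkward to wrap in another layer of quoting.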

One last point - this reminded me that some of the prodigy-recipes are a bit old or not aligned with the built-in recipes. The prodigy-recipes folder README mentions this:

Important note: The recipes in this repository aren't 100% identical to the built-in recipes shipped with Prodigy. They've been edited to include comments and more information, and some of them have been simplified to make it easier to follow what's going on, and to use them as the basis for a custom recipe.

For example, they still use JSONL to load your file, which with a more recent version of Prodigy will yield this warning in the terminal:

⚠ Prodigy automatically assigned an input/task hash because it was
missing. This automatic hashing will be deprecated as of Prodigy v2 because it
can lead to unwanted duplicates in custom recipes if the examples deviate from
the default assumptions. More information can found on the docs:
https://prodi.gy/docs/api-components#set_hashes

You can avoid this by using get_stream instead of JSONL. get_stream calls set_hashes underneath. By default, many built-in recipes like ner.manual use it with these defaults:

stream = get_stream(
    source, rehash=True, dedup=True, input_key="text"
)

If you weren't aware, you can look at the built-in recipes by finding your installed Prodigy site-packages folder. You can find this by running prodigy stats and looking at the Location: folder. From there, look for the recipes folder.
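
If you prefer to find that folder from Python, something along these lines should work. It is a sketch using only the standard library; `package_subdir` is a hypothetical helper name, and it assumes the installed package has a file-based location and a recipes subfolder as described above.

```python
import importlib.util
from pathlib import Path
from typing import Optional


def package_subdir(package: str, subdir: str) -> Optional[Path]:
    """Return <package>/<subdir> if the package is installed and the folder exists."""
    spec = importlib.util.find_spec(package)
    if spec is None or spec.origin is None:
        return None  # Package not installed, or it has no file-based location.
    candidate = Path(spec.origin).parent / subdir
    return candidate if candidate.is_dir() else None


# Prints the recipes folder, or None if Prodigy is not installed in this environment.
print(package_subdir("prodigy", "recipes"))
```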

Hope this helps!

Heather · September 22, 2023, 9:17pm

Thanks for the quick response! I tried that but hit errors.

First, re was undefined so I changed that line to parts = text.split.... That prompted this error:

error: argument -o/--options: invalid split_string value:

Thanks.

ryanwesslen · September 22, 2023, 9:37pm

Could you add import re instead of changing re.split to text.split?

Sorry, I forgot to add that to the function snippet; it was in the modified choice.py recipe.

Heather · September 22, 2023, 9:40pm

I totally missed that! Thanks. I've added it and will see what happens when I set up the next round of annotation.
Thank you!

Heather · September 25, 2023, 3:34pm

It worked! Thanks for the quick fix!
