I'm using a variant of the choice recipe and passing options as a list of strings on the command line. Many of those strings contain commas, which shouldn't be a problem because they are also in quotes. Unfortunately, the split_string utility called in the recipe doesn't respect the quotes and is turning this:
-o ["foo","bar","Uses 'today,' 'tomorrow' or 'yesterday.'"]
Into this:
For this project, my options are just going to have to be grammatically incorrect. Is there another solution? When I tried to set my options within the recipe like this cat example, I got a React error. Please advise.
def split_string(text: str) -> List[str]:
    """Split a string on commas. Mostly as a converter function in CLI argument
    annotation to convert comma-separated lists of labels.

    text (str): The text to split.
    RETURNS (list): The split text or empty list if text is false.
    """
    if not text:
        return []
    return [t.strip() for t in text.split(",")]
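To make the problem concrete, here is a minimal sketch that runs the same logic on an option string like the one in the question. The input string is an assumption about what the recipe roughly receives from -o once the shell has processed the quotes.

```python
from typing import List


def split_string(text: str) -> List[str]:
    # Same logic as the helper quoted above: a plain split on every comma.
    if not text:
        return []
    return [t.strip() for t in text.split(",")]


# Hypothetical option string; the third option contains a comma of its own.
options = split_string("foo,bar,Uses 'today,' 'tomorrow' or 'yesterday.'")
for option in options:
    print(repr(option))
```

The third option is cut in two at its inner comma, so the recipe ends up with four options instead of three.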
As a quick workaround, replace the current split_string in your recipe with this function, which lets you escape a comma with a backslash (\,) so it isn't treated as a separator:
import re
from typing import List


def split_string(text: str) -> List[str]:
    """Split a string on commas, ignoring escaped commas (\\,).

    Args:
        text (str): The text to split.
    Returns:
        list: The split text or empty list if text is falsy.
    """
    if not text:
        return []
    # Split the text on commas that are not preceded by a backslash
    parts = re.split(r'(?<!\\),', text)
    # Remove escape characters from the split parts
    result = [part.replace("\\,", ",") for part in parts]
    return [t.strip() for t in result]
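A quick check of the escape handling, with the function repeated in condensed form so the snippet runs on its own. The option string is hypothetical; note the backslash before the comma that should be kept.

```python
import re
from typing import List


def split_string(text: str) -> List[str]:
    # Condensed version of the workaround above.
    if not text:
        return []
    # Split only on commas that are not preceded by a backslash.
    parts = re.split(r"(?<!\\),", text)
    # Turn each escaped comma back into a plain comma, then trim whitespace.
    return [part.replace("\\,", ",").strip() for part in parts]


# Raw string so the backslash reaches the function unchanged.
raw = r"foo,bar,Uses 'today\,' 'tomorrow' or 'yesterday.'"
options = split_string(raw)
print(options)
```

On the command line this should correspond to something like `prodigy choice my_dataset data.jsonl -o "foo,bar,Uses 'today\,' 'tomorrow' or 'yesterday.'" -F choice.py`, where the dataset and file names are placeholders. Inside double quotes a POSIX shell leaves `\,` untouched, but other shells may treat the backslash differently.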
I then used this instead of importing the original split_string in the choice recipe:
# choice.py
import prodigy
from prodigy.components.loaders import JSONL
from typing import List
import re


def split_string(text: str) -> List[str]:
    """Split a string on commas, ignoring escaped commas (\\,).

    Args:
        text (str): The text to split.
    Returns:
        list: The split text or empty list if text is falsy.
    """
    if not text:
        return []
    # Split the text on commas that are not preceded by a backslash
    parts = re.split(r'(?<!\\),', text)
    # Remove escape characters from the split parts
    result = [part.replace("\\,", ",") for part in parts]
    return [t.strip() for t in result]


def add_options(stream, options):
    """Helper function to add options to every task in a stream."""
    options = [{"id": option, "text": option} for option in options]
    for task in stream:
        task["options"] = options
        yield task


# Recipe decorator with argument annotations: (description, argument type,
# shortcut, type / converter function called on value before it's passed to
# the function). Descriptions are also shown when typing --help.
@prodigy.recipe(
    "choice",
    dataset=("The dataset to use", "positional", None, str),
    source=("The source data as a JSONL file", "positional", None, str),
    options=("One or more comma-separated options", "option", "o", split_string),
    multiple=("Allow multiple choice", "flag", "M", bool),
)
def choice(dataset: str, source: str, options: List[str], multiple: bool = False):
    """
    Annotate data with multiple-choice options. The annotated examples will
    have an additional property `"accept": []` mapping to the ID(s) of the
    selected option(s).
    """
    # Load the stream from a JSONL file and return a generator that yields a
    # dictionary for each example in the data.
    stream = JSONL(source)
    # Add the options to all examples in the stream
    stream = add_options(stream, options)
    return {
        "view_id": "choice",  # Annotation interface to use
        "dataset": dataset,  # Name of dataset to save annotations
        "stream": stream,  # Incoming stream of examples
        "config": {  # Additional config settings
            # Allow multiple choice if flag is set
            "choice_style": "multiple" if multiple else "single",
            # Automatically accept and "lock in" selected answers if only
            # single choice is allowed
            "choice_auto_accept": False if multiple else True,
        },
    }
But I'll put in a ticket to see if we can build this expanded split_string into Prodigy so you don't have to use this workaround in the future. Thanks for the feedback!
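In the meantime, if you'd rather have quotes respected (as in your original command) than escape each comma, a quote-aware split is another option. This is only a sketch of a different technique, not what Prodigy does: it swaps in Python's standard csv module, and the name split_string_quoted is made up for illustration.

```python
import csv
from typing import List


def split_string_quoted(text: str) -> List[str]:
    """Split on commas, keeping commas that sit inside double quotes.

    Hypothetical alternative to split_string; not part of Prodigy.
    """
    if not text:
        return []
    # csv.reader treats "..." as a single field, so inner commas survive.
    row = next(csv.reader([text], skipinitialspace=True))
    return [field.strip() for field in row]


# Hypothetical option string with the third option wrapped in double quotes.
options = split_string_quoted(
    "foo,bar,\"Uses 'today,' 'tomorrow' or 'yesterday.'\""
)
print(options)
```

The catch is that the double quotes have to survive the shell and actually reach the recipe, which usually means escaping them on the command line, so the backslash approach may still be less fiddly in practice.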
One last point - this reminded me that some of the prodigy-recipes are a bit old or not aligned with the built-in recipes. The prodigy-recipes folder README mentions this:
Important note: The recipes in this repository aren't 100% identical to the built-in recipes shipped with Prodigy. They've been edited to include comments and more information, and some of them have been simplified to make it easier to follow what's going on, and to use them as the basis for a custom recipe.
For example, they still use JSONL to load your file, which with a more recent version of Prodigy will yield this warning in the terminal:
⚠ Prodigy automatically assigned an input/task hash because it was
missing. This automatic hashing will be deprecated as of Prodigy v2 because it
can lead to unwanted duplicates in custom recipes if the examples deviate from
the default assumptions. More information can found on the docs:
https://prodi.gy/docs/api-components#set_hashes
You can avoid this by using get_stream instead of JSONL. get_stream calls set_hashes underneath. By default, many built-in recipes like ner.manual use it with these defaults:
If you weren't aware, you can look at the built-in recipes by finding your installed Prodigy site-packages folder. You can find this by running prodigy stats and looking at the Location: folder. From there, look for the recipes folder.
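If you'd rather locate that folder from Python, here is a small sketch using only the standard library. The helper name package_dir is made up, and the "recipes" subfolder is the one mentioned above; the script only prints a path and doesn't check what's inside it.

```python
import importlib.util
from pathlib import Path
from typing import Optional


def package_dir(name: str) -> Optional[Path]:
    """Return the install folder of a top-level package, or None if not found."""
    spec = importlib.util.find_spec(name)
    if spec is None or spec.origin is None:
        return None
    # spec.origin points at the package's __init__ file; its parent is the folder.
    return Path(spec.origin).parent


prodigy_dir = package_dir("prodigy")
if prodigy_dir is None:
    print("prodigy is not installed in this environment")
else:
    print(prodigy_dir / "recipes")
```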