Hello,
I have a custom recipe for ner.correct (code is below), and I would like to be able to configure the threshold for sentence splitting the way that you would in other recipes (with the split_sents and split_sents_threshold settings).
I found strange behaviour when combining these settings with the --unsegmented flag. If I add the flag to the prodigy command, as below, the text is left unsegmented and the split_sents_threshold setting is ignored.
python3 load_prodigy_data.py some_customer some_input_table | prodigy ner.correct.custom dataset_name en_core_web_lg - -F recipes.py --label PERSON --customer some_customer -U
If I omit that flag, sentences are split in a way that also ignores the split_sents_threshold setting.
Do I have to implement my own stream in order to configure sentence splitting with a custom ner.correct?
# Imports assumed for this snippet (they were not part of what I originally pasted)
import prodigy
from prodigy.recipes.ner import make_gold
from prodigy.util import get_labels, split_string
# get_rds is our own helper that returns a database connection (definition not shown)

@prodigy.recipe(
    "ner.correct.custom",
    dataset=("Dataset to save annotations to", "positional", None, str),
    spacy_model=("Loadable spaCy model with an entity recognizer", "positional", None, str),
    source=("Data to annotate (file path or '-' to read from standard input)", "positional", None, str),
    api=("DEPRECATED: API loader to use", "option", "a", str),
    loader=("Loader (guessed from file extension if not set)", "option", "lo", str),
    label=("Comma-separated label(s) to annotate or text file with one label per line", "option", "l", get_labels),
    exclude=("Comma-separated list of dataset IDs whose annotations to exclude", "option", "e", split_string),
    unsegmented=("Don't split sentences", "flag", "U", bool),
    # Add custom recipe CLI arguments
    customer=("Name of company customer", "option", "db", str),
)
def custom_correct(
    dataset,
    spacy_model,
    source,
    api=None,
    loader=None,
    label=None,
    exclude=None,
    unsegmented=False,
    customer=None,
):
    components = make_gold(dataset, spacy_model, source, api, loader, label, exclude, unsegmented)
    # Overwrite recipe components returned by the recipe and use custom arguments
    components["db"] = get_rds(customer)
    # Overwrite config settings
    components["config"]["exclude_by"] = "task"
    components["config"] = {**components["config"], **{"host": "0.0.0.0", "split_sents": True, "split_sents_threshold": 1000}}
    # Return recipe components
    return components
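
To make clear what I mean by "implementing my own stream", here is a minimal sketch of the behaviour I expected from split_sents_threshold: texts at or below a character limit stay whole, longer ones are split into sentences. This is only an illustration. The function name split_long_texts and its min_length parameter are hypothetical, the regex is a naive stand-in for spaCy's sentence segmentation, and treating the threshold as a character count is my assumption about how the setting works.

```python
import re
from typing import Dict, Iterable, Iterator

# Naive stand-in for a real sentence segmenter: break after ., ! or ?
_SENT_BOUNDARY = re.compile(r"(?<=[.!?])\s+")


def split_long_texts(stream: Iterable[Dict], min_length: int = 1000) -> Iterator[Dict]:
    """Yield tasks unchanged unless their text is longer than min_length characters."""
    for task in stream:
        text = task["text"]
        if len(text) <= min_length:
            # Short enough: keep the task as a single unit
            yield task
            continue
        for sentence in _SENT_BOUNDARY.split(text):
            if sentence.strip():
                # Copy the other task fields so metadata survives the split
                yield {**task, "text": sentence.strip()}


if __name__ == "__main__":
    examples = [
        {"text": "Short text. Stays whole.", "meta": {"id": 1}},
        {"text": "First long sentence. Second long sentence!", "meta": {"id": 2}},
    ]
    for task in split_long_texts(examples, min_length=30):
        print(task)
```

In my recipe, something like this would presumably have to wrap the incoming stream before the model adds its predictions, which is why I am asking whether a fully custom stream is needed.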