split_sents_threshold setting not working with custom ner.correct

Hello,

I have a custom recipe for ner.correct (code is below), and I would like to be able to configure the threshold for sentence splitting the way that you would in other recipes (with the split_sents and split_sents_threshold settings).

I found some strange behaviour when combining these settings with the --unsegmented flag. If I add the flag to the prodigy command, as below, I get unsegmented text that ignores the split_sents_threshold setting.

python3 load_prodigy_data.py some_customer some_input_table | prodigy ner.correct.custom dataset_name en_core_web_lg - -F recipes.py --label PERSON --customer some_customer -U

If I omit that flag, sentences are split in a way that also ignores the split_sents_threshold setting.

Do I have to implement my own stream in order to configure sentence splitting with a custom ner.correct?

@prodigy.recipe(
    "ner.correct.custom",
    dataset=("Dataset to save annotations to", "positional", None, str),
    spacy_model=("Loadable spaCy model with an entity recognizer", "positional", None, str),
    source=("Data to annotate (file path or '-' to read from standard input)", "positional", None, str),
    api=("DEPRECATED: API loader to use", "option", "a", str),
    loader=("Loader (guessed from file extension if not set)", "option", "lo", str),
    label=("Comma-separated label(s) to annotate or text file with one label per line", "option", "l", get_labels),
    exclude=("Comma-separated list of dataset IDs whose annotations to exclude", "option", "e", split_string),
    unsegmented=("Don't split sentences", "flag", "U", bool),
    # Add custom recipe CLI arguments
    customer=("Name of company customer", "option", "db", str),
)
def custom_correct(
        dataset,
        spacy_model,
        source,
        api=None,
        loader=None,
        label=None,
        exclude=None,
        unsegmented=False,
        customer=None
):
    # Reuse the built-in recipe to get the default components
    components = make_gold(dataset, spacy_model, source, api, loader, label, exclude, unsegmented)

    # Overwrite recipe components returned by the recipe and use custom arguments
    components["db"] = get_rds(customer)
    # Overwrite config settings
    components["config"]["exclude_by"] = "task"
    components["config"] = {**components["config"], **{"host": "0.0.0.0", "split_sents": True, "split_sents_threshold": 1000}}
    # Return recipe components
    return components

Hi! Sorry if the docs were confusing on this – but if you set --unsegmented, the sentence segmenter won't be applied and you'll see whatever comes in. So in that case it's expected that the split_sents_threshold isn't applied.

For the other case: can you check your prodigy.json and see if you maybe have a conflicting setting for the split_sents_threshold in there? Because that would override the recipe default.

I should've mentioned this in the original post, but the prodigy.json actually has
{ "split_sents":true, "split_sents_threshold":2000}

I tried taking the added config out of the recipe too, but I'm still getting sentences that are split way below the 2000-character threshold.

Any update on this? I looked at my prodigy.json, and the sentence-splitting threshold I defined there is way above the size of the fragments I get.

If I turn on --unsegmented, nothing gets split. If I omit that flag and leave the prodigy.json to take over, some inputs don't get split prematurely, whereas others do. I cross-referenced what I see in the Prodigy UI with the outputs of my custom loader to confirm that the splitting is happening in Prodigy, not in my custom loader.

I haven't been able to reproduce this, but I just re-read the thread again and have one follow-up question:

By "below the 2000 character threshold", do you mean, examples with text under 2000 characters ends up split, or split examples end up with fewer than 2000 characters?

Because the latter would be expected – the split_sents_threshold defines the minimum character length a text needs to have to be segmented into sentences. So you could have one example with 1999 characters that gets sent out without segmentation, and then one with 2000 characters that gets segmented into 100 sentences of 20 characters each.

For illustration purposes, here's a simplified version of that function:

import copy
from prodigy.util import set_hashes

def split_sentences(stream, nlp, min_length=False):
    # Pair each text with its original example so the metadata is kept
    tuples = ((eg["text"], eg) for eg in stream)
    for doc, orig_eg in nlp.pipe(tuples, as_tuples=True):
        orig_eg = copy.deepcopy(orig_eg)
        if min_length and len(doc.text) < min_length:
            # Text is below the threshold: send it out unsegmented
            yield orig_eg
        else:
            # Otherwise, send out one example per sentence
            for sent in doc.sents:
                eg = copy.deepcopy(orig_eg)
                eg["text"] = sent.text
                eg = set_hashes(eg, overwrite=True)
                yield eg

If what you're looking for is a preprocessor that always sends out as many sentences as possible at once, but never more than 2000 characters, you could modify the function above: check len(sent.text), concatenate sentences until you hit the threshold, send out an example, and then start again.
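
Here's a rough, untested sketch of that idea. The chunk_sentences name, the max_length parameter and the use of text_with_ws are just made up for illustration, not an existing Prodigy helper:

import copy
from prodigy.util import set_hashes

def chunk_sentences(stream, nlp, max_length=2000):
    # Concatenate consecutive sentences until adding the next one would push
    # the chunk over max_length characters, then send the chunk out as one
    # example. A single sentence longer than max_length still goes out whole.
    tuples = ((eg["text"], eg) for eg in stream)
    for doc, orig_eg in nlp.pipe(tuples, as_tuples=True):
        chunk = ""
        for sent in doc.sents:
            sent_text = sent.text_with_ws
            if chunk and len(chunk) + len(sent_text) > max_length:
                eg = copy.deepcopy(orig_eg)
                eg["text"] = chunk.strip()
                yield set_hashes(eg, overwrite=True)
                chunk = ""
            chunk += sent_text
        if chunk.strip():
            eg = copy.deepcopy(orig_eg)
            eg["text"] = chunk.strip()
            yield set_hashes(eg, overwrite=True)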

That clarifies it. I was expecting input examples with >= 2000 characters to get split up into examples with 2000 characters. I didn't understand that they would get split up into fragments of 20 characters. Thank you!

I read somewhere else that you recommend using smaller fragments to improve model performance. Does the default 20 character fragment size correspond to some kind of optimal context window for your models? I'm not set on 2000 characters (though I believe that a feature that lets you control fragment length more easily would be nice).

Ah, sorry if I phrased that in a confusing way – it doesn't necessarily mean 20, this was just an example I made up. I just meant, however many sentences are within those 2000 characters. So it's totally possible that a large split_sents_threshold still produces very short examples, if the original text is > 2000 characters long, and includes multiple sentences. Those will be split and sent out sentence-by-sentence.

For custom segmentation, adding the sentence splitting to your custom loader would make sense. This gives you full control over how the text is segmented and also makes the process more transparent.
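
Purely as an illustration (I don't know what load_prodigy_data.py looks like, so fetch_rows and the argument handling below are placeholders), a loader that does its own splitting could print one JSON task per line to stdout. You'd then keep - as the source and pass -U so Prodigy doesn't re-split the chunks:

import json
import sys
import spacy

def fetch_rows(customer, table):
    # Placeholder: replace with however load_prodigy_data.py queries the table
    return []

def main(customer, table):
    nlp = spacy.load("en_core_web_lg")
    for row in fetch_rows(customer, table):
        doc = nlp(row["text"])
        for sent in doc.sents:
            print(json.dumps({"text": sent.text, "meta": {"table": table}}))

if __name__ == "__main__":
    main(sys.argv[1], sys.argv[2])

Instead of yielding one sentence per line, you could also reuse the concatenation sketch above to emit chunks up to whatever size you like.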

The split_sents_threshold is really just a fallback. We initially added it as a "safety" strategy, especially when annotating with a model in the loop or getting multiple possible predictions from a model for each given example. If you end up with one really long example in your data by accident, this could otherwise crash the model and cause you to run out of memory. So setting a threshold means that in the worst-case scenario, your text will be split into sentences so your model doesn't choke on it.

Hey, just thought I'd post an update on this solution... it was working great for a while, but I've run into a difficulty.

I'm not sure if you recently pushed a change to the parser that breaks a piece of text down into sentences, but (using the sentence-splitting solution above) I'm now ending up with samples that are different from the ones I had a couple of weeks ago for the same input text. My guess is that some change got pushed, changing how the samples got decomposed, which in turn changed what text got concatenated into a sample before it reached the cut-off size of 1000 characters.

This threw off the hashing, which is no biggie in the short term... I can always recreate new tables that omit a subset of the data I've already annotated. If I'm correct, though, this does pose a long-term problem. I may try a custom hashing implementation, or drop rows from my input dataset in my recipe, as a workaround.
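
For what it's worth, the "drop rows in my recipe" idea might look something like the untested sketch below. The filter_seen_inputs helper and the dataset argument are made up, and since it hashes the raw text, it only skips exact text matches and would still miss text that was re-chunked with different boundaries:

from prodigy.components.db import connect
from prodigy.util import set_hashes

def filter_seen_inputs(stream, dataset):
    # Look up the input hashes already stored for the dataset and skip any
    # incoming example whose text hashes to one of them
    db = connect()
    seen = set(db.get_input_hashes(dataset))
    for eg in stream:
        eg = set_hashes(eg, input_keys=("text",), overwrite=True)
        if eg["_input_hash"] not in seen:
            yield eg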