Custom NER recipe doesn't work with patterns

Hi!
I am using a custom recipe for NER annotation because it lets me plug in a custom tokenizer. It works fine when I don't feed patterns into it, but it doesn't load anything when I do.
Here's a recipe:

from typing import List, Optional

import prodigy
import spacy
from prodigy.components.loaders import get_stream
from prodigy.components.preprocess import add_tokens
from prodigy.models.matcher import PatternMatcher
from prodigy.util import log, split_string

from my_tokenizer import CTokenizer  # your custom tokenizer (module name assumed)


@prodigy.recipe(
    "ner.custom",
    dataset=("The dataset to use", "positional", None, str),
    spacy_model=("The base model", "positional", None, str),
    source=("The source data as a CSV or JSONL file", "positional", None, str),
    label=("One or more comma-separated labels", "option", "l", split_string),
    patterns=("Optional match patterns", "option", "p", str),
    exclude=("Names of datasets to exclude", "option", "e", split_string),
)
def ner_manual(
    dataset: str,
    spacy_model: str,
    source: str,
    label: Optional[List[str]] = None,
    patterns: Optional[str] = None,
    exclude: Optional[List[str]] = None,
):
    """
    Mark spans manually by token. Requires only a tokenizer and no entity
    recognizer, and doesn't do any active learning.
    """
    log("RECIPE: Starting recipe ner.custom", locals())
    # Load the spaCy model and swap in the custom tokenizer
    nlp = spacy.load(spacy_model)
    log(f"RECIPE: Loaded model {spacy_model}")
    nlp.tokenizer = CTokenizer(nlp.vocab)
    log("RECIPE: Using custom tokenizer")

    # Load the stream from a CSV or JSONL file and return a generator that
    # yields a dictionary for each example in the data.
    loader = "csv" if source.endswith(".csv") else "jsonl"
    stream = get_stream(source, None, loader, rehash=True, dedup=True, input_key="text")

    if patterns is not None:
        pattern_matcher = PatternMatcher(nlp, combine_matches=True, all_examples=True)
        pattern_matcher = pattern_matcher.from_disk(patterns)
        stream = (eg for _, eg in pattern_matcher(stream))

    # Tokenize the incoming examples and add a "tokens" property to each
    # example. Also handles pre-defined selected spans. Tokenization allows
    # faster highlighting, because the selection can "snap" to token boundaries.
    stream = add_tokens(nlp, stream)

    return {
        "view_id": "ner_manual",  # Annotation interface to use
        "dataset": dataset,  # Name of dataset to save annotations
        "stream": stream,  # Incoming stream of examples
        "exclude": exclude,  # List of dataset names to exclude
        "config": {  # Additional config settings, mostly for app UI
            "lang": nlp.lang,
            "labels": label,  # Selectable label options
            "exclude_by": "input",
        },
    }

So this works:

PRODIGY_LOGGING=verbose python3 -m prodigy ner.custom my_dataset xx_ent_wiki_sm my_file.csv -F ner_custom.py --label my_labels

And this works, but it doesn't suit me because it uses the default tokenisation:

PRODIGY_LOGGING=verbose python3 -m prodigy ner.manual my_dataset xx_ent_wiki_sm my_file.csv --label my_labels --patterns my_patterns.jsonl

While this doesn't:

PRODIGY_LOGGING=verbose python3 -m prodigy ner.custom my_dataset xx_ent_wiki_sm my_file.csv -F ner_custom.py --label my_labels --patterns my_patterns.jsonl

With the following logging (screenshot attached):

Any help much appreciated
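
For reference, the patterns file follows the usual Prodigy JSONL format: one JSON object per line, each with a "label" and a "pattern" (a string for exact matches, or a list of token-attribute dicts for token-based matches). A sketch of how such a file could be written (the labels and patterns here are made up):

```python
import json

# Hypothetical patterns -- replace with your own labels and patterns
patterns = [
    {"label": "PERSON", "pattern": [{"lower": "john"}, {"lower": "smith"}]},
    {"label": "ORG", "pattern": "Acme Corp"},
]

with open("my_patterns.jsonl", "w", encoding="utf8") as f:
    for p in patterns:
        f.write(json.dumps(p) + "\n")
```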

What do you mean by "doesn't load anything"? Does it show you "no tasks available"?

It shows "Something went wrong".

And is there an error in the terminal? Typically, that should tell you the reason the server stopped.

The last logs are in the screenshot which I've attached to the question. Nothing more was printed out.

I managed to solve the problem of packaging a model with a custom tokenizer, so that part is no longer an issue. But when I tried to use the packaged model in the ner.manual recipe, the same problem occurs as with the custom recipe above.
The following command

python3 -m prodigy ner.manual my_dataset my_custom_packaged_model my_file.csv --label my_labels --patterns my_patterns.jsonl

runs but results in "Oops, something went wrong :("
with this logging, which shows no particular error:

INFO:     198.16.66.155:30881 - "GET /project HTTP/1.1" 200 OK
22:06:58: POST: /get_session_questions
22:06:58: FEED: Finding next batch of questions in stream
22:06:58: PREPROCESS: Tokenizing examples
22:06:58: FILTER: Filtering duplicates from stream
{'by_input': True, 'by_task': True, 'stream': <generator object at 0x7f1bdd970948>, 'warn_fn': <bound method Printer.warn of <wasabi.printer.Printer object at 0x7f1bfecf3080>>, 'warn_threshold': 0.4}

22:06:58: FILTER: Filtering out empty examples for key 'text'

Hmm, strange :thinking: Which version of Prodigy are you using? And if you open the developer console in your browser, can you see an error message there?

I'm using Prodigy 1.9.9.
Twice,

23:50:37: RESPONSE: /get_session_questions (10 examples)

was printed (followed by the examples), but nothing appeared in the interface, and then:

Invalid HTTP request received.
INFO:     208.91.109.18:56387 - "HEAD /robots.txt HTTP/1.0" 404 Not Found
Invalid HTTP request received.
INFO:     78.133.45.248:33816 - "GET / HTTP/1.1" 200 OK

I also tried to use this model with the train ner recipe, and it gets stuck at the

result = self._query(query)
✔ Loaded model 'xx_my_model_name'

stage; nothing has happened for another 10 minutes now, without any error message or any other output at all.

python3 -m prodigy train ner my_dataset xx_ent_wiki_sm --output my_best_model --eval-split 0.2 --n-iter 5 --batch-size 5 

works fine, while

python3 -m prodigy train ner my_dataset xx_my_model_name --output my_best_model --eval-split 0.2 --n-iter 5 --batch-size 5 

breaks with

  result = self._query(query)
✔ Loaded model 'xx_my_model_name'
Created and merged data for 22 total examples
Using 18 train / 4 eval (split 20%)
Component: ner | Batch size: 5 | Dropout: 0.2 | Iterations: 5
Traceback (most recent call last):
  File "/usr/lib/python3.6/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/usr/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home/me/prodigy_/prod/lib/python3.6/site-packages/prodigy/__main__.py", line 60, in <module>
    controller = recipe(*args, use_plac=True)
  File "cython_src/prodigy/core.pyx", line 213, in prodigy.core.recipe.recipe_decorator.recipe_proxy
  File "/home/me/prodigy_/prod/lib/python3.6/site-packages/plac_core.py", line 367, in call
    cmd, result = parser.consume(arglist)
  File "/home/me/prodigy_/prod/lib/python3.6/site-packages/plac_core.py", line 232, in consume
    return cmd, self.func(*(args + varargs + extraopts), **kwargs)
  File "/home/me/prodigy_/prod/lib/python3.6/site-packages/prodigy/recipes/train.py", line 149, in train
    baseline = nlp.evaluate(eval_data)
  File "/home/me/prodigy_/prod/lib/python3.6/site-packages/spacy/language.py", line 692, in evaluate
    gold = GoldParse(doc, **gold)
  File "gold.pyx", line 808, in spacy.gold.GoldParse.__init__
IndexError: list index out of range
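
(For context: an IndexError from GoldParse during evaluation typically means an annotated entity span no longer lines up with the model's token boundaries, e.g. because the annotations were created with a different tokenizer. A minimal, pure-Python sketch of that alignment check, with made-up offsets:)

```python
def misaligned_spans(tokens, spans):
    """Return spans whose character offsets don't snap to token boundaries.

    tokens: list of dicts with "start"/"end" character offsets, as stored
            in Prodigy's "tokens" property.
    spans:  list of dicts with "start"/"end", as in Prodigy's "spans".
    """
    starts = {t["start"] for t in tokens}
    ends = {t["end"] for t in tokens}
    return [s for s in spans if s["start"] not in starts or s["end"] not in ends]


# Example: "New York" kept as a single token by a custom tokenizer
tokens = [{"start": 0, "end": 8, "text": "New York"}]
spans_ok = [{"start": 0, "end": 8, "label": "GPE"}]
spans_bad = [{"start": 0, "end": 3, "label": "GPE"}]  # ends mid-token

print(misaligned_spans(tokens, spans_ok))   # []
print(misaligned_spans(tokens, spans_bad))  # the span ending mid-token
```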

I think there are several, potentially unrelated things going on here.

If you open the developer console in your browser, does it show something there?

That sounds like there's something very unexpected in the data here. What's in my_dataset and where does the data come from? Maybe try exporting it with db-out and see if there's anything in there that looks suspicious? For instance, do you have multiple examples of the same text with different tokenizations in there?
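
As a sketch of that last check, assuming the dataset was exported with something like db-out my_dataset > exported.jsonl (the file here is a made-up stand-in): group the examples by their text and flag any text that appears with more than one tokenization.

```python
import json
from collections import defaultdict

# Made-up stand-in for `db-out` output: the same text annotated twice,
# with two different tokenizations.
examples = [
    {"text": "New York", "tokens": [{"start": 0, "end": 8}]},
    {"text": "New York", "tokens": [{"start": 0, "end": 3}, {"start": 4, "end": 8}]},
    {"text": "Berlin", "tokens": [{"start": 0, "end": 6}]},
]
with open("exported.jsonl", "w", encoding="utf8") as f:
    for eg in examples:
        f.write(json.dumps(eg) + "\n")

def conflicting_tokenizations(path):
    """Map each text to its distinct tokenizations, keeping only conflicts."""
    by_text = defaultdict(set)
    with open(path, encoding="utf8") as f:
        for line in f:
            eg = json.loads(line)
            tokens = tuple((t["start"], t["end"]) for t in eg.get("tokens", []))
            by_text[eg["text"]].add(tokens)
    return {text: toks for text, toks in by_text.items() if len(toks) > 1}

conflicts = conflicting_tokenizations("exported.jsonl")
print(sorted(conflicts))  # ['New York']
```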

I will check this, thank you!