Hi!
I'm using a custom recipe for NER annotations, since this way I can plug in a custom tokenizer. It works fine when I don't feed patterns into it, but doesn't load anything when I do.
Here's the recipe:
from typing import List, Optional

import prodigy
import spacy
from prodigy.components.loaders import get_stream
from prodigy.components.preprocess import add_tokens
from prodigy.models.matcher import PatternMatcher
from prodigy.util import log, split_string

# CTokenizer (my custom tokenizer) is defined elsewhere in this file


@prodigy.recipe(
    "ner.custom",
    dataset=("The dataset to use", "positional", None, str),
    spacy_model=("The base model", "positional", None, str),
    source=("The source data as a JSONL or CSV file", "positional", None, str),
    label=("One or more comma-separated labels", "option", "l", split_string),
    patterns=("Optional match patterns", "option", "p", str),
    exclude=("Names of datasets to exclude", "option", "e", split_string),
)
def ner_manual(
    dataset: str,
    spacy_model: str,
    source: str,
    label: Optional[List[str]] = None,
    patterns: Optional[str] = None,
    exclude: Optional[List[str]] = None,
):
    """
    Mark spans manually by token. Requires only a tokenizer and no entity
    recognizer, and doesn't do any active learning.
    """
    log("RECIPE: Starting recipe ner.custom", locals())
    # Load the spaCy model for tokenization and swap in the custom tokenizer
    nlp = spacy.load(spacy_model)
    log(f"RECIPE: Loaded model {spacy_model}")
    nlp.tokenizer = CTokenizer(nlp.vocab)
    log("RECIPE: Tokenizing")
    # Load the stream from a CSV or JSONL file and return a generator that
    # yields a dictionary for each example in the data.
    loader = "csv" if source.endswith(".csv") else "jsonl"
    stream = get_stream(source, None, loader, rehash=True, dedup=True, input_key="text")
    if patterns is not None:
        pattern_matcher = PatternMatcher(nlp, combine_matches=True, all_examples=True)
        pattern_matcher = pattern_matcher.from_disk(patterns)
        # PatternMatcher yields (score, example) tuples; keep just the examples
        stream = (eg for _, eg in pattern_matcher(stream))
    # Tokenize the incoming examples and add a "tokens" property to each
    # example. Also handles pre-defined selected spans. Tokenization allows
    # faster highlighting, because the selection can "snap" to token boundaries.
    stream = add_tokens(nlp, stream)
    return {
        "view_id": "ner_manual",  # Annotation interface to use
        "dataset": dataset,  # Name of dataset to save annotations
        "stream": stream,  # Incoming stream of examples
        "exclude": exclude,  # List of dataset names to exclude
        "config": {  # Additional config settings, mostly for app UI
            "lang": nlp.lang,
            "labels": label,  # Selectable label options
            "exclude_by": "input",
        },
    }
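For context, the real CTokenizer is defined elsewhere in ner_custom.py. If you want to reproduce this without it, a minimal whitespace-based stand-in like this (a simplified sketch, not my actual class) is enough to run the recipe:

from spacy.tokens import Doc

class CTokenizer:
    # Simplified, hypothetical stand-in -- the real tokenizer is more involved
    def __init__(self, vocab):
        self.vocab = vocab

    def __call__(self, text):
        # Split on whitespace and build a Doc from the resulting words
        words = text.split()
        return Doc(self.vocab, words=words)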
So this works:
PRODIGY_LOGGING=verbose python3 -m prodigy ner.custom my_dataset xx_ent_wiki_sm my_file.csv -F ner_custom.py --label my_labels
And this works, but it doesn't suit me because it uses the default tokenisation instead of my custom tokenizer:
PRODIGY_LOGGING=verbose python3 -m prodigy ner.manual my_dataset xx_ent_wiki_sm my_file.csv --label my_labels --patterns my_patterns.jsonl
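For reference, my_patterns.jsonl is in the standard Prodigy patterns format, one JSON object per line. A simplified example of the kind of entry it contains (not one of my real patterns):

{"label": "MY_LABEL", "pattern": [{"lower": "new"}, {"lower": "york"}]}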
While this doesn't:
PRODIGY_LOGGING=verbose python3 -m prodigy ner.custom my_dataset xx_ent_wiki_sm my_file.csv -F ner_custom.py --label my_labels --patterns my_patterns.jsonl
With the following logging:
Any help much appreciated
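P.S. In case it helps with reproducing: the pattern-matching step can be exercised on its own, outside the recipe. This is a minimal sketch using the same PatternMatcher settings as in the recipe; the example text is just a placeholder:

import spacy
from prodigy.models.matcher import PatternMatcher

nlp = spacy.load("xx_ent_wiki_sm")
nlp.tokenizer = CTokenizer(nlp.vocab)  # same custom tokenizer as in the recipe
matcher = PatternMatcher(nlp, combine_matches=True, all_examples=True)
matcher = matcher.from_disk("my_patterns.jsonl")
# PatternMatcher yields (score, example) tuples
examples = [{"text": "some placeholder text"}]
for score, eg in matcher(examples):
    print(score, eg.get("spans"))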