Error with pos.batch-train

Hi @ines/@honnibal,
I trained a PROPN tag for my dataset because I would like the model to recognize certain network entities as proper nouns. Since I had to use my own tokenization scheme, I copied the pos.teach recipe from Prodigy into my own recipe file and edited it to use my custom tokenizer.

When I use pos.batch-train to train the model, it fails with an error on the first iteration of the run. I looked on the forums, but the only other issue with a similar error was one that was fixed in Prodigy 1.5.1.

[Abhishek:~] [NM-NLP] $ prodigy pos.batch-train net_pos_tags en_core_web_sm --output-model /tmp/models/net_labels_ner --eval-split 0.2

Loaded model en_core_web_sm
Using 20% of accept/reject examples (110) for evaluation
Using 100% of remaining examples (445) for training
Dropout: 0.2  Batch size: 4  Iterations: 10


BEFORE     0.169
Correct    13
Incorrect  64
Unknown    1298


#          LOSS       RIGHT      WRONG      ACCURACY
Traceback (most recent call last):
  File "/usr/local/Cellar/python/3.6.4_3/Frameworks/Python.framework/Versions/3.6/lib/python3.6/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/usr/local/Cellar/python/3.6.4_3/Frameworks/Python.framework/Versions/3.6/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/Users/Abhishek/Projects/Python-Projects/Python-VEs/NM-NLP/lib/python3.6/site-packages/prodigy/__main__.py", line 259, in <module>
    controller = recipe(*args, use_plac=True)
  File "cython_src/prodigy/core.pyx", line 253, in prodigy.core.recipe.recipe_decorator.recipe_proxy
  File "/Users/Abhishek/Projects/Python-Projects/Python-VEs/NM-NLP/lib/python3.6/site-packages/plac_core.py", line 328, in call
    cmd, result = parser.consume(arglist)
  File "/Users/Abhishek/Projects/Python-Projects/Python-VEs/NM-NLP/lib/python3.6/site-packages/plac_core.py", line 207, in consume
    return cmd, self.func(*(args + varargs + extraopts), **kwargs)
  File "/Users/Abhishek/Projects/Python-Projects/Python-VEs/NM-NLP/lib/python3.6/site-packages/prodigy/recipes/pos.py", line 236, in batch_train
    drop=dropout)
  File "cython_src/prodigy/models/pos.pyx", line 90, in prodigy.models.pos.Tagger.batch_train
  File "cython_src/prodigy/models/pos.pyx", line 136, in prodigy.models.pos.Tagger.update
  File "cython_src/prodigy/models/pos.pyx", line 156, in prodigy.models.pos.Tagger.inc_gradient
  File "cython_src/prodigy/models/pos.pyx", line 164, in prodigy.models.pos.Tagger._multilabel_log_loss
IndexError: index -3 is out of bounds for axis 0 with size 2

Section of code that I changed in the recipe:
BEFORE:

    if tag_map is not None:
        tag_map = get_tag_map(tag_map)
    model = Tagger(spacy.load(spacy_model), label=label, tag_map=tag_map)

AFTER:

    if tag_map is not None:
        tag_map = get_tag_map(tag_map)
    nlp = spacy.load(spacy_model)
    nlp.tokenizer = custom_tokenizer(nlp)
    model = Tagger(nlp, label=label, tag_map=tag_map)

Thanks in advance for your help.

Just to confirm: Without your modification, you don’t see the error?

@ines, that is correct. Without custom tokenization, the recipes work without errors.

Just in case you need it, tokenizer code:

import re

from spacy.tokenizer import Tokenizer

# Strip leading brackets/quotes, trailing brackets/quotes/commas,
# and split tokens on "~".
prefix_re = re.compile(r'''^[\["']''')
suffix_re = re.compile(r'''[\]"',]$''')
infix_re = re.compile(r'''[~]''')


def custom_tokenizer(nlp):
    return Tokenizer(nlp.vocab,
                     prefix_search=prefix_re.search,
                     suffix_search=suffix_re.search,
                     infix_finditer=infix_re.finditer)
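In case it helps anyone debugging similar patterns: the regexes can be sanity-checked outside spaCy with the stdlib alone. This is just a sketch of what each pattern matches (the `eth0`-style token strings are made-up examples, not from my data):

```python
import re

# Same patterns as in the tokenizer above.
prefix_re = re.compile(r'''^[\["']''')   # leading [ " '
suffix_re = re.compile(r'''[\]"',]$''')  # trailing ] " ' ,
infix_re = re.compile(r'''[~]''')        # ~ inside a token

print(bool(prefix_re.search('"eth0')))                    # leading quote -> match
print(bool(suffix_re.search('eth0,')))                    # trailing comma -> match
print([m.group() for m in infix_re.finditer('eth0~eth1')])  # one infix hit
```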

@ines,
Sorry about this thread. There was a naming error in my recipe file. I was working with just three recipes in one file (ner.teach, pos.teach, pos.make_gold), and I mistakenly did not rename the teach recipes to differentiate ner.teach from pos.teach. I think the problem was that ner.teach was the first recipe Prodigy encountered under that name, so the examples were annotated with PROPN as an NER label.

Subsequently, when I used pos.batch-train, it failed on the unknown labels (or at least that is my guess as to what happened).

I fixed the naming issue and tested it out with custom tokenization and it seems to work fine for now. Will post back in case any other errors occur.

Sorry about the confusion.

Thanks for updating with your solution and glad you got it working!