Use custom tokenizer in data-to-spacy

I have a follow-up question to this topic. I used data-to-spacy with the following parameters:

prodigy data-to-spacy output --ner my_dataset --ner-missing --base-model output/my_model/model-last -F functions.py

I'm including --base-model because my data was collected using a custom tokenizer. I used -F to point prodigy to that tokenizer although I didn't see this parameter documented anywhere.

I got 727 training examples and 285 evaluation examples from this.
When I then train a spaCy model on this data, I get poor results: an F1 score of 0.3 and rising TOK2VEC and NER losses, which suggests something is very wrong.

I was able to train a spacy model with similar data collected using the ner_manual prodigy view. We collected about 800 examples and got 0.7 F1 with diminishing TOK2VEC and NER loss.

Both models use very similar config files, with word vectors and a tok2vec layer pretrained on a large dataset.

I suspect that data-to-spacy is somehow not picking up on my custom_tokenizer, since when I run data-to-spacy without --base-model or -F (see below) I get similar performance (0.3 F1 with increasing losses):

prodigy data-to-spacy output --ner my_dataset --ner-missing

I'm not sure how to debug further. How can I verify if my tokenizer is being used or what parameter should I be using to pass the custom tokenizer?

Hi! The -F flag is typically just used by Prodigy to point to custom recipes, and it's not officially used by the data-to-spacy workflow. That said, I do wonder if it happens to work by accident in this case, because all it really does is import the Python module under the hood. In any case, we probably need a more explicit --code argument here, similar to how spaCy handles it for training.

Also, assuming that your base model uses a custom tokenizer defined in the config which you provide via the Python file, you would have otherwise seen an error if the function wasn't found (because without it, the base model couldn't load).

I just had a look and it seems like you're right and the generated config currently doesn't port over the tokenizer. As an experiment, you can run prodigy stats to find the location of your Prodigy installation and find the generate_config function in recipes/train.py. It should include the following lines:

    config = init_config(lang=lang, pipeline=pipes, optimize=optimize, gpu=gpu)
    if base_nlp is not None and base_name is not None:

After that, it'll port over the pipeline components from your base model, but before that, you could try adding the following and see if it resolves the problem:

    config["nlp"]["tokenizer"] = base_nlp.config["nlp"]["tokenizer"]

Yes, forgot to mention, without -F, I get an error saying the tokenizer isn't defined.

The code in generate_config is not exactly as you described. I see:

if base_nlp is not None and base_name is not None:
    config = base_nlp.config
    ...
else:
    config = init_config(lang=lang, pipeline=pipes, optimize=optimize, gpu=gpu)
return config

Anyway, I tried adding the line in and I didn't see any change in performance.

FYI: I'm on prodigy v1.11.0a7

If I understand correctly, when passing --base-model, the config will be completely copied from base_nlp. But I'm not sure how to properly import my tokenizer. As you said, -F seems to mysteriously stop data-to-spacy from complaining that it can't find the tokenizer, but I'm still seeing poor results.

Hi Noah,

We're currently fixing how the sourcing from a base_model works, and will have a look specifically at a custom tokenizer and importing custom code.

In the meantime, just a few more ideas:

I suspect that data-to-spacy is somehow not picking up on my custom_tokenizer

You should be able to check this by manually inspecting the config file that is generated by the data-to-spacy command. If your custom tokenizer is not there, you can still change it before you run spacy train. In fact, you can probably try swapping in the entire config you used earlier (the one that got 70% F1) and see how that works, just to determine whether the decreased performance is caused by a change in the config or by a difference in the actual data in the .spacy files.
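For anyone who wants to script this check, here is a minimal sketch using only the Python standard library. It assumes the generated config has the usual `[nlp.tokenizer]` block with an `@tokenizers` entry; the function name `tokenizer_setting` and the registered name `my_custom_tokenizer.v1` are made up for illustration and are not taken from this thread.

```python
import configparser


def tokenizer_setting(config_text):
    """Return the tokenizer registered in a spaCy-style config, or None."""
    # Interpolation is disabled so values like ${paths.train} are read as-is.
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(config_text)
    if not parser.has_section("nlp.tokenizer"):
        return None
    value = parser.get("nlp.tokenizer", "@tokenizers", fallback=None)
    # Values in the config are quoted strings, so strip the quotes.
    return value.strip('"') if value is not None else None


# Hypothetical config excerpts, for illustration only.
default_cfg = """
[nlp]
lang = "en"

[nlp.tokenizer]
@tokenizers = "spacy.Tokenizer.v1"
"""
custom_cfg = default_cfg.replace("spacy.Tokenizer.v1", "my_custom_tokenizer.v1")

print(tokenizer_setting(default_cfg))
print(tokenizer_setting(custom_cfg))
```

To run this on a real file, read its contents first (for example with `pathlib.Path(...).read_text()`) and pass the text in. If the printed value is the default tokenizer rather than your own, the custom tokenizer was not ported into the generated config.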

Thanks @SofieVL for the update. Any idea on timeline for the fix? I'm trying to plan how to move forward with our project.

I'm already using the same config that I was using earlier. To remove any doubt that the data is bad, I generated binary data from my original dataset (where I got 70% F1). I did this by splitting documents that had multiple spans and adding new randomly generated spans as "rejected" spans. I checked to make sure the "rejected" spans weren't in the original "accepted" spans. I would expect this task to be easier than a standard dataset since many of the rejected spans are nonsensical. Still, I got the same poor results.
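For readers following along, the conversion described above could look roughly like the sketch below. This is not the script that was actually used, which isn't shown in this thread. It assumes Prodigy-style task dicts with "text", "tokens" and "spans" keys, and the helper name `to_binary_examples` and its parameters are hypothetical.

```python
import random


def to_binary_examples(example, labels, n_reject=2, max_len=3, seed=0):
    """Split one annotated example into single-span accept/reject examples."""
    rng = random.Random(seed)
    text, tokens = example["text"], example["tokens"]
    accepted = {(s["start"], s["end"]) for s in example["spans"]}

    # One "accept" example per original span.
    out = [{"text": text, "spans": [s], "answer": "accept"} for s in example["spans"]]

    # Candidate rejects: every token-aligned span of up to max_len tokens
    # whose offsets do not match an accepted span.
    candidates = []
    for i in range(len(tokens)):
        for j in range(i, min(i + max_len, len(tokens))):
            offsets = (tokens[i]["start"], tokens[j]["end"])
            if offsets not in accepted:
                candidates.append(offsets)

    # Sample random rejects and give each a randomly chosen label.
    for start, end in rng.sample(candidates, min(n_reject, len(candidates))):
        span = {"start": start, "end": end, "label": rng.choice(labels)}
        out.append({"text": text, "spans": [span], "answer": "reject"})
    return out


# Made-up example data, for illustration only.
example = {
    "text": "Apple hired Noah in Berlin",
    "tokens": [
        {"text": "Apple", "start": 0, "end": 5, "id": 0},
        {"text": "hired", "start": 6, "end": 11, "id": 1},
        {"text": "Noah", "start": 12, "end": 16, "id": 2},
        {"text": "in", "start": 17, "end": 19, "id": 3},
        {"text": "Berlin", "start": 20, "end": 26, "id": 4},
    ],
    "spans": [
        {"start": 0, "end": 5, "label": "ORG"},
        {"start": 20, "end": 26, "label": "GPE"},
    ],
}
binary = to_binary_examples(example, labels=["ORG", "GPE"])
for eg in binary:
    print(eg["answer"], eg["spans"][0])
```

Most rejected spans produced this way are arbitrary token sequences, which is why the resulting task would be expected to be easier than one built from real binary annotations.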

I think we've gotten a few things mixed up in this thread. It wasn't clear to me that you were using data from ner.teach. Updating from this kind of "binary annotation" works a bit differently in spaCy v3, and it's one of the last things we need to finalize for the Prodigy nightly release. So it is not entirely unexpected that this isn't working well yet; see the work in progress here: https://github.com/explosion/spaCy/pull/8106

None of this has anything to do with the custom tokenizer and the data-to-spacy command, though. We'll still fix the latter as discussed, of course.

I can't comment on the timeline at this point, but both of these things are high up on our TODO list and actively being worked on as we speak.

Thank you!

Hi Noah,

One more question out of curiosity:

I'm including --base-model because my data was collected using a custom tokenizer.

How did you originally collect the data with Prodigy? What was the command you ran and how did you import your custom tokenizer for annotation?

Sure. Here's the command I ran:

prodigy my-recipe my_dataset --view-id ner -F functions.py

I simplified the command I ran by removing custom parameters I created in my-recipe. Those parameters essentially tell the recipe where to find the data and the model to use to tokenize. That model was trained using my custom tokenizer beforehand. functions.py contains the custom tokenizer.

Ok, thanks! That explains it. Like Ines said, the -F flag wasn't originally designed for this, but it does work in this context. I think it would be useful and consistent to also support the --code flag for these cases, as spaCy does.