How to use customized spaCy model in Prodigy?

Hello :slight_smile:,

I am trying to use Prodigy for NER annotation. First, I loaded the spaCy model de_core_news_lg with nlp = spacy.load('de_core_news_lg') and modified its tokenizer by adding special cases; I also added a custom_sentencizer to split sentences on a character of my choice. I then removed the ner pipe from the pipeline and finally saved the model to a directory with nlp.to_disk(path='tmp/my-model'). Then I ran Prodigy from the terminal with:

prodigy ner.manual data_set tmp/my-model ./dataset.jsonl --label ABSTRACTCLASS,IDENTIFIER,NUM,VALUE,UNIT --patterns ./ner_pattern.jsonl
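The customization steps look roughly like this (a sketch only: a blank German pipeline stands in for de_core_news_lg here, so the remove_pipe step is shown as a comment, and the custom_sentencizer details are left out):

```python
import spacy
from pathlib import Path

# spacy.blank("de") stands in for de_core_news_lg in this sketch
nlp = spacy.blank("de")

# special case: keep "-->" as a single token
nlp.tokenizer.add_special_case("-->", [{"ORTH": "-->"}])

# (custom_sentencizer would be added to the pipeline here)

# with the full model: nlp.remove_pipe("ner")

Path("tmp/my-model").mkdir(parents=True, exist_ok=True)
nlp.to_disk("tmp/my-model")

# reload and confirm the special case survived serialization
nlp2 = spacy.load("tmp/my-model")
print([t.text for t in nlp2("a --> b")])  # ['a', '-->', 'b']
```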

Prodigy still does not recognize my added special cases of tokenization as tokens.

So, is my approach right, or did I miss something? :confused:
thanks in advance,

hi @Rashid,

Thanks for your question.

Are you 100% sure that your model tmp/my-model is doing what you intended?

For example, what do you get if you run:

import spacy
nlp = spacy.load("tmp/my-model")

Also, to check that your tokenizer works, can you run the following (replacing the text with a sentence that contains your special tokenization cases)?

# using en_core_web_sm
nlp.tokenizer.explain("Here's a sentence I want to confirm the tokenizer does what I want.")
[('TOKEN', 'Here'), ('SUFFIX', "'s"), ('TOKEN', 'a'), ('TOKEN', 'sentence'), ('TOKEN', 'I'), ('TOKEN', 'want'), ('TOKEN', 'to'), ('TOKEN', 'confirm'), ('TOKEN', 'the'), ('TOKEN', 'tokenizer'), ('TOKEN', 'does'), ('TOKEN', 'what'), ('TOKEN', 'I'), ('TOKEN', 'want'), ('SUFFIX', '.')]

Prodigy's recipes like ner.manual simply load the pipeline with spacy.load(). So if something is off with your spaCy pipeline, that's where the problem will be. You may also need spacy assemble to make sure the pipeline is assembled correctly.

If you need help with customizing your spaCy pipeline, I'd recommend the spaCy GitHub discussions. That's where the spaCy core team can help.

Hello @ryanwesslen,

thanks for your reply.
Just to make it clear, I added some special cases this way:

nlp.tokenizer.add_special_case('-->', [{"ORTH": '-->'}]) 

shape_patterns = ["Xdd-Xddd", "Xdd-XXd", "Xdd-Xd", "Xdd-Xd:d", "Xdd-Xd:dd", "Xdd-XXd.d", "Xdd-XXdd.d", "Xdd-Xd:d", "Xdd-Xdd", "-dXd", "Xdd-Xdd:d"]
for pattern in shape_patterns:
    nlp.tokenizer.add_special_case(pattern, [{"ORTH": pattern}])

I tested my-model using nlp.tokenizer.explain and it gives me the results I want, e.g. ('SPECIAL-1', '-->') and ('TOKEN', 'E21-Y01').

If I print the pipe names using nlp.pipe_names, I get:

['tok2vec', 'tagger', 'morphologizer', 'parser', 'lemmatizer', 'attribute_ruler']

Any suggestion?

Thanks in advance.


hi @Rashid,

Thanks for the background.

I'm still scratching my head, because Prodigy recipes simply load the spaCy pipeline as you normally would with spacy.load().

I've written this small script:

from prodigy.components.preprocess import add_tokens
import spacy
import srsly

# replace with your custom model
nlp = spacy.load("en_core_web_sm")
# can replace with example with input file: input.jsonl
stream = [{"text": "Hello world"}, {"text": "Another text"}]
# stream = srsly.read_jsonl("input.jsonl")
stream = add_tokens(nlp, stream, skip=True)

# print only 1 example
print(list(stream)[0])

# if you want to save your file with tokens
# srsly.write_jsonl("my_tokens.jsonl", list(stream))

Running it prints:

$ python scripts/
{'text': 'Hello world', 'tokens': [{'text': 'Hello', 'start': 0, 'end': 5, 'id': 0, 'ws': True}, {'text': 'world', 'start': 6, 'end': 11, 'id': 1, 'ws': False}]}

Can you run this to see whether your model/examples get tokenized? If it tokenizes as you expect, you can uncomment those lines to save a file with your input data plus the tokens.

You can then run your Prodigy command using the file with tokens (pre-tokenized) as your source input, since Prodigy will use existing tokens first.
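For example, reusing the command from earlier in the thread, just pointing the source at the pre-tokenized file (keep or drop the --patterns argument as you prefer):

```shell
prodigy ner.manual data_set tmp/my-model ./my_tokens.jsonl --label ABSTRACTCLASS,IDENTIFIER,NUM,VALUE,UNIT
```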

Does this work?

Hi again @ryanwesslen

your code snippet worked on my data and tokenized correctly.

I am trying to save the data using srsly.write_jsonl("my_tokens.jsonl", stream) but I am getting an empty jsonl file :confused:

Thanks again :slight_smile:

Hi @Rashid,

Ah yes. If you run write_jsonl, make sure to comment out the print(list(stream)[0]) line first. stream is a generator, and calling list(stream) consumes it, so by the time write_jsonl runs the generator is already empty.


# print(list(stream)[0])

srsly.write_jsonl("my_tokens.jsonl", stream)
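If you want to both inspect and save the examples, materialize the stream exactly once and reuse the list (a tiny illustration of the pitfall with a plain generator, no spaCy needed):

```python
# a stand-in for the token stream: any generator behaves this way
stream = ({"text": t} for t in ["Hello world", "Another text"])

examples = list(stream)   # consume the generator exactly once
print(examples[0])        # safe: inspect the materialized list
print(len(list(stream)))  # 0 -- the generator is now exhausted

# srsly.write_jsonl("my_tokens.jsonl", examples)  # write the list, not the spent generator
```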

Thank you for your reply,

Now I am facing another problem. I followed your last suggestion, created the my_tokens.jsonl file, and then trained with Prodigy. After training, Prodigy saves a model (with ner and tok2vec components) to the output directory, but the custom tokenizer from my model is no longer part of that saved model.
In other words, I now have two models: the first is mine with the modified tokenizer (the added special cases), and the second is the one saved by Prodigy. The Prodigy-saved model does a good job labeling the named entities, but its tokenization does not match my custom tokenizer.
I tried adding the ner component to my model so that it runs after my tokenizer, but I got false labels.

I also tried it the other way around, adding the tokenizer from my model to the model created by Prodigy, but I still get incorrect results.
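The tokenizer swap I attempted looks roughly like this (blank pipelines stand in for the two models here, since the real paths are local):

```python
import spacy

# stand-ins: "trained" for the model Prodigy saved, "custom" for tmp/my-model
trained = spacy.blank("de")
custom = spacy.blank("de")
custom.tokenizer.add_special_case("-->", [{"ORTH": "-->"}])

# move the custom tokenizer onto the trained pipeline
trained.tokenizer = custom.tokenizer

print([t.text for t in trained("a --> b")])  # ['a', '-->', 'b']
```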

So, what is the ideal way to have my model tokenize according to my added special cases and then label entities the way the trained Prodigy model does?

I hope I have explained my problem well :slight_smile: