Dep.Teach doesn't use the same tokenization as the pretrained model

Hi! I always like reading about projects making use of custom factories and components :smiley:

Just to confirm: Your custom model has additional pipeline components added that perform the merging etc., right? In general, Prodigy should never mess with the existing pipeline, especially not with custom components – use cases like yours are definitely something we had in mind when designing these workflows.
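(For illustration, here's a minimal sketch of the kind of merging component I'm assuming, using the retokenizer API and spaCy v2-style add_pipe. merge_noun_chunks is just a placeholder for whatever your component actually does:)

```python
def merge_noun_chunks(doc):
    # merge each noun chunk into a single token, so e.g.
    # 'free fluid' becomes one token instead of two
    with doc.retokenize() as retokenizer:
        for np in doc.noun_chunks:
            retokenizer.merge(np)
    return doc

# spaCy v2-style: add the function directly to the pipeline,
# after the parser so doc.noun_chunks is available
nlp.add_pipe(merge_noun_chunks, after='parser')
```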

This sets up Prodigy's built-in dependency parsing model, which scores the possible analyses of the text (so we can filter and sort them by score, which is what makes the active learning possible). It also takes care of updating the model in the loop: it takes the nlp object, keeps a backup of it (available as model.orig_nlp) and then updates the other copy in the loop.
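Conceptually, you can think of it like this (a hypothetical sketch of the idea, not Prodigy's actual source code):

```python
import copy

class TeachModel(object):
    # hypothetical sketch: keep a frozen backup of the pipeline
    # and update a working copy as annotations come in
    def __init__(self, nlp):
        self.orig_nlp = copy.deepcopy(nlp)  # backup, incl. custom components
        self.nlp = nlp                      # this copy is updated in the loop
```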

I just had a look and I couldn't find anything that modifies the pipeline :thinking: As a sanity check, you could try printing nlp.pipeline and model.orig_nlp.pipeline in the recipe to make sure your component shows up in both?
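For example, right after the model is created in the recipe (assuming the variables are named nlp and model, as in the built-in recipe):

```python
# both pipelines should list your custom merging component
print(nlp.pipeline)
print(model.orig_nlp.pipeline)
```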

One possible explanation: Could you check which version of spaCy you're running? I vaguely remember an issue we fixed where custom pipeline components weren't applied when you run nlp.pipe. Internally, Prodigy uses nlp.pipe a lot because it's more efficient. You could also try running the following in your Python interpreter:

```python
docs = nlp.pipe(['small amount of simple appearing free fluid'])
doc = list(docs)[0]  # nlp.pipe returns a generator, so consume it first
for t in doc:
    print(t.text)  # are the tokens merged as expected?
```

If this shows the same unmerged tokens, we've found the source of the problem. In that case, try upgrading to the latest stable version of spaCy (pip install -U spacy) and re-running the code.
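Another quick diagnostic is to compare nlp.pipe to calling the nlp object directly (just a sketch, assuming nlp is your loaded custom model):

```python
# if the direct call produces merged tokens but nlp.pipe doesn't,
# the nlp.pipe issue described above is almost certainly the culprit
text = 'small amount of simple appearing free fluid'
direct = [t.text for t in nlp(text)]
piped = [t.text for t in list(nlp.pipe([text]))[0]]
print('direct:', direct)
print('piped: ', piped)
```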