Dep.Teach doesn't use the same tokenization as the pretrained model

Hi! I always like reading about projects making use of custom factories and components :smiley:

Just to confirm: Your custom model has additional pipeline components added that perform the merging etc., right? In general, Prodigy should never mess with the existing pipeline, especially not with custom components – use cases like yours are definitely something we had in mind when designing these workflows.
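(For illustration, here's a minimal sketch of the kind of merging component I'm assuming, using the retokenizer API and spaCy v2-style add_pipe. merge_noun_chunks is just a placeholder for whatever your component actually does:)

```python
def merge_noun_chunks(doc):
    # merge each noun chunk into a single token, so e.g.
    # 'free fluid' becomes one token instead of two
    with doc.retokenize() as retokenizer:
        for np in doc.noun_chunks:
            retokenizer.merge(np)
    return doc

# spaCy v2-style: add the function directly to the pipeline,
# after the parser so doc.noun_chunks is available
nlp.add_pipe(merge_noun_chunks, after='parser')
```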

This sets up Prodigy's built-in dependency parsing model, which scores the possible analyses of the text (so we can filter and sort them by score, which is what makes the active learning possible). It also takes care of updating the model in the loop: it takes the nlp object, keeps a backup of it (available as model.orig_nlp) and then updates the other copy in the loop.
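Conceptually, you can think of it like this (a hypothetical sketch of the idea, not Prodigy's actual source code):

```python
import copy

class TeachModel(object):
    # hypothetical sketch: keep a frozen backup of the pipeline
    # and update a working copy as annotations come in
    def __init__(self, nlp):
        self.orig_nlp = copy.deepcopy(nlp)  # backup, incl. custom components
        self.nlp = nlp                      # this copy is updated in the loop
```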

I just had a look and I couldn't find anything that modifies the pipeline :thinking: As a sanity check, you could try printing nlp.pipeline and model.orig_nlp.pipeline in the recipe to make sure your component shows up in both?
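For example, right after the model is created in the recipe (assuming the variables are named nlp and model, as in the built-in recipe):

```python
# both pipelines should list your custom merging component
print(nlp.pipeline)
print(model.orig_nlp.pipeline)
```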

One possible explanation: Could you check which version of spaCy you're running? I vaguely remember an issue we fixed where custom pipeline components weren't applied when you run nlp.pipe. Internally, Prodigy uses nlp.pipe a lot because it's more efficient. You could also try running the following in your Python interpreter:

```python
docs = nlp.pipe(['small amount of simple appearing free fluid'])
doc = list(docs)[0]  # nlp.pipe returns a generator, so consume it first
for t in doc:
    print(t.text)  # are the tokens merged as expected?
```

If this shows the same unmerged tokens, we've found the source of the problem. In that case, try upgrading to the latest stable version of spaCy (pip install -U spacy) and re-running the code.
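Another quick diagnostic is to compare nlp.pipe to calling the nlp object directly (just a sketch, assuming nlp is your loaded custom model):

```python
# if the direct call produces merged tokens but nlp.pipe doesn't,
# the nlp.pipe issue described above is almost certainly the culprit
text = 'small amount of simple appearing free fluid'
direct = [t.text for t in nlp(text)]
piped = [t.text for t in list(nlp.pipe([text]))[0]]
print('direct:', direct)
print('piped: ', piped)
```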