dep.teach doesn't use the same tokenization as the pretrained model

Hello, I just started using Prodigy, and I am trying to label data for dependency parsing. I have already trained some models using spaCy with manually labeled data and packaged the models for loading. My issue is that the label candidates don't quite match how my factories tokenize the data.

Example (my custom model is saved under the same name as the small English model):

>>> import spacy
>>> nlp = spacy.load('en_core_web_sm')
loading data for finding
loading data for pathology
loading data for anatomy
loading data for negative
>>> doc = nlp('small amount of simple appearing free fluid')
>>> for t in doc:
...  print(t.text)
...
small
amount
of
simple
appearing
free fluid

How Prodigy processes the sentence (same result without -U):
python -m prodigy dep.teach data2 en_core_web_sm ./data.txt -U

'Free fluid' should be merged into one token by one of my factories, so I would expect this label candidate not to show up. All of my factories run before the parser component, and the parser was trained with that specific tokenization. Does Prodigy remove the other factories?

I tried changing the dep.teach recipe so it doesn't disable the tagger, because disabling the tagger had broken my factories before during training (they rely on lemmas), but I get the same results in Prodigy. Is there something else I should be doing? What does the line model = DependencyParser(nlp, label=label) in dep.teach do that's different from nlp = spacy.load(spacy_model, disable=['ner'])?

Going through more examples, the labeling with my custom labels is decently accurate to what I trained the model for; it's just that Prodigy isn't recognizing the merged tokens. It does, however, respect my custom tokenizer, since it keeps dashes in phrases that the default English model would normally split, e.g. 'cul-de-sac'.

Hi! I always like reading about projects making use of custom factories and components :smiley:

Just to confirm: your custom model has additional pipeline components added that perform the merging etc., right? In general, Prodigy should never mess with the existing pipeline, especially not with custom components – workflows like yours are definitely something we had in mind when designing the recipes.

This sets up Prodigy's built-in dependency parsing model that scores the possible analyses of the text (so we can filter and sort the stream by score and make the active learning possible). It also takes care of updating the model in the loop – it takes the nlp object, makes a backup of it (model.orig_nlp) and then updates the model in the loop.

I just had a look and I couldn't find anything that modifies the pipeline :thinking: As a sanity check, you could try printing nlp.pipeline and model.orig_nlp.pipeline in the recipe and make sure that your components are in there.
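
For example, a quick version of that sanity check inside the recipe, right after the model is set up:

print(nlp.pipeline)             # the loaded model's pipeline
print(model.orig_nlp.pipeline)  # Prodigy's backup of the original nlp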

One possible explanation: Could you check which version of spaCy you're running? I vaguely remember an issue we fixed where custom pipeline components weren't applied when you run nlp.pipe. Internally, Prodigy uses nlp.pipe a lot because it's more efficient. You could also try running the following in your Python interpreter:

docs = nlp.pipe(['small amount of simple appearing free fluid'])
doc = list(docs)[0]
for t in doc:
    print(t.text)

If this shows the same unmerged tokens, we've found the source of the problem. In that case, try upgrading to the latest stable version of spaCy and run the code again.
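
To rule out an outdated install, it's worth verifying the version first:

import spacy
print(spacy.__version__)  # at the time of writing, the latest stable release is 2.0.18
# if it's older, upgrade with: pip install -U spacy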

Hello Ines, thanks for the quick reply!
Yes, I have multiple factories using the Matcher class – some based on lemmas, some on specific wording – that merge and label tokens (and will eventually add function hooks) for entity recognition down the line.

Results of printing both nlp.pipe_names and model.orig_nlp.pipe_names:

['tagger', 'measure', 'finding', 'pathology', 'anatomy', 'negative', 'phrase', 'parser']
['tagger', 'measure', 'finding', 'pathology', 'anatomy', 'negative', 'phrase', 'parser']

Results of your code snippet. It has the same merged tokens, unfortunately:

>>> docs = nlp.pipe(['small amount of simple appearing free fluid'])
>>> doc = list(docs)[0]
>>> for t in doc:
...  print(t.text)
...
small
amount
of
simple
appearing
free fluid

I am using Python 3.7 with spaCy v2.0.18, which is the latest version according to git and pip. Do you mean spacy-nightly? Unfortunately, I have issues using spacy-nightly when training with custom word vectors, and overall poor accuracy – see my post here: https://github.com/explosion/spaCy/issues/3142. Although I have a bunch of questions in that post, getting more labelled data should probably be my first priority for increasing accuracy, hence purchasing Prodigy :slightly_smiling_face:

Any other ideas on how to keep merged tokens via custom components in Prodigy? Maybe something to do with the stream? What does get_stream() pass back – individual tokens or full sentences?

return {
    'view_id': 'dep',
    'dataset': dataset,
    'stream': prefer_uncertain(model(stream)),
    'update': model.update,  # callback to update the model in-place
    'exclude': exclude,
    'config': {'lang': model.nlp.lang}
}

Thanks for testing this! (I definitely meant the stable version 2.0.18, yes. The current Prodigy isn’t really ready for the nightly yet, at least not officially.)

But I think I figured it out, damn :woman_facepalming: If your rules are based on the lemmas, they’re also based on the tagger – because the lemmatizer needs part-of-speech tags (at least in English, where it’s rule-based). When creating the initial Doc objects to compute all possible analyses on, Prodigy does set disable=["ner", "parser", "tagger"] to speed things up. It doesn’t take into account that other components in the pipeline may rely on those to further modify the doc – like your component that merges based on lemmas. This is obviously bad.
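
You can see that dependency directly with a plain install of the English model – a quick sketch (the exact lemmas may vary by version, since without the tagger spaCy falls back to lookup lemmatization):

import spacy

nlp_full = spacy.load('en_core_web_sm')
nlp_no_tagger = spacy.load('en_core_web_sm', disable=['tagger'])

text = 'small amounts of free fluids'
print([t.lemma_ for t in nlp_full(text)])       # lemmas informed by POS tags
print([t.lemma_ for t in nlp_no_tagger(text)])  # may differ without the tagger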

As a quick workaround, you could try renaming the built-in tagger to something else, like "tagger2", and adding a factory entry that resolves it to spaCy's Tagger. Your model's pipeline can then specify "tagger2", which won't be affected when "tagger" is disabled.

from spacy.language import Language
from spacy.pipeline import Tagger

# register a factory so models whose pipeline lists "tagger2" can be loaded
Language.factories['tagger2'] = lambda nlp, **cfg: Tagger(nlp.vocab, **cfg)
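
On your side, the renaming itself only needs to happen once before you save out the model – something like:

nlp.rename_pipe('tagger', 'tagger2')  # rename the component in the pipeline
nlp.to_disk('/path/to/model_dir')     # placeholder path for your package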

We'll think about how to best solve this in Prodigy's internals. We've also been thinking about ways to make pipeline components in spaCy expose more information about themselves – like what they modify and whether they depend on other components. At the moment, custom components are kind of a black box to spaCy, which makes it difficult to detect problems like this automatically.

Thanks for the explanation. I've been a bit busy, so I apologize for getting back to you so late. I tried your proposed solution; all I did was the following:

Rename the tagger before training:
nlp.rename_pipe('tagger', 'tagger2')

And when I package the model, I add this to __init__.py along with the other factories:

from spacy.pipeline import Tagger
Language.factories['tagger2'] = lambda nlp, **cfg: Tagger(nlp.vocab, model=True, **cfg) 

Results:

>>> import spacy
>>> nlp = spacy.load('en_core_web_sm')
loading data for finding
loading data for pathology
loading data for anatomy
loading data for negative
>>> docs = nlp.pipe(['no free fluid within pelvis'])
>>> doc = list(docs)[0]
>>> for t in doc:
...  print(t.text)
...
no
free fluid
within
pelvis
>>> print(nlp.pipe_names)
['tagger2', 'measure', 'finding', 'pathology', 'anatomy', 'negative', 'phrase', 'parser']

Prodigy:
(printed pipe): ['tagger2', 'measure', 'finding', 'pathology', 'anatomy', 'negative', 'phrase', 'parser']
[screenshot: Prodigy UI still showing the unmerged tokens]

Unfortunately, I have the same problem. If I rename the component like that, will it still be picked up by disable? At first I tried writing a custom component, but the Tagger's serialization/deserialization methods confused me (at least as they're done in the source code), so I opted for the easy way out via nlp.rename_pipe(). Any other suggestions? Should I write a custom Tagger component?

Thanks for your patience on this! This is honestly super confusing – I could have sworn that it was related to the tagger being disabled :thinking: Don't spend time on a custom tagger; I doubt it would make a difference. If your model's pipeline specifies "tagger2", it definitely shouldn't be affected if you disable the component "tagger". See here for the implementation – it really only checks the component string names.

I think I’ll just need to try and reproduce this. In the case of “free fluid”, how does your component produce this match? Do you use the matcher/phrase matcher and then merge the matched span? Which attributes does it look at?

It's definitely good that this came up, though, so we can come up with a solution for the more general case of Prodigy plus a spaCy model that merges tokens in the pipeline. Even if we get this figured out, there's still a possibility that you'll run into a problem when updating the parser in the loop. I'll have to double-check this, but there might be logic where Prodigy assumes that nlp.make_doc (i.e. the tokenizer) will produce the same tokens as nlp or nlp.pipe.
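
In the meantime, you can check whether that assumption holds for your model with a quick comparison:

text = 'small amount of simple appearing free fluid'
print([t.text for t in nlp.make_doc(text)])  # tokenizer only, no components
print([t.text for t in nlp(text)])           # full pipeline, including your merges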

Hi Ines,
Here is the specific factory that merges 'free fluid'. I merge matches from the matcher, give them the noun tag, use the last token's lemma as the new lemma, and give the span an entity label:

import json

import numpy
from spacy.matcher import Matcher
from spacy.tokens import Doc, Span, Token


class FindingsRecognizer(object):
    # name of pipeline component
    name = 'finding'

    # patterns can be supplied via add_pipe() through cfg['data']
    def __init__(self, nlp, label='FIND', **cfg):
        self.label = label
        self.matcher = Matcher(nlp.vocab)
        self.patterns = []
        if cfg.get('data') and not self.patterns:
            for p in cfg.get('data'):
                self.patterns.append([{'LEMMA': t} for t in p.split()])
            self.matcher.add(self.label, None, *self.patterns)

        Token.set_extension('is_finding', default=False, force=True)
        Doc.set_extension('has_finding', getter=self.has_finding, force=True)
        Span.set_extension('has_finding', getter=self.has_finding, force=True)

    # attaches the attribute to matched phrases and merges them
    def __call__(self, doc):
        doc.tensor = numpy.zeros((0,), dtype='float32')
        matches = self.matcher(doc)
        spans = []
        for matchid, start, end in matches:
            entity = Span(doc, start, end, label=matchid)
            try:
                spans.append(entity)
                for token in entity:
                    token._.set('is_finding', True)
                doc.ents = list(doc.ents) + [entity]
            except Exception:
                # overlapping entities (e.g. 'free fluid' vs. 'fluid') raise E098
                continue
        for span in spans:
            span.merge('NN', span[-1].lemma_, span.label_)
        return doc

    # returns True if the doc/span contains a finding token
    def has_finding(self, tokens):
        return any([t._.get('is_finding') for t in tokens])

    # when loading the pretrained model
    def from_disk(self, path, **cfg):
        print("loading data for", self.name)
        with open(path / 'data.json') as f:
            self.patterns = json.load(f)
        self.matcher.add(self.label, None, *self.patterns)

    def to_disk(self, path, **cfg):
        if not path.exists():
            path.mkdir()
        with open(path / 'data.json', 'w+') as f:
            json.dump(self.patterns, f)

    def from_bytes(self, bytes_data):
        pass

    def to_bytes(self, **cfg):
        pass

The try-catch is for cases where it tries to merge a token that has already been merged (like having a rule for both 'free fluid' and 'fluid'). 'Free fluid' is just one of many finding names I keep in a database, where I store the full name along with the lemma generated by the same base tagger of the 'en' model, so I can easily generate lemma matcher rules like this without needing a loaded model's tagger:
[{'LEMMA': 'free'}, {'LEMMA': 'fluid'}]

This matcher only looks at lemmas. Is there another attribute I could use, or a combination of rules, to match on lemmas without the tagger? I would just hate to have to save out variants like 'free fluids', 'strandings', 'densities', etc. for all my terms, and similarly for pathologies and anatomies.
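
If lemma matching really does require the tagger, I suppose I could generate the surface variants programmatically instead of saving them all out by hand – a rough sketch of what I mean (the pluralization here is deliberately naive, just for illustration):

# hypothetical helper: expand a stored term into LOWER-based patterns
def lower_patterns(term):
    words = term.lower().split()
    base = [{'LOWER': w} for w in words]
    # naively pluralize the last word; real terms would need better rules
    plural = base[:-1] + [{'LOWER': words[-1] + 's'}]
    return [base, plural]

patterns = []
for term in ['free fluid', 'stranding', 'density']:
    patterns.extend(lower_patterns(term))
# self.matcher.add(self.label, None, *patterns)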

If there’s any other code snippet you need, let me know.

I don't think we really have a satisfying workflow for this in spaCy generally, especially during training. The nlp.update() method assumes the components aren't modifying the Doc object, so we can't easily learn a pipeline with these dependencies between components. So it does make sense that there are some difficulties here.

That said, I’m definitely confused by the specific behaviours here. I’ve reread the code as well, and as Ines said, I’d expect it to work with the tagger renamed to tagger2. Have you tried printing things within the __call__ methods of your components, just to check they’re being executed? If so, what was the result?

Hello, thanks for the insight.
Rather than describe where I put the print statements, I'll just post the __call__ with the print statements and some other adjustments :slight_smile:

def __call__(self, doc):
    doc.tensor = numpy.zeros((0,), dtype='float32')
    matches = self.matcher(doc)
    spans = []
    for matchid, start, end in matches:
        entity = Span(doc, start, end, label=matchid)
        print('Found: ' + entity.text)
        try:
            spans.append((entity, entity.text))
            for token in entity:
                token._.set('is_finding', True)
            doc.ents = list(doc.ents) + [entity]
        except Exception as e:
            print('Warning: ' + str(e))
            continue
    print('Available Spans: %s' % spans)
    for s, txt in spans:
        newtok = s.merge('NN', s[-1].lemma_, s.label_)
        if newtok:
            print('Merging: ' + newtok.text)
        else:
            print('Merging failed: ' + txt)
    return doc

Here's normal use within the Python terminal:

>>> import spacy
>>> nlp = spacy.load('en_core_web_sm')
loading data for finding
loading data for pathology
loading data for anatomy
loading data for negative
>>> doc = nlp('no free fluid or fluid collection within pelvis')
Found: free fluid
Found: fluid
Warning: [E098] Trying to set conflicting doc.ents: '(1, 3, 'FIND')' and '(2, 3, 'FIND')'. A token can only be part of one entity, so make sure the entities you're setting don't overlap.
Found: fluid
Found: fluid collection
Warning: [E098] Trying to set conflicting doc.ents: '(4, 5, 'FIND')' and '(4, 6, 'FIND')'. A token can only be part of one entity, so make sure the entities you're setting don't overlap.
Available Spans: [(free fluid, 'free fluid'), (fluid, 'fluid'), (fluid, 'fluid'), (fluid collection, 'fluid collection')]
Merging: free fluid
Merging failed: fluid
Merging: fluid
Merging: fluid collection
>>> print([t.text for t in doc])
['no', 'free fluid', 'or', 'fluid collection', 'within', 'pelvis']
>>> print(nlp.pipe_names)
['tagger2', 'measure', 'finding', 'pathology', 'anatomy', 'negative', 'phrase', 'parser']

From Prodigy terminal:

(base) C:\Users\carlson.hang\Desktop\Code\DepTraining\Trainer\flaskr\data>python -m prodigy dep.teach test en_core_web_sm ./testdata.txt -U
loading data for finding
loading data for pathology
loading data for anatomy
loading data for negative
['tagger2', 'measure', 'finding', 'pathology', 'anatomy', 'negative', 'phrase', 'parser']
['tagger2', 'measure', 'finding', 'pathology', 'anatomy', 'negative', 'phrase', 'parser']
Added dataset test to database SQLite.

  ✨ Starting the web server at http://localhost:8080 ...
  Open the app in your browser and start annotating!

10:17:26 - Task queue depth is 1
Found: free fluid
Found: fluid
Warning: [E098] Trying to set conflicting doc.ents: '(1, 3, 'FIND')' and '(2, 3, 'FIND')'. A token can only be part of one entity, so make sure the entities you're setting don't overlap.
Found: fluid
Found: fluid collection
Warning: [E098] Trying to set conflicting doc.ents: '(4, 5, 'FIND')' and '(4, 6, 'FIND')'. A token can only be part of one entity, so make sure the entities you're setting don't overlap.
Available Spans: [(free fluid, 'free fluid'), (fluid, 'fluid'), (fluid, 'fluid'), (fluid collection, 'fluid collection')]
Merging: free fluid
Merging failed: fluid
Merging: fluid
Merging: fluid collection

And unfortunately, here are the screenshots:
[screenshots: Prodigy UI still showing the unmerged tokens]

For this test, the only text in that file is the one shown. Again, I skip over the entity conflicts to prioritize labeling and merging the larger phrase. My factory seems to work exactly as it does in normal use. Very confusing as to how Prodigy determines which candidates to show.

If there’s anything else you would like me to test, let me know.

Currently, this is somewhat halting my project (unless I go back to manually labeling data). I might try to change my solution to depend on single-word tokens and rely more on NER, but I feel that having merged tokens (especially in medical terminology) and training word vectors on merged tokens would definitely be a good approach to my problem.

Edit: I also use a custom tokenizer class, if that matters. It doesn't seem to affect normal usage, but might it affect Prodigy? It essentially uses all the 'en' defaults except that I don't split on '-'. I also realized all my factories utilize lemmas in some way, so just to test something without lemmas, here's a rule from my measure factory:
[{'LIKE_NUM': True}, {'LOWER': 'by', 'OP': '?'}, {'LOWER': 'x', 'OP': '?'}, {'LIKE_NUM': True}, {'LEMMA': 'centimeter'}]
And I changed it to this, without the lemma:
[{'LIKE_NUM': True}, {'LOWER': 'by', 'OP': '?'}, {'LOWER': 'x', 'OP': '?'}, {'LIKE_NUM': True}]
Here's the result (omitting the other output):

>>> print([t.text for t in doc])
['2 x 2', 'centimeter', 'free fluid', 'in', 'abdomen']

[screenshot: Prodigy UI showing the tokens unmerged]

Do the above rules require the tagger too? I must be doing something wrong with Prodigy, but I’m not sure what that would be.

This is definitely very helpful, because it pretty much shows that the problem must be that the pipeline components just aren't run. All of these attributes are basic lexical attributes that don't depend on anything else.

Here's a hacky idea while we try to get to the bottom of this: So this is really quite hacky, but, in theory, you should be able to apply the tagger and your custom pipeline components within the custom tokenizer, after you've created the Doc object. So basically, like this:

def custom_hacky_tokenizer(text):
    # load the tagger weights from your packaged model (the path is a placeholder)
    tagger = nlp.create_pipe('tagger')
    tagger.from_disk('/path/to/model_dir/tagger')
    # create the Doc with your existing tokenization, then apply the components
    doc = your_regular_tokenizer(text)
    tagger(doc)
    one_of_your_custom_components(doc)
    return doc
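
If it's easier to manage, the same idea can live in a class that sets everything up once instead of on every call – a sketch, where the component list is whatever your model needs:

class HackyTokenizer(object):
    def __init__(self, nlp, components):
        # grab the model's original tokenizer before this object replaces it
        self.tokenizer = nlp.tokenizer
        self.components = components  # e.g. [tagger, finding, pathology, ...]

    def __call__(self, text):
        doc = self.tokenizer(text)
        for component in self.components:
            doc = component(doc)  # spaCy components return the modified Doc
        return doc

# usage: nlp.tokenizer = HackyTokenizer(nlp, [tagger, finding_component])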

Hi Ines, thanks for the workaround. It works!
[screenshot: Prodigy UI showing the merged tokens]

The only components in my pipeline now are the tokenizer, tagger, and parser. Of course, from all the attempts above, having the tagger in the pipeline doesn't fix anything, but I needed a place to load its model from within my tokenizer, so I saved a tagger with my model. This took me a while, since I had to figure out how to initialize all my factories from within the tokenizer. That meant the tokenizer needed access to my factories' data, so I had to write a custom tokenizer class with to_disk and from_disk methods, etc. And initializing the tagger gave me a lot of issues for a while…

But at least it works! :smiley:

Let me know if you ever figure out the problem with initializing the factories the normal way – customizing the pipeline this way is definitely harder to read and understand. And thanks for helping me through this – you all do great work!

Edit: actually, now that I think about it, I could've just invoked to_disk/from_disk for all my factories from my tokenizer without needing to provide the data to my tokenizer :man_facepalming:

Hello!
I have a similar problem with a spaCy dependency model: I swap the tokenizer for a custom one (with to_disk and from_disk methods), but dep.teach displays texts tokenized with the default one.
Here's what the pipeline does:

>>> for t in doc:
...     print(t.text)
...
▁текстильный ▁материал ▁Primeknit ▁и ▁подошва ▁целиком , ▁включая ▁подмётку ▁, ▁чей ▁коричневый ▁оттенок ▁последние ▁несколько ▁релизов ▁силуэта ▁оставался ▁неизменным ▁. ▁_ ▁_ ▁_ ▁_ ▁_ ▁_ ▁_ ▁_ ▁_ ▁_ ▁_ ▁_ ▁_ ▁_ ▁_ ▁_ ▁_ ▁_ ▁_ ▁_ ▁_ ▁_ ▁- ▁БЫЛО ▁1 8,990 Р + ▁ СТАЛО ▁18,490 Р ▁. ▁

And this is what is displayed:

[screenshot: dep.teach showing the default tokenization]

@kak-to-tak How is your custom tokenizer implemented? Prodigy will use the model's nlp.make_doc method to create a tokenized Doc from the string of text. By default, this will call into nlp.tokenizer. So your custom tokenization should be implemented via the model's tokenizer.
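
For example, if your tokenizer is a class that takes the shared vocab, the load logic in your package would do something like this (CustomTokenizer is a placeholder for your own class):

import spacy
from my_package import CustomTokenizer  # hypothetical import for your own class

nlp = spacy.load('/path/to/your_model')     # placeholder path
nlp.tokenizer = CustomTokenizer(nlp.vocab)  # nlp.make_doc will now use it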

Alternatively, you can also feed in pre-tokenized data that has a "tokens" property. See here for an example of the format: https://prodi.gy/docs/api-interfaces#dep
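
For instance, a pre-tokenized task in the stream could look something like this (a sketch of the format from the docs above; start and end are character offsets into the text):

{
    "text": "no free fluid within pelvis",
    "tokens": [
        {"text": "no", "start": 0, "end": 2, "id": 0},
        {"text": "free fluid", "start": 3, "end": 13, "id": 1},
        {"text": "within", "start": 14, "end": 20, "id": 2},
        {"text": "pelvis", "start": 21, "end": 27, "id": 3}
    ]
}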

Thank you!
