terms.teach hangs indefinitely with a custom word vector model

terms
solved

#1

I have a custom spaCy model that uses custom word vectors.

The word vectors work fine in the spacy model:

>>> nlp = spacy.load('test_model')
>>> nlp.vocab.length
467868
>>> nlp.vocab.vectors_length
300
>>> nlp.vocab.has_vector('وحش')
True

Loading the same model with prodigy:

pgy terms.teach dataset test_model -s "وحش"
Initialising with 1 seed terms: وحش

  ✨  Starting the web server at http://localhost:8080 ...
  Open the app in your browser and start annotating!

All looks fine, no errors, but I don’t get any terms to annotate – the webpage just hangs with a “Loading…” message.

Any insight or further checks I can do would be greatly appreciated.

All data is in Arabic, but that doesn’t seem to be an issue with other parts of prodigy!

Spacy version: 2.0.18
Prodigy version: 1.6.1


(Ines Montani) #2

Hi! I think I know what might be going on here: when terms.teach loops over the entries in the vocab, it does the following:

lexemes = [lex for lex in stream if lex.is_alpha and lex.is_lower]

is_lower actually returns False for the Arabic tokens – it delegates to Python’s native islower(), which is also False here. (Interestingly, Python’s isupper() is False, too. I guess all of this is logical, because there’s no uppercase/lowercase distinction in Arabic, right?)
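You can see this with plain Python – Arabic script is uncased, so both case predicates come back False:

```python
# Arabic has no case distinction, so str.islower() and str.isupper()
# both return False – there are no cased characters in the string.
word = "وحش"
print(word.isalpha())   # True
print(word.islower())   # False
print(word.isupper())   # False
```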

So as a quick fix, removing and lex.is_lower in recipes/terms.py should do the trick.


#3

Hey Ines

Yes, correct – there’s no upper/lower case in Arabic, and is_alpha returns True.

Changed to:

lexemes = [lex for lex in stream if lex.is_alpha]

and it worked!! :muscle::muscle:
Thanks a lot Ines, appreciate the quick reply.

Side note: the prodigy package is written beautifully.


#4

Follow-up question: does this also affect the pattern-matching functionality:

{"label":"NEGATIVE","pattern":[{"lower":"قرف"}]}

As I don’t seem to be getting any pattern matches with:

prodigy textcat.teach classifier spacy_model data.txt --label NEGATIVE --patterns seed_file.jsonl

I checked with spaCy and I do get matches with the LOWER pattern on Arabic:

from spacy.matcher import Matcher

# nlp = the custom spaCy model loaded earlier
matcher = Matcher(nlp.vocab)
pattern = [{'LOWER': "قرف"}]
matcher.add("HelloWorld", None, pattern)

doc = nlp('خر قرف شي طايس')
matches = matcher(doc)
for match_id, start, end in matches:
    print(match_id, start, end, doc[start:end])

(Ines Montani) #5

In theory, it shouldn’t – matching on the lower attribute just means that the matcher will compare the lowercase forms of both tokens (which should be identical either way – I just checked for the string you provided and it seems like token.lower_ == token.text in the case of Arabic).
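This is easy to verify in plain Python (spaCy computes the lowercase form with str.lower() under the hood):

```python
# For an uncased script like Arabic, lowercasing is a no-op,
# so the lowercase form is identical to the original text.
text = "قرف"
print(text.lower() == text)   # True
```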

Under the hood, Prodigy calls into spaCy’s Matcher and PhraseMatcher. Can you double-check you’re on the latest Prodigy version and that the string you’ve tested in spaCy directly also appears in your data?

Because the textcat.teach recipe prioritises the most uncertain predictions, it’s possible that you won’t see all suggestions and all pattern matches. But this shouldn’t really be happening in the beginning. As a quick sanity check, you could also try running ner.match with your data and your patterns. This won’t filter the incoming examples and just show you all pattern matches in your data.
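For example, the ner.match invocation could look like this (matches_dataset is a hypothetical new dataset name; the model and file names are reused from your textcat.teach command):

```shell
prodigy ner.match matches_dataset spacy_model data.txt --patterns seed_file.jsonl
```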


#6

I’ve tested the matching in spaCy and it works fine.

I’m on:
spaCy version: 2.0.18 & Prodigy version: 1.6.1
which are the latest.

I also did get a couple of matches yesterday, but I had to go through ~500 tags to see two pattern matches.

Just tried ner.match – working fine and picking up a pattern in every example.

So this must be something with how Prodigy selects the next example to show in textcat.teach. I’m expecting to see more pattern matches at the beginning of the tagging session; as it stands, the session is very unproductive because I have to reject a large number of non-relevant examples.

Note: I’m using a custom language model & custom embeddings.


(Ines Montani) #7

Thanks for the updates! In general, Prodigy will try to give you a good mix of pattern matches (especially in the beginning) and predictions, slowly focusing more on predictions than matches. But depending on the frequency of the matches and what the model is already predicting, it’s possible that this currently doesn’t always produce enough matches.

If you have a decent amount of patterns, maybe it makes sense to start off by annotating only matches and move on to annotating with a model in the loop later? You could use ner.match or build a slightly modified version that outputs tasks in the text classification style (with a label on top and no label next to the span). See here for a simplified version of the recipe, or check out recipes/ner.py in your Prodigy installation.

If you set label_span=False and label_task=True on the PatternMatcher, it’ll produce text classification tasks (top-level label, no label on the span):

# Initialize the pattern matcher and load in the JSONL patterns
matcher = PatternMatcher(nlp, label_span=False, label_task=True).from_disk(patterns)

Make sure to also set 'view_id': 'classification' to use the text classification interface.


#8

Oh great okay that seems to work well:

import spacy

from prodigy.core import recipe, recipe_args
from prodigy.components.db import connect
from prodigy.components.loaders import get_stream
from prodigy.models.matcher import PatternMatcher
from prodigy.util import log


@recipe('textcat.bootstrap',
        dataset=recipe_args['dataset'],
        spacy_model=recipe_args['spacy_model'],
        source=recipe_args['source'],
        api=recipe_args['api'],
        loader=recipe_args['loader'],
        patterns=recipe_args['patterns'],
        exclude=recipe_args['exclude'],
        resume=("Resume from existing dataset and update matcher accordingly",
                "flag", "R", bool))
def match(dataset, spacy_model, patterns, source=None, api=None, loader=None,
          exclude=None, resume=False):

    log("RECIPE: Starting recipe textcat.bootstrap", locals())
    DB = connect()
    # Create the model, using a pre-trained spaCy model.
    model = PatternMatcher(spacy.load(spacy_model), label_span=False, label_task=True).from_disk(patterns)
    log("RECIPE: Created PatternMatcher using model {}".format(spacy_model))
    if resume and dataset is not None and dataset in DB:
        existing = DB.get_dataset(dataset)
        log("RECIPE: Updating PatternMatcher with {} examples from dataset {}"
            .format(len(existing), dataset))
        model.update(existing)
    stream = get_stream(source, api=api, loader=loader, rehash=True,
                        dedup=True, input_key='text')
    return {
        'view_id': 'classification',
        'dataset': dataset,
        'stream': (eg for _, eg in model(stream)),
        'exclude': exclude
    }

But this doesn’t accept a label, so I’m assuming I have to run this first and then use textcat.teach to go over the examples collected here again.