Training a grammar tool

Hi

I’m working on a problem where I need to classify certain advanced grammatical errors. So not spelling mistakes, but more complex stuff like how a sentence is structured, comma rules and other things specific to the Norwegian language (two words that should be concatenated into one, for example).

I’m hoping that it is possible to use some combination of tags, the dependency parse or a text classifier to classify these mistakes.

For example, in Norwegian there should never be two nouns after one another, as they should always be concatenated. The same is sometimes true for verb + noun combinations, because they are often words that have mistakenly been split up. The first could be identified with a simple rule based on the tags (noun + noun); the latter, however, must be trained and should look at both the tags and the dependencies.

I have just started adding a Norwegian model/parser to spaCy, so I have the dependencies and tags. But would it work to train a text classifier using this? Or should I create a custom dependency class?

Any hints on how to start on this problem would be much appreciated. I’m still reading/learning about spaCy and Prodigy.

I have also looked a bit into training a custom dependency tree, as was mentioned here in the support forum earlier. If there is a beta version with dependency annotation available and this would help, I would really like to try it and give feedback if needed :slight_smile:

This is a cool idea and actually a pretty nice use case for Prodigy. Your plan sounds reasonable and I’m pretty confident you’ll be able to train a classifier to detect BAD_GRAMMAR. But ultimately, it comes down to experimenting with different approaches.

I think the first step could be to narrow down the selection of examples you’re annotating. In order to train a classifier, you need enough “positive” examples to start with, so that the model can learn what you’re looking for (instead of only what you’re not looking for). You can include this logic in a custom recipe to filter your stream, or use it as a separate pre-processing script that you run over your corpus to extract data for annotation.

spaCy’s rule-based matcher is pretty powerful and lets you create patterns as a list of dictionaries, with one dictionary describing a token. You can specify one or more token attributes and their values – for example, part-of-speech tags, dependencies, the lowercase or exact text, the token’s shape, or boolean flags like IS_PUNCT, LIKE_NUM etc. Here are some possible example patterns:

bad_grammar_patterns = [
    [{'POS': 'NOUN'}, {'POS': 'NOUN'}],    # two consecutive nouns
    [{'LOWER': 'foo'}, {'LOWER': 'bar'}],  # wrong spelling of "foobar" as two words
    [{'ORTH': '.'}, {'IS_LOWER': True}]    # a period followed by a lowercase word
]

Just be creative and try out different options :smiley: Even if a pattern doesn’t always mean bad grammar, it can still be a good idea to include it. You’ll be annotating and accepting/rejecting the results anyways, and giving the model examples of combinations that are only sometimes correct in certain contexts might actually have a positive impact on accuracy. Also, when creating the patterns, make sure to double-check that spaCy tokenizes the text the way you think it does – especially if you’re including punctuation or more complex rules.

import spacy
from spacy.matcher import Matcher

nlp = spacy.load('your_nb_model')
matcher = Matcher(nlp.vocab)
matcher.add('BAD_GRAMMAR', None, *bad_grammar_patterns)

# do this for lots of documents
doc = nlp(SOME_NORWEGIAN_TEXT) 
matches = matcher(doc)

If the Doc contains a match, you can add its text to your list of annotation examples, and save them out in a format like JSONL to annotate them with Prodigy. Instead of using the textcat.teach recipe straight away, you might want to try using the mark recipe first to go through the examples without the active learning component. Scoring and resorting the stream will be useful later on – but for now, you just want to annotate as many examples as possible based on your patterns:

prodigy mark nb_bad_grammar data.jsonl --label BAD_GRAMMAR --view-id classification

Since the mark recipe will just render the task as it comes in, you could also add a "spans" property to the task to highlight the span of text matched by your list of bad grammar patterns. This can be useful to debug and improve the patterns, and makes it easier to see why that text was selected for annotation. The Matcher gives you the start and end token of each match, so you can create the spans programmatically:

spans = []
for match_id, start, end in matches:
    span = doc[start:end]  # the matched slice of the doc
    spans.append({'start': span.start_char, 'end': span.end_char})
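
Putting those pieces together, a rough export script could look like this. Treat it as a sketch: iter_norwegian_texts() and the file name are placeholders, and you may want to tweak the task fields to fit your workflow.

import json
import spacy
from spacy.matcher import Matcher

nlp = spacy.load('your_nb_model')
matcher = Matcher(nlp.vocab)
matcher.add('BAD_GRAMMAR', None, *bad_grammar_patterns)  # patterns from above

with open('data.jsonl', 'w', encoding='utf8') as f:
    for doc in nlp.pipe(iter_norwegian_texts()):  # placeholder for your corpus
        matches = matcher(doc)
        if not matches:
            continue  # only keep texts with at least one pattern match
        spans = []
        for match_id, start, end in matches:
            span = doc[start:end]
            spans.append({'start': span.start_char, 'end': span.end_char})
        f.write(json.dumps({'text': doc.text, 'spans': spans}) + '\n')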

Assuming the matcher rules are detailed enough and your data contains enough examples, you should be able to have a decent number of accepts for your “bad grammar” dataset to get over the “cold start problem”. Not all matches are going to be examples of bad grammar – but this is actually very good, and potentially lets you find ambiguities and exceptions to the rules that are important to learn as well.

Once you’ve annotated a decent number of examples (like, a few hundred, ideally with at least 50% accept), you can run textcat.batch-train and see what the model is learning so far. If the results look promising, you can also run textcat.train-curve to show training results after using 25%, 50%, 75% and 100% of the training data. If you see an increase in accuracy within the last 25% (i.e. between 75% and 100% of the data), it’s likely that the accuracy will improve further if you collect more examples similar to the existing training data.
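
The commands for those two recipes follow the same pattern as the other recipes; roughly something like this (dataset and model names are placeholders):

prodigy textcat.batch-train nb_bad_grammar your_nb_model --output /path/to/model
prodigy textcat.train-curve nb_bad_grammar your_nb_model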

The next step could be to export your trained textcat model and improve it by running it over raw, unfiltered data using textcat.teach. This means that you’re letting Prodigy use your pre-trained classifier to select examples from the stream, based on the BAD_GRAMMAR scores it assigns for each text. By default, Prodigy will select the examples your model is most uncertain about, i.e. the ones with a prediction closest to 50/50.

The model you specify on the command line can be a model package, or a path to a model data directory. So you can simply point it to the directory created by the previous batch-train step:

prodigy textcat.teach nb_bad_grammar /path/to/model raw_data.jsonl --label BAD_GRAMMAR

What happens next is hard to predict and depends on your data. If accuracy improves further after training on the additional examples generated with textcat.teach, you’re likely on the right track. If not, you might want to go back to the previous step, annotate more examples using the patterns and pre-train the model some more, before using the active learning-powered recipes.

I hope this was helpful so far. Good luck and definitely keep us updated on the progress!


Thank you for a detailed and well-written reply! I took a quick look at the patterns documentation and it seems really good :slight_smile:

I will try this in the next couple of days, and let you know how it goes :slight_smile:

I’m having some trouble with creating the correct patterns. Is there a way to match an array of different possibilities?

Here is an example:

I need to match a verb in infinitive form, which means either AUX or VERB with the morphology Inf.

[
  {'TAG': 'VERB__VerbForm=Inf|Voice=Pass'},
  {'TAG': 'VERB__VerbForm=Inf'},
  {'TAG': 'AUX__VerbForm=Inf'}
]

Or rather, I need this pattern:

and_abs_rule = [
    [{"VERB_INF_FLAG": False}, {"LOWER": "og"}, {"VERB_INF_FLAG": True}],
]
matcher.add(AND_ERROR, None, *and_abs_rule)

I have tried all sorts of stuff to get this to work, but nothing is matching :-/

Unfortunately there are only so many slots in the TokenC struct, and the matcher doesn’t allow set values. So you would have to add rules for each combination of your tags. In this case you’d have 9 combinations, which isn’t terrible. In other situations the cross-product approach doesn’t work well.

There are a few workarounds you could try. It could be that the dependency label makes your rule easier to write. You could also simply write to the token.tag_ attribute, setting the tag to one which has the distinction you want. Finally, you could make the pattern more general than you need, and then filter out the incorrect matches in the on_match callback.
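
If you do go with one rule per tag combination, the patterns can at least be generated programmatically rather than written out by hand. A rough sketch, reusing the tag strings from the example above:

import itertools
import spacy
from spacy.matcher import Matcher

nlp = spacy.load('your_nb_model')

INF_TAGS = ['VERB__VerbForm=Inf|Voice=Pass', 'VERB__VerbForm=Inf', 'AUX__VerbForm=Inf']

# One pattern per combination of tags for the two verb slots (3 x 3 = 9 patterns).
# Matches you don't want can then be filtered out in the on_match callback.
and_abs_rules = [
    [{'TAG': t1}, {'LOWER': 'og'}, {'TAG': t2}]
    for t1, t2 in itertools.product(INF_TAGS, INF_TAGS)
]

matcher = Matcher(nlp.vocab)
matcher.add('AND_ERROR', None, *and_abs_rules)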

Another thing that might help you is the dependency pattern matcher contributed to spaCy 1.9, which we haven’t ported to v2 yet: https://github.com/explosion/spaCy/pull/1120 . This should be easy to port, and might make the patterns you want to define easier to write.

Finally, one thing to be aware of when doing error correction is that the parser might not behave reliably on text with grammatical errors. One way to address this would be to do data augmentation before training the parser, so that you introduce some examples of sentences with the errors.

Another idea that goes one step further is to try to get the parser to actually identify the errors for you, by packing the error tags into the dependency labels. It does seem like a good task for a joint model.

I feel like the matching patterns will always run into some sort of limitation. Are you considering the option of adding matchers with a matching function instead? By that I mean that, instead of only accepting a list of dictionaries, you would allow us to provide a function where we loop through the words ourselves.

Another possible change to the matching logic would be to allow nested lists for alternative token matches. So something like this:

and_abs_rule = [[
    [{'TAG': 'VERB__VerbForm=Inf|Voice=Pass', "OP": "!"}, {'TAG': 'VERB__VerbForm=Inf', "OP": "!"}, {'TAG': 'AUX__VerbForm=Inf', "OP": "!"}],
    {"LOWER": "og"},
    [{'TAG': 'VERB__VerbForm=Inf|Voice=Pass'}, {'TAG': 'VERB__VerbForm=Inf'}, {'TAG': 'AUX__VerbForm=Inf'}],
]]
matcher.add(AND_ERROR, None, *and_abs_rule)

I think for now I will go with the duplicate matches and just filter them out.

You should be able to do that already, although some of the details aren't documented -- here's what you need to know.

The Prodigy PatternMatcher object has a method add_matcher(), which takes a callable that has the same API as spaCy's Matcher. Specifically, the callable should take a doc object and return a list of (pattern_id, start_token, end_token) tuples. The tricky thing is the pattern_id needs to be a string formatted like {entity_label}-pattern-{index}, where index should be sequential. For instance, a well-formed pattern ID would be ORG-pattern-0.

You can add multiple matcher functions to a single PatternMatcher, which should make it easier to manage your matching logic.
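
Based on that description, a minimal custom matcher callable might look roughly like this (a sketch only; the surrounding PatternMatcher setup is assumed to exist in your recipe):

def verb_inf_matcher(doc):
    # Return (pattern_id, start_token, end_token) tuples for "og" followed by an
    # infinitive, where the preceding token is not an infinitive. The pattern_id
    # string follows the '{label}-pattern-{index}' convention described above.
    matches = []
    for i in range(1, len(doc) - 1):
        if (doc[i].lower_ == 'og'
                and 'VerbForm=Inf' in doc[i + 1].tag_
                and 'VerbForm=Inf' not in doc[i - 1].tag_):
            matches.append(('BAD_GRAMMAR-pattern-0', i - 1, i + 2))
    return matches

# pattern_matcher.add_matcher(verb_inf_matcher)  # pattern_matcher: your PatternMatcher instance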

Could you maybe provide an example of how to filter out results from matches using the on_match method?

I have tried this:

def filter_det_pos(matcher, doc, i, matches):
    match_id, start, end = matches[i]
    span = doc[start : end]
    if span[0].pos_ in ["PROP" "DET"]:
        print(span[0])
        del matches[i] # Tried deleting it, but this is not the real matches list
        return False # I have tried returning False
        # return matches # I have tried returning the modified matches list

matcher.add(ADJ_NOUN_SPLIT_ERROR, filter_det_pos, *word_split_adj_noun_abs)

Sorry there was a typo in my code, for the if condition:

I thought this worked, but now it does not… How can I delete a match from matches?

def filter_det_pos(matcher, doc, i, matches):
    match_id, start, end = matches[i]
    span = doc[start : end]
    if span[0].pos_ in ["PROP", "DET"]:
        del matches[i]

Hmm, the usability isn’t fantastic here. I think the problem is that the on_match callback is being called while looping over the matches list. Modifying the list during this loop leads to the i index being incorrect. You used to be able to add an acceptor function to the pattern, but I removed this because I thought the on_match callback could be used for the same thing. Now I see that this is actually quite awkward.

What if instead of deleting the match, you replace it with a null value, such as matches[i] = (None, None, None). Then you would need to intercept the match list to drop these out as a post-process.
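
A rough sketch of that workaround, reusing the names from the snippets above ('PROPN'/'DET' assumed here; use whatever labels your model actually produces):

def filter_det_pos(matcher, doc, i, matches):
    match_id, start, end = matches[i]
    if doc[start].pos_ in ('PROPN', 'DET'):
        matches[i] = (None, None, None)  # null the entry instead of deleting it

matcher.add(ADJ_NOUN_SPLIT_ERROR, filter_det_pos, *word_split_adj_noun_abs)
matches = [m for m in matcher(doc) if m[0] is not None]  # post-process: drop the nulled entries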

@ines and @honnibal I have started marking examples. However, based on the patterns I am able to create, I still only find about 8 true cases for every 100 examples. So the training data that comes out is fairly skewed. After 500 examples I still only have about 43 true cases of word split errors.

Of course, training on these 43 cases the model did not perform well. Basically, it achieved about 86% accuracy by simply rejecting everything. So I need much more data. However, do you have any suggestions for how to deal with data that is this skewed? Should I exclude some of the rejected examples to even out the ratio?

@ohenrik It’s tough to advise confidently since I don’t know the specifics of the errors you’re identifying — so I’m worried I’ll suggest something that can’t work. So, take this with a grain of salt.

One option is to try data augmentation, if you think there’s a good way to generalize the pattern. The risk with data augmentation is that the model will learn your generation pattern, and exploit regularities in it rather than having to learn the distribution you’re actually interested in.

For instance, let’s say you were trying to find subject/verb agreement problems in English — so errors like “they was”. Instead of looking for natural cases of this error, we could introduce it. We could either transform grammatical text so it’s ungrammatical, or take the natural ungrammatical text, and swap one singular inflected verb for another. Hopefully the model would learn that if the subject is plural, any singular-marked verb is incorrect, while if you only have one example, the model doesn’t know whether the rule only applies to that verb.

Sometimes data augmentation can work well, but other times, it’s too hard to generate the interesting things you’d like your model to learn. For instance, let’s say you’re trying to find English text with incorrect use of “the”. Correct article usage in English is difficult precisely because the rules are really hard to pin down. If you just delete random “the” instances, you won’t generate errors that are at all like the ones people really make. The same is true if you add “the” before arbitrary noun phrases. So, sometimes generating convincing examples is almost as hard as identifying the errors in the first place.

Another alternative would be to work harder at your patterns. It might be that if you could just identify some class of noun, you could dramatically change your false positive rate. This would work especially well if the noun class could be identified based on the word vectors — then you could use terms.teach to get the noun class, and then add a set-membership flag to the lexicon, so you can use the flag in your matcher rules.
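
A sketch of that last idea, assuming the noun class is exported from terms.teach as a plain word list (noun_class.txt is a placeholder). Here the lexeme flag is just used to filter the matches, rather than inside the pattern itself:

import spacy
from spacy.matcher import Matcher

nlp = spacy.load('your_nb_model')

# Hypothetical word list, e.g. exported from a terms.teach dataset
noun_class = set(line.strip() for line in open('noun_class.txt', encoding='utf8'))
IS_NOUN_CLASS = nlp.vocab.add_flag(lambda text: text.lower() in noun_class)

matcher = Matcher(nlp.vocab)
matcher.add('BAD_GRAMMAR', None, [{'POS': 'NOUN'}, {'POS': 'NOUN'}])

doc = nlp(SOME_NORWEGIAN_TEXT)
matches = [(m_id, s, e) for m_id, s, e in matcher(doc)
           if doc[s].check_flag(IS_NOUN_CLASS)]  # keep matches starting with a noun-class word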

If neither of these work, then yes you could consider the distribution of examples in your dataset. I would suggest adding extra copies of the positive examples in preference to removing negative examples. If you double the positive examples, it’s sort of like making two passes over the data, with different negative examples in each pass. This is better than making two passes with the same negative examples.
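
If you go that route, one way to do it is directly in the batch-train recipe, assuming the standard 'answer' field on each example (ideally applied to the training portion only, after the evaluation examples have been split off):

accepts = [eg for eg in examples if eg.get('answer') == 'accept']
examples = examples + accepts  # double the positive examples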

When training with such small datasets, make sure you’re using pre-trained word vectors, and set the dropout very high. In your case, you should also take care to pass low_data=False to the text classifier. The low_data setting selects a model architecture with fewer hyper-parameters, which is often better when there are very few examples. However, the low_data=True architecture doesn’t have the convolutional layers, so the features are strictly unigram bag-of-words, which is a very poor fit for the grammar-based classification you’re interested in!

Thank you for a nice list of tips! :slight_smile: I started quickly by simply doubling the number of accepted answers, since it was a quick fix compared to some of the other approaches. And oh boy did it work! I jumped from about 0.19 F-score and 0.91 accuracy up to about 0.88 F-score and 0.97 accuracy :slight_smile:

I will try some of the other approaches as well. I think I can fairly quickly take some of the sentences that were suggested and modify them so that they would have been positive (accept). That might give the model some useful input.

Regarding the low_data flag: how do I set this? It does not seem to be an option through the Prodigy command line for batch-train. Is that solely for the teach command?

At the moment I use this command for batch training:

prodigy textcat.batch-train noun_noun_split_error_mod ./data/models/small7/model7 --output ./data/models/grammar3 -n 60

Also, regarding the output directory (grammar3): if I want to add another, different model to that, do I need to use grammar3 as the input model and then create a new output directory (grammar4), or can I just set the output dir to the same model (i.e. add to it)?

Wohoo :tada: Nice to hear that it's working well so far! (You should consider writing a blog post about this once you're done btw – this is a really cool project and I'm sure a post about it would be very popular.)

The low_data flag is an attribute of the TextClassifier and currently not exposed via the command line. By default, textcat.batch-train should set it to True if you're training with < 1000 examples. But you can also change it manually by editing the recipe.
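
For reference, the relevant line in the recipe looks like the one below, so hard-coding the flag is a one-line change:

model = TextClassifier(nlp, labels, long_text=long_text,
                       low_data=False)  # instead of low_data=len(examples) < 1000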

In theory, you could probably do that – but I wouldn't recommend it, because it'll overwrite your model. So if something goes wrong during training or you're not happy with the results, you'll lose the previous state of the model. (You can obviously always retrain with the same settings, but it'd still be frustrating.)

@ines thank you! I’ll consider writing a blog post about the project when it’s done :slight_smile:

At the moment I have about 2700 original accept/reject labeled examples. The problem is that it seems like the model takes the word (ORTH) too much into account when making predictions, meaning it will rely on the word more than on the tag of the word. Example:

# Both this and the one below should get a high probability
# as both are true cases of word split errors.
>>> nlp("vil du ha tomat supe").cats
{'NOUN_NOUN_SPLIT_ERROR': 0.021993301808834076}

# However "saus" is a word that exists in the original set of accepted answers.
# So this gets a much higher score.  
>>> nlp("vil du ha tomat saus").cats
{'NOUN_NOUN_SPLIT_ERROR': 0.6813089847564697}

So the sentences here are “I want tomato soup” and “I want tomato sauce”. Both of these should be marked as NOUN_NOUN_SPLIT_ERROR, since in Norwegian they should be “tomatsuppe” (tomato soup) and “tomatsaus” (tomato sauce). The problem here is that “saus” (sauce) is a word used in an accepted example. The original example is “pizza saus” (pizza sauce).

EDIT: In addition, the presence of a word that was previously part of an accepted answer weighs heavily on the score. This example is not a word split error:

>>> nlp("vil du ha pizza og bil").cats
{'NOUN_NOUN_SPLIT_ERROR': 0.14954939484596252}

(“Do you want pizza and car”.) As you can see, this gets a higher score than the true case above, which does not contain the previously used word “pizza”.

I think the initial boost in F-score caused by the doubling of accepted answers was due to the fact that the same words were used, rather than the model learning from their tags.

Adding more data seems to work; however, we are now at about 3200 examples and the learning curve is still increasing between 75% and 100% of the data used.

Do you have any suggestions on how to shift the focus, so that the model relies less on ORTH and more on the TAGs?

Based on the answer from @honnibal above, I will try to create more examples (data augmentation). However, I think the problem might be that if I create accepted sentences where the two words that should be combined vary but the rest of the sentence does not, the algorithm will recognize the sentence rather than the tags of the words in question.

Edit: Could using too many iterations result in something similar to p-hacking while training? I have in many cases seen better results when increasing the number of iterations to 50 or 60. However, I suspect this will result in overfitting due to the number of times I use the test data.

@ohenrik The default TextCategorizer model doesn’t use tags at all! Here’s a model which does include the TAG attribute in the features. I threw DEP in as well.

# coding: utf8
from __future__ import unicode_literals

# These imports are copied from spacy._ml -- not all are necessary.
import numpy
from thinc.v2v import Model, Maxout, Softmax, Affine, ReLu
from thinc.i2v import HashEmbed, StaticVectors
from thinc.t2t import ExtractWindow, ParametricAttention
from thinc.t2v import Pooling, sum_pool
from thinc.misc import Residual
from thinc.misc import LayerNorm as LN
from thinc.api import add, layerize, chain, clone, concatenate, with_flatten
from thinc.api import FeatureExtracter, with_getitem, flatten_add_lengths
from thinc.api import uniqued, wrap, noop
from thinc.linear.linear import LinearModel
from thinc.neural.ops import NumpyOps, CupyOps
from thinc.neural.util import get_array_module, copy_array
from thinc.neural._lsuv import svd_orthonormal
from thinc.neural.optimizers import Adam

from thinc import describe
from thinc.describe import Dimension, Synapses, Biases, Gradient
from thinc.neural._classes.affine import _set_dimensions_if_needed
import thinc.extra.load_nlp

from spacy.attrs import ID, ORTH, LOWER, NORM, PREFIX, SUFFIX, SHAPE, TAG, DEP
from spacy import util

def build_text_classifier(nr_class, width=64, **cfg):
    nr_vector = cfg.get('nr_vector', 5000)
    pretrained_dims = cfg.get('pretrained_dims', 0)
    with Model.define_operators({'>>': chain, '+': add, '|': concatenate,
                                 '**': clone}):

        # Define a vector table that embeds values from each of these columns. These are the columns used
        # By default in spaCy.
        cols = [ORTH, LOWER, PREFIX, SUFFIX, SHAPE, ID] 
        lower = HashEmbed(width, nr_vector, column=cols.index(LOWER))
        prefix = HashEmbed(width//2, nr_vector, column=cols.index(PREFIX))
        suffix = HashEmbed(width//2, nr_vector, column=cols.index(SUFFIX))
        shape = HashEmbed(width//2, nr_vector, column=cols.index(SHAPE))
        #### Define tables for our new features
        cols.extend([TAG, DEP])
        tag = HashEmbed(64, 500, column=cols.ndex(TAG))
        dep = HashEmbed(64, 500, column=cols.index(DEP))
        
        
        # Add the tag and dep features to the token embedding.
        # Note that there's an important change here from the model definition within spaCy.
        # spaCy's models wrap the vectors in the function `uniqued()`, which caches the vector
        # constructed for each word type in a batch. This works because if a word has the same ORTH,
        # it must necessarily have the same PREFIX, SUFFIX, SHAPE, etc. But this isn't true for TAG and DEP
        # so we must not cache the vectors.
        # Overall this layer takes a Doc object, extracts numeric IDs with doc.to_array(), embeds each ID into a vector using a separate table per ID, concatenates the vectors, and then uses a Maxout layer to reduce the dimensionality back down. Layer normalization is applied after the Maxout operation.
        vectors = (
            FeatureExtracter(cols)
            >> with_flatten(
                    (lower | prefix | suffix | shape | tag | dep) 
                    >> LN(Maxout(width, width+(width//2)*3 + 64 + 64)) 
            )
        )
        # Here we add features from pre-trained vectors as well.
        static_vectors = (
            SpacyVectors
            >> with_flatten(Affine(width, pretrained_dims))
        )
        vectors = concatenate_lists(vectors, static_vectors)
        vectors_width = width*2
        
        # Now that we have our word representations, we pass them through the CNN. You might want to try changing this to be deeper --- that would give you more context.
        model = (
            vectors
            >> with_flatten(
                LN(Maxout(width, vectors_width))
                >> Residual(
                    (ExtractWindow(nW=1) >> LN(Maxout(width, width*3)))
                ) ** 2, pad=2
            >> flatten_add_lengths
            >> ParametricAttention(width)
            >> Pooling(sum_pool)
            >> Residual(zero_init(Maxout(width, width)))
            >> zero_init(Affine(nr_class, width, drop_factor=0.0))
            >> logistic
        )
    # Set the output dimension for the model, for future reference.
    model.nO = nr_class
    return model

I haven’t tested this yet, so I hope I haven’t messed something up.

The layers of indirection in Prodigy go like this. We’re creating a prodigy.models.TextClassifier object, which calls into our nlp object. The nlp object needs a pipe under the name "textcat". By default that’s mapped to the spacy.pipeline.TextCategorizer class. That class needs an attribute .model, which will hold the output of the function above.

The spacy.pipeline.TextCategorizer class can be given a model argument on initialization. Alternatively, you can subclass it and override its Model classmethod to return the model instance you want. Another way is to simply assign directly, like nlp.get_pipe('textcat').model = build_text_classifier().

Thank you for the example. I’m trying to implement it now, but a quick question: what is nr_class?

I was thinking about using this method:

nlp.get_pipe('textcat').model = build_text_classifier()

However, it requires that I provide nr_class, but I have no idea what this argument does or is supposed to be.

EDIT: I think I figured out that nr_class is the number of classes to output. Correct? So that should be 1 in this case (only categorizing NOUN_NOUN_SPLIT_ERROR).

EDIT 2:

There were some imports missing to get the model working. I used these:
from spacy._ml import concatenate_lists, SpacyVectors, zero_init, logistic

And a typo in the code (a missing parenthesis):

def build_text_classifier(nr_class, width=64, **cfg):
    nr_vector = cfg.get('nr_vector', 5000)
    pretrained_dims = cfg.get('pretrained_dims', 0)
    
    with Model.define_operators({'>>': chain, '+': add, '|': concatenate,
                                 '**': clone}):

        # Define a vector table that embeds values from each of these columns. These are the columns used
        # By default in spaCy.
        cols = [ORTH, LOWER, PREFIX, SUFFIX, SHAPE, ID]
        lower = HashEmbed(width, nr_vector, column=cols.index(LOWER))
        prefix = HashEmbed(width//2, nr_vector, column=cols.index(PREFIX))
        suffix = HashEmbed(width//2, nr_vector, column=cols.index(SUFFIX))
        shape = HashEmbed(width//2, nr_vector, column=cols.index(SHAPE))
        #### Define tables for our new features
        cols.extend([TAG, DEP])
        tag = HashEmbed(64, 500, column=cols.index(TAG))
        dep = HashEmbed(64, 500, column=cols.index(DEP))


        # Add the tag and dep features to the token embedding.
        # Note that there's an important change here from the model definition within spaCy.
        # spaCy's models wrap the vectors in the function `uniqued()`, which caches the vector
        # constructed for each word type in a batch. This works because if a word has the same ORTH,
        # it must necessarily have the same PREFIX, SUFFIX, SHAPE, etc. But this isn't true for TAG and DEP
        # so we must not cache the vectors.
        # Overall this layer takes a Doc object, extracts numeric IDs with doc.to_array(), embeds each ID into a vector using a separate table per ID, concatenates the vectors, and then uses a Maxout layer to reduce the dimensionality back down. Layer normalization is applied after the Maxout operation.
        vectors = (
            FeatureExtracter(cols)
            >> with_flatten(
                    (lower | prefix | suffix | shape | tag | dep)
                    >> LN(Maxout(width, width+(width//2)*3 + 64 + 64))
            )
        )
        # Here we add features from pre-trained vectors as well.
        static_vectors = (
            SpacyVectors
            >> with_flatten(Affine(width, pretrained_dims))
        )
        vectors = concatenate_lists(vectors, static_vectors)
        vectors_width = width*2

        # Now that we have our word representations, we pass them through the CNN. You might want to try changing this to be deeper --- that would give you more context.
        model = (
            vectors
            >> with_flatten(
                LN(Maxout(width, vectors_width))
                >> Residual(
                    (ExtractWindow(nW=1) >> LN(Maxout(width, width*3)))
                ) ** 2, pad=2)
            >> flatten_add_lengths
            >> ParametricAttention(width)
            >> Pooling(sum_pool)
            >> Residual(zero_init(Maxout(width, width)))
            >> zero_init(Affine(nr_class, width, drop_factor=0.0))
            >> logistic
            )

    # Set the output dimension for the model, for future reference.
    model.nO = nr_class
    return model

After running this for a few iterations, it seems like I have failed to attach the model correctly, as the model never improves. So this is not working:

model = TextClassifier(nlp, labels, long_text=long_text,
                       low_data=len(examples) < 1000)
nlp.get_pipe('textcat').model = build_text_classifier(1)

Here is the complete batch classifier recipe (nothing except one line is changed):

@recipe('textcat.batch-train-tag',
        dataset=recipe_args['dataset'],
        input_model=recipe_args['spacy_model'],
        output_model=recipe_args['output'],
        lang=recipe_args['lang'],
        factor=recipe_args['factor'],
        dropout=recipe_args['dropout'],
        n_iter=recipe_args['n_iter'],
        batch_size=recipe_args['batch_size'],
        eval_id=recipe_args['eval_id'],
        eval_split=recipe_args['eval_split'],
        long_text=("Long text", "flag", "L", bool),
        silent=recipe_args['silent'])
def batch_train(dataset, input_model=None, output_model=None, lang='en',
                factor=1, dropout=0.2, n_iter=10, batch_size=10,
                eval_id=None, eval_split=None, long_text=False, silent=False):
    """
    Batch train a new text classification model from annotations. Prodigy will
    export the best result to the output directory, and include a JSONL file of
    the training and evaluation examples. You can either supply a dataset ID
    containing the evaluation data, or choose to split off a percentage of
    examples for evaluation.
    """
    log("RECIPE: Starting recipe textcat.batch-train", locals())
    DB = connect()
    print_ = get_print(silent)
    random.seed(0)
    if input_model is not None:
        nlp = spacy.load(input_model, disable=['ner'])
        print_('\nLoaded model {}'.format(input_model))
    else:
        nlp = spacy.blank(lang, pipeline=[])
        print_('\nLoaded blank model')
    examples = DB.get_dataset(dataset)
    labels = {eg['label'] for eg in examples}
    labels = list(sorted(labels))

    model = TextClassifier(nlp, labels, long_text=long_text,
                           low_data=len(examples) < 1000)
   
    # This is where the change is!!!!
    nlp.get_pipe('textcat').model = build_text_classifier(1)
   

    log('RECIPE: Initialised TextClassifier with model {}'
        .format(input_model), model.nlp.meta)
    random.shuffle(examples)
    if eval_id:
        evals = DB.get_dataset(eval_id)
        print_("Loaded {} evaluation examples from '{}'"
               .format(len(evals), eval_id))
    else:
        examples, evals, eval_split = split_evals(examples, eval_split)
        print_("Using {}% of examples ({}) for evaluation"
               .format(round(eval_split * 100), len(evals)))
    random.shuffle(examples)
    examples = examples[:int(len(examples) * factor)]
    print_(printers.trainconf(dropout, n_iter, batch_size, factor,
                              len(examples)))
    if len(evals) > 0:
        print_(printers.tc_update_header())
    best_acc = {'accuracy': 0}
    best_model = None
    if long_text:
        examples = list(split_sentences(nlp, examples))
    for i in range(n_iter):
        loss = 0.
        random.shuffle(examples)
        for batch in cytoolz.partition_all(batch_size,
                                           tqdm.tqdm(examples, leave=False)):
            batch = list(batch)
            loss += model.update(batch, revise=False, drop=dropout)
        if len(evals) > 0:
            with nlp.use_params(model.optimizer.averages):
                acc = model.evaluate(tqdm.tqdm(evals, leave=False))
                if acc['accuracy'] > best_acc['accuracy']:
                    best_acc = dict(acc)
                    best_model = nlp.to_bytes()
            print_(printers.tc_update(i, loss, acc))
    if len(evals) > 0:
        print_(printers.tc_result(best_acc))
    if output_model is not None:
        if best_model is not None:
            nlp = nlp.from_bytes(best_model)
        msg = export_model_data(output_model, nlp, examples, evals)
        print_(msg)
    return best_acc['accuracy']

Hm. Are you using the longtext mode? If so, I see I used the name ‘doccat’ for the pipeline component, not textcat. That could be the problem — try this:


model.get_pipe().model = build_text_classifier(1)

This makes sure we get the right component from the nlp object.

If you’re not using long text mode, and it still doesn’t work, try this recipe:

import spacy.pipeline

class GrammarTextClassifier(spacy.pipeline.TextCategorizer):
    @classmethod
    def Model(cls, nr_class=1, width=64, **cfg):
        print("Building custom textcat model")
        return build_text_classifier(nr_class=nr_class, width=width, **cfg)

@recipe('textcat.batch-train-tag',
        dataset=recipe_args['dataset'],
        input_model=recipe_args['spacy_model'],
        output_model=recipe_args['output'],
        lang=recipe_args['lang'],
        factor=recipe_args['factor'],
        dropout=recipe_args['dropout'],
        n_iter=recipe_args['n_iter'],
        batch_size=recipe_args['batch_size'],
        eval_id=recipe_args['eval_id'],
        eval_split=recipe_args['eval_split'],
        long_text=("Long text", "flag", "L", bool),
        silent=recipe_args['silent'])
def batch_train(dataset, input_model=None, output_model=None, lang='en',
                factor=1, dropout=0.2, n_iter=10, batch_size=10,
                eval_id=None, eval_split=None, long_text=False, silent=False):
    """
    Batch train a new text classification model from annotations. Prodigy will
    export the best result to the output directory, and include a JSONL file of
    the training and evaluation examples. You can either supply a dataset ID
    containing the evaluation data, or choose to split off a percentage of
    examples for evaluation.
    """
    log("RECIPE: Starting recipe textcat.batch-train", locals())
    DB = connect()
    print_ = get_print(silent)
    random.seed(0)
    if input_model is not None:
        nlp = spacy.load(input_model, disable=['ner'])
        print_('\nLoaded model {}'.format(input_model))
    else:
        nlp = spacy.blank(lang, pipeline=[])
        print_('\nLoaded blank model')
    examples = DB.get_dataset(dataset)
    labels = {eg['label'] for eg in examples}
    labels = list(sorted(labels))

    # This is where the change is!!!!
    # We're telling spaCy what function to call to create a component named 'textcat'.
    nlp.factories['textcat'] = lambda nlp, **cfg: GrammarTextClassifier(nlp.vocab, **cfg)

    model = TextClassifier(nlp, labels, long_text=long_text,
                           low_data=len(examples) < 1000)
   
    log('RECIPE: Initialised TextClassifier with model {}'
        .format(input_model), model.nlp.meta)
    random.shuffle(examples)
    if eval_id:
        evals = DB.get_dataset(eval_id)
        print_("Loaded {} evaluation examples from '{}'"
               .format(len(evals), eval_id))
    else:
        examples, evals, eval_split = split_evals(examples, eval_split)
        print_("Using {}% of examples ({}) for evaluation"
               .format(round(eval_split * 100), len(evals)))
    random.shuffle(examples)
    examples = examples[:int(len(examples) * factor)]
    print_(printers.trainconf(dropout, n_iter, batch_size, factor,
                              len(examples)))
    if len(evals) > 0:
        print_(printers.tc_update_header())
    best_acc = {'accuracy': 0}
    best_model = None
    if long_text:
        examples = list(split_sentences(nlp, examples))
    for i in range(n_iter):
        loss = 0.
        random.shuffle(examples)
        for batch in cytoolz.partition_all(batch_size,
                                           tqdm.tqdm(examples, leave=False)):
            batch = list(batch)
            loss += model.update(batch, revise=False, drop=dropout)
        if len(evals) > 0:
            with nlp.use_params(model.optimizer.averages):
                acc = model.evaluate(tqdm.tqdm(evals, leave=False))
                if acc['accuracy'] > best_acc['accuracy']:
                    best_acc = dict(acc)
                    best_model = nlp.to_bytes()
            print_(printers.tc_update(i, loss, acc))
    if len(evals) > 0:
        print_(printers.tc_result(best_acc))
    if output_model is not None:
        if best_model is not None:
            nlp = nlp.from_bytes(best_model)
        msg = export_model_data(output_model, nlp, examples, evals)
        print_(msg)
    return best_acc['accuracy']

I don’t think I’m using longtext mode. Also, my text is one sentence at a time.

Btw, using the code you provided, there is no difference between this and the normal text classifier. Also, the print statement is never executed, so it seems like the code is just ignored/not used.

If I use model.get_pipe().model = build_text_classifier(1), the result is different (so the custom builder is used), but then I’m stuck for a long time not converging on anything. Might it be that when I do this, the correct model is used for the calculation but not when evaluating the results, meaning that the model can never get better, since what it uses to learn differs from what it uses to measure its success?

Btw, if I let the model run for a long time (many iterations), it eventually starts returning almost the same results as if I weren’t using a custom text classification model. It might do this because it never gets anywhere using the tags, and eventually ends up dropping the weights for TAG and DEP to 0.