Training a grammar tool

@ohenrik It’s tough to advise confidently since I don’t know the specifics of the errors you’re identifying, and I’m worried I’ll suggest something that can’t work. So, take this with a grain of salt.

One option is to try data augmentation, if you think there’s a good way to generalize the pattern. The risk with data augmentation is that the model will learn your generation pattern, and exploit regularities in it rather than having to learn the distribution you’re actually interested in.

For instance, let’s say you were trying to find subject/verb agreement problems in English, i.e. errors like “they was”. Instead of looking for natural cases of this error, we could introduce it ourselves. We could either transform grammatical text so it’s ungrammatical (“they were” → “they was”), or take naturally ungrammatical text and swap one singular-inflected verb for another (“they was” → “they is”). Hopefully the model would then learn that any singular-marked verb is incorrect after a plural subject, whereas with only a single example, it can’t tell whether the rule applies only to that one verb.
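Here’s a rough sketch of what that corruption step could look like. The tiny verb tables and the swap rules are just assumptions for illustration, not anything from your pipeline:

```python
import random

# Hypothetical augmentation rules for subject/verb agreement errors.
# Both the verb tables and the swap logic are illustrative only.
PLURAL_TO_SINGULAR = {"were": "was", "are": "is", "have": "has"}
SINGULAR_FORMS = ["was", "is", "has"]

def corrupt(tokens):
    """Turn a grammatical sentence into an ungrammatical one."""
    return [PLURAL_TO_SINGULAR.get(tok.lower(), tok) for tok in tokens]

def vary_error(tokens):
    """Swap one singular-marked verb for a different singular-marked verb."""
    out = list(tokens)
    for i, tok in enumerate(out):
        if tok.lower() in SINGULAR_FORMS:
            out[i] = random.choice([v for v in SINGULAR_FORMS if v != tok.lower()])
            break
    return out

print(corrupt(["they", "were", "late"]))    # ['they', 'was', 'late']
print(vary_error(["they", "was", "late"]))  # e.g. ['they', 'is', 'late']
```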

Sometimes data augmentation can work well, but other times, it’s too hard to generate the interesting things you’d like your model to learn. For instance, let’s say you’re trying to find English text with incorrect use of “the”. Correct article usage in English is difficult precisely because the rules are really hard to pin down. If you just delete random “the” instances, you won’t generate errors that are at all like the ones people really make. The same is true if you add “the” before arbitrary noun phrases. So, sometimes generating convincing examples is almost as hard as identifying the errors in the first place.

Another alternative would be to work harder on your patterns. It might be that if you could just identify some class of noun, you could dramatically reduce your false positive rate. This would work especially well if the noun class can be identified from the word vectors: then you could use terms.teach to build up the noun class, add a set-membership flag to the lexicon, and use the flag in your matcher rules.
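Here’s a minimal sketch of the flag idea, assuming the spaCy 2.x Vocab.add_flag / Token.check_flag API. The vehicle noun set is just a stand-in for whatever word list your terms.teach session produces:

```python
import spacy

# Minimal sketch (assumes spaCy 2.x): mark a noun class in the lexicon with a
# custom binary flag. VEHICLE_NOUNS is a made-up stand-in for the word list
# you'd get out of a terms.teach session.
nlp = spacy.load("en_core_web_lg")

VEHICLE_NOUNS = {"car", "truck", "bicycle", "bus"}
IS_VEHICLE = nlp.vocab.add_flag(lambda text: text.lower() in VEHICLE_NOUNS)

doc = nlp("The bus were late again.")
for token in doc:
    if token.check_flag(IS_VEHICLE):
        print(token.text, "is in the noun class")
```

How you consult the flag from your matcher rules will depend on your spaCy version; worst case, you can check it inside the on-match callback.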

If neither of these works, then yes, you could consider rebalancing the distribution of examples in your dataset. I would suggest adding extra copies of the positive examples in preference to removing negative examples. If you double the positive examples, it’s sort of like making two passes over the data, with different negative examples in each pass. This is better than making two passes with the same negative examples.
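In code, that kind of oversampling is trivial; the example texts below are made up:

```python
import random

# Made-up labelled examples: (text, is_error) pairs.
positives = [("they was late", True), ("the dogs barks", True)]
negatives = [
    ("they were late", False),
    ("the dog barks", False),
    ("she has arrived", False),
    ("we are ready", False),
]

# Keep all the negatives, but double the positives before shuffling.
train_data = positives * 2 + negatives
random.shuffle(train_data)
```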

When training with such small datasets, make sure you’re using pre-trained word vectors, and set the dropout very high. In your case, you should also take care to pass low_data=False to the text classifier. The low_data setting selects a model architecture with fewer hyper-parameters, which is often better when there are very few examples. However, the low_data=True architecture doesn’t have the convolutional layers, so the features are strictly unigram bag-of-words, which is a very poor fit for the grammar-based classification you’re interested in!
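For the training side, a minimal sketch might look like the following, assuming spaCy 2.x and a model that ships with pre-trained vectors (e.g. en_core_web_lg). The label name and the two toy examples are made up, and the Prodigy-specific low_data flag isn’t part of this snippet:

```python
import random
import spacy

# Minimal training sketch (assumes spaCy 2.x and a vectors model such as
# en_core_web_lg). The label and the two toy examples are made up.
nlp = spacy.load("en_core_web_lg")
textcat = nlp.create_pipe("textcat")
textcat.add_label("AGREEMENT_ERROR")
nlp.add_pipe(textcat, last=True)

train_data = [
    ("they was late", {"cats": {"AGREEMENT_ERROR": 1.0}}),
    ("they were late", {"cats": {"AGREEMENT_ERROR": 0.0}}),
]

other_pipes = [pipe for pipe in nlp.pipe_names if pipe != "textcat"]
with nlp.disable_pipes(*other_pipes):
    optimizer = nlp.begin_training()
    for epoch in range(10):
        random.shuffle(train_data)
        losses = {}
        for text, annotations in train_data:
            # Very high dropout to compensate for the tiny dataset.
            nlp.update([text], [annotations], sgd=optimizer, drop=0.5,
                       losses=losses)
        print(epoch, losses)
```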