This is a cool idea and actually a pretty nice use case for Prodigy. Your plan sounds reasonable and I’m pretty confident you’ll be able to train a classifier to detect BAD_GRAMMAR. But ultimately, it comes down to experimenting with different approaches.
I think the first step could be to narrow down the selection of examples you’re annotating. In order to train a classifier, you need enough “positive” examples to start with, so that the model can learn what you’re looking for (instead of only what you’re not looking for). You can include this logic in a custom recipe to filter your stream, or use it as a separate pre-processing script that you run over your corpus to extract data for annotation.
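For example, if you go the custom recipe route, the filtering part might look roughly like this – just a sketch with made-up names, assuming a matcher like the one described below:

def filter_bad_grammar(stream, nlp, matcher):
    # only yield incoming examples whose text contains at least one pattern match
    for eg in stream:
        doc = nlp(eg['text'])
        if matcher(doc):
            yield eg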
spaCy’s rule-based matcher is pretty powerful and lets you create patterns as a list of dictionaries, with one dictionary describing a token. You can specify one or more token attributes and their values – for example, part-of-speech tags, dependencies, the lowercase or exact text, the token’s shape, or boolean flags like IS_PUNCT, LIKE_NUM etc. Here are some possible example patterns:
bad_grammar_patterns = [
    [{'POS': 'NOUN'}, {'POS': 'NOUN'}],    # two consecutive nouns
    [{'LOWER': 'foo'}, {'LOWER': 'bar'}],  # wrong spelling of "foobar" as two words
    [{'ORTH': '.'}, {'IS_LOWER': True}]    # a period followed by a lowercase word
]
Just be creative and try out different options! Even if a pattern doesn’t always mean bad grammar, it can still be a good idea to include it. You’ll be annotating and accepting/rejecting the results anyways, and giving the model examples of combinations that are only sometimes correct in certain contexts might actually have a positive impact on accuracy. Also, when creating the patterns, make sure to double-check that spaCy tokenizes the text the way you think it does – especially if you’re including punctuation or more complex rules.
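A quick way to sanity-check the tokenization is to print the tokens of a few sample sentences – a minimal sketch, with a made-up example sentence:

import spacy

nlp = spacy.load('your_nb_model')
doc = nlp("Dette er f.eks. en setning.")
print([token.text for token in doc])  # see how punctuation and abbreviations are split

Once the patterns line up with the actual tokens, you can plug them into the Matcher: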
import spacy
from spacy.matcher import Matcher

nlp = spacy.load('your_nb_model')
matcher = Matcher(nlp.vocab)
matcher.add('BAD_GRAMMAR', None, *bad_grammar_patterns)

# do this for lots of documents
doc = nlp(SOME_NORWEGIAN_TEXT)
matches = matcher(doc)
If the Doc contains a match, you can add its text to your list of annotation examples, and save them out in a format like JSONL to annotate them with Prodigy. Instead of using the textcat.teach recipe straight away, you might want to try using the mark recipe first to go through the examples without the active learning component. Scoring and resorting the stream will be useful later on – but for now, you just want to annotate as many examples as possible based on your patterns:
prodigy mark nb_bad_grammar data.jsonl --label BAD_GRAMMAR --view-id classification
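To produce a file like data.jsonl in the first place, you could write out one task per matched text – roughly something like this (just a sketch, reusing the nlp and matcher objects from above; texts stands in for your raw Norwegian corpus):

import json

examples = []
for text in texts:  # texts = your raw corpus (hypothetical variable)
    doc = nlp(text)
    if matcher(doc):  # at least one bad grammar pattern matched
        examples.append({'text': doc.text})

with open('data.jsonl', 'w', encoding='utf8') as f:
    for eg in examples:
        f.write(json.dumps(eg) + '\n')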
Since the mark recipe will just render the task as it comes in, you could also add a "spans" property to the task to highlight the span of text matched by your list of bad grammar patterns. This can be useful to debug and improve the patterns, and makes it easier to see why that text was selected for annotation. The Matcher gives you the start and end token of each match, so you can create the spans programmatically:
spans = []
for match_id, start, end in matches:
    span = doc[start:end]  # the matched slice of the doc
    spans.append({'start': span.start_char, 'end': span.end_char})
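Putting it together, an individual task in your JSONL could then look something like this (just a sketch of the task format):

task = {
    'text': doc.text,
    'label': 'BAD_GRAMMAR',
    'spans': spans  # highlight the matched tokens on the annotation card
}
examples.append(task)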
Assuming the matcher rules are detailed enough and your data contains enough examples, you should be able to have a decent number of accepts for your “bad grammar” dataset to get over the “cold start problem”. Not all matches are going to be examples of bad grammar – but this is actually very good, and potentially lets you find ambiguities and exceptions to the rules that are important to learn as well.
Once you’ve annotated a decent number of examples (like, a few hundred, ideally with at least 50% accept), you can run textcat.batch-train and see what the model is learning so far. If the results look promising, you can also run textcat.train-curve to show training results after using 25%, 50%, 75% and 100% of the training data. If you see an increase in accuracy within the last 25% (i.e. between 75% and 100% of the data), it’s likely that the accuracy will improve further if you collect more examples similar to the existing training data.
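For example, the commands could look something like this – the base model, output path and flags are placeholders, so double-check the available arguments in the recipe docs for your Prodigy version:

prodigy textcat.batch-train nb_bad_grammar your_nb_model --output /path/to/model
prodigy textcat.train-curve nb_bad_grammar your_nb_model --n-samples 4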
The next step could be to export your trained textcat model and improve it by running it over raw, unfiltered data using textcat.teach. This means that you’re letting Prodigy use your pre-trained classifier to select examples from the stream, based on the BAD_GRAMMAR scores it assigns for each text. By default, Prodigy will select the examples your model is most uncertain about, i.e. the ones with a prediction closest to 50/50.
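Conceptually, “most uncertain” here just means scores closest to 0.5 – something like this (only an illustration of the idea, not Prodigy’s actual implementation):

def uncertainty(score):
    # 1.0 for a 50/50 prediction, 0.0 for a fully confident 0.0 or 1.0
    return 1.0 - abs(score - 0.5) * 2.0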
The model you specify on the command line can be a model package, or a path to a model data directory. So you can simply point it to the directory created by the previous batch-train step:
prodigy textcat.teach nb_bad_grammar /path/to/model raw_data.jsonl --label BAD_GRAMMAR
What happens next is hard to predict and depends on your data. If accuracy improves further after training on the additional examples generated with textcat.teach, you’re likely on the right track. If not, you might want to go back to the previous step, annotate more examples using the patterns and pre-train the model some more, before using the active learning-powered recipes.
I hope this was helpful so far. Good luck and definitely keep us updated on the progress!