Yay, that's nice to hear! Your workflow looks really good so far, so definitely keep us updated about the results. Answers below:
Yes, that sounds like a good plan. How well this will work obviously depends on the data you have, but being able to pre-train a model is always nice, since you won't have to deal with the cold-start problem.
Yes, the problem in your case is that you don't have any negative examples – and Prodigy is optimised to train from binary data and sparse annotations. So the model here simply learned that "every label is correct", which is true – but obviously not generalisable.
One solution would be to add negative examples, e.g. by swapping out the labels. But you'll probably find it more efficient to just use spaCy directly – here's a simple code example. (In spaCy v2.0, all components share the same training API, so you can also take inspiration from the other examples.) Once you have a pre-trained model that predicts something, you can load it with textcat.teach and keep improving it on new data.
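A minimal sketch of how that pre-training could look in spaCy v2.0 – the second label OVERIG, the example texts and the output path are placeholders for your own data:

import random
import spacy

# set up a blank Dutch pipeline with only a text classifier
nlp = spacy.blank('nl')
textcat = nlp.create_pipe('textcat')
nlp.add_pipe(textcat)
textcat.add_label('WONEN')
textcat.add_label('OVERIG')

# placeholder training examples – replace with your own texts and categories
train_data = [
    ('Artikel over de woningmarkt', {'cats': {'WONEN': 1.0, 'OVERIG': 0.0}}),
    ('Artikel over iets anders', {'cats': {'WONEN': 0.0, 'OVERIG': 1.0}}),
]

optimizer = nlp.begin_training()
for i in range(10):
    random.shuffle(train_data)
    losses = {}
    for text, annotations in train_data:
        nlp.update([text], [annotations], sgd=optimizer, losses=losses)
    print('Losses:', losses)

nlp.to_disk('/path/to/textcat-model')  # placeholder output path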
Yes, that's all correct. We usually write all our labels in caps, e.g. WONEN, but this is only a stylistic thing and doesn't actually matter.
The terms dictionary could be very useful to bootstrap more training data and select examples from very large corpora. The textcat.teach recipe supports a --patterns argument that can point to a JSONL file of patterns that look like this:
{"label": "GERMANY", "pattern": [{"lower": "berlin"}]}
{"label": "USA", "pattern": "New York"}
Each pattern can either be a list of dictionaries, with one dictionary describing a token and its attributes (just like the patterns for spaCy's rule-based Matcher), or an exact string. Using the patterns, you can give examples of words that are likely indicators of a category (e.g. texts including "berlin" are likely about Germany). You may come across false positives, too – but this is good, because you also want your model to learn about those cases.
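For example, assuming the patterns above are saved to a file patterns.jsonl, a textcat.teach command could look like this (the dataset name and source file are placeholders):

prodigy textcat.teach topics_dataset nl_core_news_sm news.jsonl --label GERMANY,USA --patterns patterns.jsonl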
If you're working with Dutch text, you probably want to start off with the small Dutch model, nl_core_news_sm
. If you don't care about the other components (tagger, parser, NER) and only want to train the text classifier, you can also just save out a "blank" model instead:
import spacy

nlp = spacy.blank('nl')  # blank Dutch pipeline with no pre-trained components
nlp.to_disk('/path/to/blank-nl-model')  # save it out (placeholder path)
Prodigy's annotation recipes can take the name of a model package or a path to a model directory – so you can simply pass in the directory containing the pre-trained model that you saved out.
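For instance, if you saved the pre-trained classifier from the sketch above to /path/to/textcat-model, the command could look like this (the dataset and source names are again placeholders):

prodigy textcat.teach wonen_dataset /path/to/textcat-model news.jsonl --label WONEN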