Can we bring back --seeds for textcat.teach?

I understand that --patterns is supposed to be the more powerful version of --seeds. But my experience trying to use it, along with the experience of others documented here, suggests that it doesn't work as well in some cases. We know that --seeds worked pretty well, as it was the basis of this tutorial, where Ines shows she can get great results in only an hour or so using --seeds. When I try to emulate that tutorial using --patterns, by exporting the terms dataset to a patterns file, I run into the issues others have described: the matched terms don't show up often enough, leading to unbalanced, mostly negative labels, and eventually the model converges towards zero.

There are workarounds, but one simple possibility would be to bring back --seeds alongside the newer --patterns functionality and let users choose whichever works better for their task.

I wasn't around when this change was introduced (I believe it was in v1.4.0), but I think it's true that pattern files are more flexible. To quote the example from the docs:

```
{"pattern": [{"lemma": "acquire"}, {"pos": "PROPN"}], "label": "COMPANY_SALE"}
{"pattern": "acquisition", "label": "COMPANY_SALE"}
```

If we only had seed terms, we could only match on exact strings, but pattern files can also leverage lemmas, parts of speech, and other useful features from spaCy's token-based Matcher.
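To make that concrete, here's a minimal sketch of using spaCy's Matcher directly. Note that lemma/POS patterns like the first JSONL line above need a trained pipeline (e.g. en_core_web_sm) so tokens carry those attributes; the sketch below uses a blank pipeline and an exact-string pattern like the second line, and the example sentence is made up:

```python
import spacy
from spacy.matcher import Matcher

# A blank pipeline is enough for string-based patterns; lemma/POS
# patterns would need a trained pipeline so tokens have those attrs.
nlp = spacy.blank("en")
matcher = Matcher(nlp.vocab)

# Token-based pattern: match any token whose lowercase form is "acquisition"
matcher.add("COMPANY_SALE", [[{"LOWER": "acquisition"}]])

doc = nlp("The acquisition was announced on Monday.")
for match_id, start, end in matcher(doc):
    print(nlp.vocab.strings[match_id], doc[start:end].text)
```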

I'd like to understand your problem a bit better so that I may give advice. What are you trying to predict? Is there anything special about your patterns file? What behavior are you seeing and what did you expect?

@koaning thank you for being on top of my (many) questions.

I think the best way to illustrate the difficulty is to follow along with Ines' tutorial here, but with a patterns file instead of seeds. When you get to the part where she passes seeds to her classification model, you have to export your terms to a patterns file and pass that instead. What's interesting is that when I did this, I didn't get the same relevant labelling examples she got when she passed seeds. I only got a few insult examples in hundreds of non-insults when I tried.
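For anyone trying to reproduce this, here's a rough sketch of the kind of patterns file that export step produces. The seed terms and label below are made-up stand-ins for the tutorial's terms dataset, not the actual data:

```python
import json

# Hypothetical seed terms standing in for the terms dataset from the
# tutorial; the real list would come out of the Prodigy database.
seed_terms = ["idiot", "moron", "jerk"]

# Each line of the patterns file is one JSON object, shaped like the
# token-based patterns textcat.teach expects.
with open("insult_patterns.jsonl", "w", encoding="utf8") as f:
    for term in seed_terms:
        entry = {"label": "INSULT", "pattern": [{"lower": term.lower()}]}
        f.write(json.dumps(entry) + "\n")
```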

Maybe it's the seeds -> patterns transition, or maybe it's something else I'm doing wrong; a lot of the API has changed since she made that video. Maybe an easier ask, rather than bringing back --seeds, is this: can we get a binary textcat tutorial video or document that achieves a great model like Ines', using the current versions of Prodigy and spaCy? I could follow along from there and figure out what I'm doing wrong.

hi @claycwardell!

Thanks for your comment and your feedback! We greatly appreciate users' ideas and I've written an internal note for our engineering team. They'll take this into consideration as I think we may be rethinking the design of some of the recipes too.

If you're not aware, you can read the built-in recipes' Python source locally, so you could even experiment with them. To do this, run python -m prodigy stats and find your Location:. From there, look for the recipes folder.

Thanks for this feedback too! I'll also forward this to our community team so we can consider this in the new year.

@claycwardell As Ryan said, thanks for the feedback on this :heart:. I agree that we should get updates to the videos and tutorials, we're working on that.

While I don't want to rule out that it's a simpler problem, here's a bit of context around why matching everything up with earlier versions isn't always easy.

Machine learning techniques have continued to develop over the time we've been building Prodigy, and we've obviously wanted to keep things moving forward. However, sometimes a change that's better overall is worse for a particular dataset, or even for a whole workflow.

I'll give you a quick sketch of the main development trend that's mattered here, going back to way deeper history than is really necessary :slightly_smiling_face:.

Early versions of spaCy (pre-v2.0) used non-neural-network statistical models, which relied on boolean indicator features weighted by linear models. The linear models actually performed fairly well for English and were very fast. Downsides included high memory usage, poor cross-domain transfer, and the inability to use transfer learning.

Transfer learning is a big advantage in Prodigy's context, because it greatly reduces the total number of annotations needed to reach a given level of accuracy. The simplest type of transfer learning is pretrained word vectors. Over the last couple of years, transfer learning of contextual word representations has also been shown to work extremely well.

However, a neural network with transfer learning behaves very differently from an indicator-based linear model at the start of training. Neural networks start from a random initialization, and it takes a few batches of updates to move them toward something useful. Optimizers also work best if you let them take large steps at the beginning of training. There's certainly a big literature on online learning, where the cost of every update matters, but the architectures and optimizers there are different, so it's difficult to reuse that work here.

Beta versions of Prodigy were developed against spaCy v1.0, but we've been using the neural network models since the earliest releases. To make the textcat.teach recipe work, the trick I developed was to use an ensemble that combines predictions from a linear model and a neural network. The idea is that the linear model learns quickly from its sparse features: it starts off knowing nothing, but once it has seen an example of a (word, category) pair, it will reliably label new examples with that category. So you get nice responsiveness at the beginning, and over time the neural network takes over.
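To make that dynamic concrete, here's a toy sketch of the blending idea. The function name, ramp length, and linear interpolation are purely illustrative assumptions, not Prodigy's actual ensemble implementation:

```python
def ensemble_score(linear_score, nn_score, n_updates, ramp=500):
    # Weight shifts from the fast-learning linear model toward the
    # neural network as training progresses (purely illustrative).
    w = min(n_updates / ramp, 1.0)
    return (1 - w) * linear_score + w * nn_score

# Early in training, the confident linear model dominates the blend
print(ensemble_score(linear_score=0.9, nn_score=0.1, n_updates=50))
# After enough updates, the neural network's score takes over entirely
print(ensemble_score(linear_score=0.9, nn_score=0.1, n_updates=500))
```

The point of the ramp is exactly what's described above: early annotations get useful suggestions from the linear model while the randomly initialized network is still finding its feet.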

The problem is that it's difficult to make sure these dynamics play out as intended, robustly across a range of datasets, as spaCy versions develop and model architectures continue to adapt. It's a difficult thing to unit test.

Again, I don't want to rule out a simpler problem, where it's something about labels not lining up, or an outright bug in the matching code, etc. But it's also possible that the machine learning dynamics are simply working a bit worse than they used to, or that they only work on some problems but not on others.

For Prodigy v2, we want to treat recipes like textcat.teach that need specific models and experimentation differently from how we've been doing them. We should break these recipes out and define their machine learning architectures together with the recipe, ideally also with project files that do the necessary experimentation. This will make things much more reproducible, and let users customize things on a per-use-case basis much more easily.


Cool, thanks for the explanation. My main take-away is: spaCy and Prodigy have become more powerful, generalizable, and efficient since the early days of Prodigy, but that doesn't mean they work better for every single use case. And reading between the lines a bit, it seems that one of the use cases where the new versions of spaCy and Prodigy may not work as well as they used to is the subject of Ines' original textcat tutorial video, where you train a binary classifier on a relatively small amount of data using active learning.

Is that the correct interpretation here? If so, I'd have two main reactions:

  1. That's cool. That's how software works sometimes.
  2. A little more transparency around this fact, especially with regards to the docs and that original tutorial video, would be good.

We actually need to investigate more to make sure that this is the case for that specific tutorial. If that's the explanation, and it's not some sort of easily fixed problem, then yes we'll indeed update the docs and tutorial. I also agree that we should've looked more carefully at this sooner. Again, thanks for flagging it.
