Can we bring back --seeds for textcat.teach?

@claycwardell As Ryan said, thanks for the feedback on this :heart:. I agree that we should get the videos and tutorials updated; we're working on that.

While I don't want to rule out that it's a simpler problem, here's a bit of context around why matching everything up with earlier versions isn't always easy.

Machine learning techniques have kept developing over the time we've been building Prodigy, and we've obviously wanted to keep things moving forward. However, sometimes a change that's better overall is worse for a particular dataset, or even for a whole workflow.

I'll give you a quick sketch of the main development trend that's mattered here, going back into much deeper history than is really necessary :slightly_smiling_face:.

Early versions of spaCy (pre v2.0) used non-neural-network statistical models, which relied on boolean indicator features weighted by linear models. The linear models actually performed fairly well for English and were very fast. The downsides included high memory usage, poor cross-domain transfer, and the inability to use transfer learning.
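
To make "boolean indicator features weighted by a linear model" concrete, here's a minimal sketch of that style of model. It uses scikit-learn purely for illustration; it's not how spaCy v1 was actually implemented, just the general shape of the approach:

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Each example is a bag of boolean indicator features: a feature is either
# present or absent, and the linear model learns one weight per
# (feature, class) pair.
def indicator_features(text):
    return {f"word={w}": True for w in text.lower().split()}

train_texts = ["great fast shipping", "terrible slow delivery"]
train_labels = ["POSITIVE", "NEGATIVE"]

model = make_pipeline(DictVectorizer(), LogisticRegression())
model.fit([indicator_features(t) for t in train_texts], train_labels)
print(model.predict([indicator_features("fast delivery")]))
```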

Transfer learning is a big advantage in Prodigy's context, because it greatly reduces the total number of annotations needed to reach a given level of accuracy. The simplest type of transfer learning is pretrained word vectors. Over the last couple of years, transfer learning of contextual word representations has also come to work extremely well.
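
As a tiny illustration of why pretrained vectors help (assuming a spaCy package with vectors such as en_core_web_md is installed; this isn't the text classifier itself, just the intuition): words the annotator never labelled can still be close in vector space to words they did label, so the model can generalise from far fewer annotations.

```python
import spacy

# Requires a model with vectors, e.g.: python -m spacy download en_core_web_md
nlp = spacy.load("en_core_web_md")

great = nlp("great")[0]
# A word you annotated ("great") is close to words you never annotated.
print(nlp("excellent")[0].similarity(great))  # high similarity
print(nlp("awful")[0].similarity(great))      # lower similarity
```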

However, a neural network with transfer learning behaves very differently from an indicator-based linear model at the start of training. Neural networks start from a random initialization, and it takes a few batches of updates before they become useful. Optimizers also work best if you let them take large steps at the beginning of training. There's certainly a large literature on online learning, where the cost of every update matters, but the architectures and optimizers there are different, so it's difficult to reuse that work directly.
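
Just to illustrate the "large steps at the beginning" point — this isn't Prodigy's actual optimizer configuration, only a toy schedule showing why this is awkward for active learning:

```python
# Toy decaying learning-rate schedule: big steps early, smaller steps later.
# In an active-learning loop, the very first annotations arrive while the step
# size is still large and the network weights are still near-random, so the
# earliest suggestions from a freshly initialized network aren't reliable.
def learning_rate(step, base_lr=0.001, decay=1e-4):
    return base_lr / (1.0 + decay * step)

for step in (0, 1_000, 100_000):
    print(step, round(learning_rate(step), 6))
```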

Beta versions of Prodigy were developed against spaCy v1.0, but we've been using the neural network models since the earliest releases. To make the textcat.teach recipe work, the trick I developed was to use an ensemble. The ensemble combines predictions from a linear model and a neural network. The idea is that the linear model learns quickly from its sparse features: it starts off knowing nothing at all, but once it has seen a (word, category) pair in an annotated example, it will reliably suggest that category for other examples containing the word. So you get this nice responsiveness at the beginning. Over time, the neural network model then takes over.
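
Here's a hypothetical sketch of that ensemble idea — not Prodigy's internal code, just the behaviour described above. The `linear_model` and `neural_model` arguments are placeholders for any two models exposing `predict()` (a score in [0, 1]) and `update()`:

```python
class TeachEnsemble:
    """Mix two models' scores, shifting weight toward the neural model over time."""

    def __init__(self, linear_model, neural_model, handover=2000):
        self.linear = linear_model
        self.neural = neural_model
        self.handover = handover  # roughly how many updates until the neural model dominates
        self.n_updates = 0

    def predict(self, example):
        # Weight on the neural model grows from 0 toward 1 as updates accumulate,
        # so the fast-learning linear model drives suggestions early on.
        alpha = min(1.0, self.n_updates / self.handover)
        return (1 - alpha) * self.linear.predict(example) + alpha * self.neural.predict(example)

    def update(self, example, label):
        self.linear.update(example, label)
        self.neural.update(example, label)
        self.n_updates += 1


class DummyModel:
    """Stand-in for a real model: always returns the same score."""

    def __init__(self, score):
        self.score = score

    def predict(self, example):
        return self.score

    def update(self, example, label):
        pass


ensemble = TeachEnsemble(DummyModel(0.9), DummyModel(0.2), handover=10)
print(ensemble.predict("some text"))   # early: dominated by the linear stand-in
for _ in range(10):
    ensemble.update("some text", "POSITIVE")
print(ensemble.predict("some text"))   # later: dominated by the neural stand-in
```

A fixed schedule is only one way to do the handover; weighting each model by its own confidence would give a similar effect. The point is just that the linear model carries the early suggestions while the neural network warms up.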

The problem is that it's difficult to make sure these dynamics play out as intended, robustly across a range of datasets, as spaCy versions develop and the model architectures continue to change. It's a difficult thing to unit test.

Again, I don't want to rule out a simpler problem, where it's something about labels not lining up, or an outright bug in the matching code, etc. But it's also possible that the machine learning dynamics are simply working a bit worse than they used to, or that they only work on some problems but not on others.

For Prodigy v2, we want to treat recipes like textcat.teach, which need specific models and experimentation, differently from how we've been handling them. We should break these recipes out and define their machine learning architectures together with the recipe, ideally with project files that run the necessary experiments. This will make things much more reproducible, and let users customize things per use case much more easily.
