Capacity of Prodigy

For textcat.teach with patterns and active learning, what's the maximum number of instances Prodigy can handle?
Is there any limitation for the size of text in each instance?
Is there any limitation for the number of patterns for textcat.teach?
Is there any limitation of Prodigy we should be aware of if we plan to use it to annotate a large set of data?
Thank you.

What exactly do you mean by instance? Spinning up the server and starting Prodigy? There's not really a limit here – if anything, it'd be limited by the machine you're running it on and the number of ports you're able to open. It's unlikely that you're going to hit problems here.

For textcat.teach with a model in the loop, one thing to keep in mind is that each instance will load the model and keep its own copy of it in memory. So if you're starting off with base models with word vectors etc. and are running several instances, you may run out of memory if your machine doesn't have enough RAM.

Not really. By default, Prodigy will read the input data as a stream if the file format allows it, so you're only ever loading a batch at a time and never have to load all your data into memory. That's also why we recommend JSONL as a data format: it's super flexible and can be read in line by line, so your stream can technically be infinite.
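To illustrate the streaming idea, here's a minimal sketch of how a JSONL file can be read line by line as a generator, so only one task at a time is ever in memory (the function name and the demo file are made up for the example; this isn't Prodigy's internal loader):

```python
import json
import tempfile
from typing import Iterator

def stream_jsonl(path: str) -> Iterator[dict]:
    # Yield one annotation task per line, so memory use stays
    # constant no matter how large the file is.
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                yield json.loads(line)

# Tiny demo: write two tasks to a temp file and stream them back.
with tempfile.NamedTemporaryFile("w", suffix=".jsonl", delete=False) as tmp:
    tmp.write('{"text": "first example"}\n{"text": "second example"}\n')
tasks = list(stream_jsonl(tmp.name))
```

Because the generator never materializes the whole file, the same pattern works on inputs of any size – including streams that technically never end.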

If the question is about text per annotation task: There's not really a limit either, although it's typically more efficient to annotate smaller chunks at a time, like sentences or paragraphs. Even if you're doing document classification, your model probably averages over sentence or paragraph predictions anyway, so there's really no point in making annotators read whole documents at a time and only collecting one datapoint per document (with much higher potential for human error).
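A simple way to get those smaller chunks is to split each document into paragraph-level tasks up front, keeping the parent document ID in the "meta" block so labels can be aggregated later. A sketch (the splitting heuristic and field names are made up for the example):

```python
from typing import Iterator

def paragraph_tasks(doc_id: str, text: str) -> Iterator[dict]:
    # Split on blank lines; each paragraph becomes its own task,
    # tagged with the parent document so predictions can be
    # averaged back up to the document level.
    for i, para in enumerate(text.split("\n\n")):
        para = para.strip()
        if para:
            yield {"text": para, "meta": {"doc_id": doc_id, "paragraph": i}}

tasks = list(paragraph_tasks("doc-1", "First paragraph.\n\nSecond paragraph.\n"))
```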

No, not really. It's all implemented via spaCy's matcher logic under the hood. Once you're in the tens of thousands for token-based patterns, or hundreds of thousands for string match patterns, matching may take a little bit longer (but not significantly). But this is also not a reasonable number of patterns to have, especially not during annotation.
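For reference, a patterns file is itself just JSONL, one pattern per line: token-based patterns use spaCy's Matcher syntax, while plain strings are matched as phrases. Something like this (the label and phrases are made up for the example):

```json
{"label": "SPORTS", "pattern": [{"lower": "world"}, {"lower": "cup"}]}
{"label": "SPORTS", "pattern": "premier league"}
```

So the same streaming logic applies here, too – the file is read in line by line.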

This isn't all specific to Prodigy, but here are some general suggestions. It might already be obvious to you, but I'm writing it down anyway (also in case others come across this thread later):

  • Store your data in a way / format that can be read as a stream. Either a database you can stream from (which you can then write a custom loader for), or a format like JSONL. If you're exporting data in intermediate formats to use for annotation, make sure to include metadata (like an internal database ID) that lets you map your annotations back to the original data later on.
  • Be clear about what your goals are: Do you want to use your large amounts of data to create a small, representative set and train a model? Do you want to annotate all records in your large dataset and create a corpus where every single text is labelled? (And if so, are you sure you need that?) Then choose the annotation workflow accordingly. Workflows like textcat.teach use the model in the loop to select examples to annotate and train from – but they're not going to be a good fit if you need every example annotated.
  • Make sure your label scheme and annotation manual are solid before you scale up. Ideally, spend some time with the data, train a prototype and do some error analysis. Prodigy should hopefully make it easy to do this. Even if the goal sounds super simple, there can be a lot of subtle ambiguities and specifics that make it more difficult for a model to learn the distinction. If you're not aware of these problems early and scale up, you can easily end up wasting lots of time and money on low quality data that you can't really use.

Thank you very much. Glad to hear that Prodigy has no scalability issues.