Hmm.
Given the relative infrequency of classifiable paragraphs, how can I get around the problem of the classifier always predicting that a class is not present (and thus not finding positive examples to annotate)?
How much do you know about the target paragraphs? If you can come up with some keywords, maybe you can use them to bootstrap the classifier, so that you can get some positive examples into it at the start. Once you have some positive examples, I think the active learning should be able to help on this type of problem, because the model will be most unsure on examples where it's predicting the minority class.
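For instance, in Prodigy you can put your seed keywords into a patterns file and pass it to `textcat.teach`, so paragraphs matching the keywords get queued alongside the model's suggestions. A rough sketch — the label name, keywords, dataset and file names here are all placeholders for whatever fits your data:

```
# patterns.jsonl — one match pattern per seed keyword
{"label": "RELEVANT", "pattern": [{"lower": "acquisition"}]}
{"label": "RELEVANT", "pattern": [{"lower": "merger"}]}
```

```
prodigy textcat.teach my_dataset en_core_web_sm paragraphs.jsonl --label RELEVANT --patterns patterns.jsonl
```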
Otherwise, you might find that it's actually more efficient to do the first few documents in a PDF reader or something, just so that it's easier to scroll around and skim to find the key paragraph. That might be a bit faster than enqueuing the paragraphs in Prodigy.
Am I likely to be better off using three distinct classes, or a single “A or not A” binary class, and then using the resulting probabilities to decide (e.g. 1.0 -> A, 0.0 -> B, 0.5 -> mixed)?
I'd be tempted to try doing it as two problems: irrelevant vs relevant, and then A, B or mixed. The A, B, mixed classifier would only run on the paragraphs marked relevant. This might help with the class imbalance a bit, because you get to group together all relevant paragraphs. On the other hand, if the relevant paragraphs for A are very different from the relevant paragraphs for B, this approach might be worse.
The hierarchical approach might also be worse if there are actually subtle clues in the "irrelevant" paragraphs that make one class more or less likely. Even if these paragraphs are usually irrelevant, they do have a lot of text, and if you average over all that information you might find it trends towards one class or another pretty strongly. So, it might be that throwing all those "irrelevant" paragraphs away hurts performance.
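If you want to prototype the two-stage setup outside Prodigy first, the structure is just two classifiers chained together. Here's a minimal sketch using scikit-learn as a stand-in model — the toy data and label names are invented, so substitute your own:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy stand-in data -- replace with your annotated paragraphs.
paragraphs = ["about topic a", "about topic b", "boilerplate text", "page footer"]
is_relevant = [1, 1, 0, 0]
topics = ["A", "B"]  # labels for the relevant paragraphs; real data would also have "MIXED"

# Stage 1: relevant vs. irrelevant, trained on all paragraphs.
relevance = make_pipeline(TfidfVectorizer(), LogisticRegression())
relevance.fit(paragraphs, is_relevant)

# Stage 2: A / B / mixed, trained only on the relevant paragraphs.
relevant = [p for p, r in zip(paragraphs, is_relevant) if r]
topic = make_pipeline(TfidfVectorizer(), LogisticRegression())
topic.fit(relevant, topics)

def classify(paragraph):
    """Return None for irrelevant paragraphs, else per-class scores."""
    if not relevance.predict([paragraph])[0]:
        return None
    # Scores come back in topic.classes_ order.
    return topic.predict_proba([paragraph])[0]

print(classify("about topic a"))
```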
Is it generally better to allow each paragraph a vote, or to have a second model that learns to pick out the paragraph?
If you're doing the hierarchical approach, you probably want to just sum up the score vectors assigned to all the relevant paragraphs, and then renormalize. If you've got a flat model, you could actually just run it over the whole document, even though it's trained on the paragraphs.
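Concretely, the sum-and-renormalize step for the hierarchical version might look like this (the per-paragraph scores below are invented for illustration):

```python
import numpy as np

# Hypothetical score vectors from the A/B/mixed classifier, one row per
# paragraph judged relevant; columns in the classifier's class order.
paragraph_scores = np.array([
    [0.7, 0.2, 0.1],
    [0.6, 0.1, 0.3],
    [0.1, 0.8, 0.1],
])

# Sum the score vectors across relevant paragraphs, then renormalize
# so the document-level scores form a distribution again.
doc_scores = paragraph_scores.sum(axis=0)
doc_scores /= doc_scores.sum()
print(doc_scores)  # -> approx. [0.467 0.367 0.167] (order: A, B, mixed)
```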