Document classification on large articles.

I am looking to classify articles into one of three classes – A, B, or mixed. A and B are mutually exclusive. The articles are long (potentially up to 20 pages), but of those that can be classified as A or B, a single paragraph quite clearly makes that distinction. The mixed case is not obvious.

The most natural way to do this seems to be classifying on paragraphs, but I am not sure how best to then combine the results. Is it generally better to allow each paragraph a vote, or to have a second model that learns to pick out the paragraph?

Questions:

  • Am I likely to be better off using three distinct classes, or a single “A or not A” binary class, then using the resulting probabilities to decide (e.g. 1.0 -> A, 0.0 -> B, 0.5 -> mixed)?
  • Given the relative infrequency of classifiable paragraphs, how can I get around the problem of the classifier always predicting that a class is not present (and thus not finding positive examples to annotate)?
  • Generally, what is a good starting point for dealing with problems like this?

Hmm.

Given the relative infrequency of classifiable paragraphs, how can I get around the problem of the classifier always predicting that a class is not present (and thus not finding positive examples to annotate)?

How much do you know about the target paragraphs? If you can come up with some keywords, maybe you can use them to bootstrap the classifier, so that you can get some positive examples into it at the start. Once you have some positive examples, I think the active learning should be able to help on this type of problem, because the model will be most unsure on examples where it's predicting the minority class.
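For example, before any model exists you could surface likely positives by scanning the paragraphs with spaCy's `PhraseMatcher`. This is only a sketch: the keyword list and the `paragraphs` iterable are placeholders, and in Prodigy itself the same idea is roughly what passing a patterns file to `textcat.teach` gives you.

```python
# Minimal keyword-bootstrapping sketch with spaCy's PhraseMatcher.
# The keywords are placeholders; substitute the cues you know about.
import spacy
from spacy.matcher import PhraseMatcher

nlp = spacy.blank("en")
matcher = PhraseMatcher(nlp.vocab, attr="LOWER")
keywords = ["example cue phrase", "another keyword"]  # placeholder cues
matcher.add("RELEVANT", [nlp.make_doc(kw) for kw in keywords])

def candidate_paragraphs(paragraphs):
    """Yield paragraphs containing at least one keyword match."""
    for text in paragraphs:
        if matcher(nlp.make_doc(text)):
            yield text
```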

Otherwise, you might find that it's actually more efficient to do the first few documents in a PDF reader or something, just so that it's easier to scroll around and skim to find the key paragraph. That might be a bit faster than enqueuing the paragraphs in Prodigy.

Am I likely to be better off using three distinct classes, or a single “A or not A” binary class, then using the resulting probabilities to decide (E.g. 1.0 -> A, 0.0 -> B, 0.5 -> mixed)?

I'd be tempted to try doing it as two problems: irrelevant vs relevant, and then A, B or mixed. The A, B, mixed classifier would only run on the paragraphs marked relevant. This might help with the class imbalance a bit, because you get to group together all relevant paragraphs. On the other hand, if the relevant paragraphs for A are very different from the relevant paragraphs for B, this approach might be worse.

The hierarchical approach might also be worse if there are actually subtle clues in the "irrelevant" paragraphs that make one class more or less likely. Even if these paragraphs are usually irrelevant, they do have a lot of text, and if you average over all that information you might find it trends towards one class or another pretty strongly. So, it might be that throwing all those "irrelevant" paragraphs away hurts performance.
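To make the two-stage idea concrete, here is a rough sketch. `relevance_model` and `abm_model` are hypothetical stand-ins for the two trained classifiers, each assumed to return a dict of label probabilities, and the 0.5 threshold is arbitrary.

```python
# Sketch of the hierarchical setup: a relevance gate, then an A/B/MIXED
# classifier that only sees the paragraphs that pass the gate.

def score_paragraphs(paragraphs, relevance_model, abm_model, threshold=0.5):
    """Return (paragraph, A/B/MIXED scores) pairs for relevant paragraphs only."""
    results = []
    for text in paragraphs:
        if relevance_model(text).get("RELEVANT", 0.0) < threshold:
            continue  # stage 1: skip paragraphs judged irrelevant
        results.append((text, abm_model(text)))  # stage 2: A/B/MIXED scores
    return results
```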

Is it generally better to allow each paragraph a vote, or to have a second model that learns to pick out the paragraph?

If you're doing the hierarchical approach, you probably want to just sum up the score vectors assigned to all the relevant paragraphs, and then renormalize. If you've got a flat model, you could actually just run it over whole documents, even though it's trained on paragraphs.
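In code, the combination step could look like this minimal sketch; `paragraph_scores` is assumed to be the list of A/B/MIXED probability dicts predicted for the paragraphs marked relevant (for instance, the stage-2 scores from the sketch above).

```python
# Minimal sketch: sum the per-paragraph score vectors and renormalize to get
# a document-level distribution. The label names are assumptions.

def combine_scores(paragraph_scores):
    totals = {"A": 0.0, "B": 0.0, "MIXED": 0.0}
    for scores in paragraph_scores:
        for label, prob in scores.items():
            totals[label] += prob
    norm = sum(totals.values())
    if norm == 0.0:
        return None  # no relevant paragraphs were found; needs manual review
    return {label: value / norm for label, value in totals.items()}
```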

A reasonable amount. I have made some patterns that show very good agreement (say 90% of the time) with the paragraphs that decide the classification. However, I can only make the patterns work if I explicitly alter the token stream to require that a matched pattern is present. If I don't do this, I never see positive examples. Any idea why that might be? I assume it is because the ratio of positive to negative examples is so skewed (there are ~250k paragraphs from 5k documents, so roughly 5k relevant to 245k not relevant).

I actually just parsed the document and rendered the whole thing in Prodigy -- labeling the whole document as A/B/Mixed is pretty fast, and will be necessary to validate whatever approach I use. I think 300 examples took less than an hour. How did you imagine these being used? Just as validation, or directly for training?

Is there a way to do this directly with Prodigy? Or is the second dataset constructed from paragraphs marked as relevant by the model trained on the first?

Hmm, which version of Prodigy are you running? We fixed a bug around this in I think 1.6 or 1.7, so if you're on an earlier version try updating.

It could still be the data skew, but I would expect the patterns to work, so I'm a little suspicious.

I did think they could be used for training, to try to get the active learning started.

There isn't direct support, no. A simple way to do it is to train the first model, and then run it to create a dataset for the second task.
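As a sketch of that hand-off, assuming the first model is a spaCy text classifier saved to disk (the paths, label name, and threshold below are just assumptions):

```python
# Run the stage-1 relevance model over all paragraphs and write the ones it
# marks relevant to a JSONL file, which can then be used as the input source
# for annotating the second (A/B/mixed) task.
import spacy
import srsly

nlp = spacy.load("relevance_model")                # assumed model path
paragraphs = srsly.read_jsonl("paragraphs.jsonl")  # {"text": "..."} per line

relevant = (
    eg for eg in paragraphs
    if nlp(eg["text"]).cats.get("RELEVANT", 0.0) >= 0.5
)
srsly.write_jsonl("relevant_paragraphs.jsonl", relevant)
```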