Do I need to use two models?

Hi! I’m trying to build a text category classifier for JIRA tickets. I found some good advice in the “Document classification on large articles” thread, and split the task into two separate operations:

  1. Train a binary classifier to separate out text typed by humans from log files, error messages, etc.
  2. Train a binary classifier that determines whether the ‘human’ information output by the first model might be about a product I’m interested in.

Right now I have this as two separate models, with the output from the first being passed through the second as part of the prediction workflow. Is this the right approach, or is there some better way I can do this?
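For reference, the prediction flow currently looks roughly like the sketch below. The classifier names are just placeholders for my two fitted models, which expose a scikit-learn-style `predict()` and were trained with 1 meaning “human-typed” / “product-related”:

```python
# Rough sketch of the two-stage prediction workflow.
# `human_text_clf` and `product_clf` are placeholder names for two
# already-trained binary classifiers with a scikit-learn-style
# predict() interface (label 1 = human-typed / product-related).

def classify_ticket_lines(lines, human_text_clf, product_clf):
    results = []
    for line in lines:
        # Stage 1: is this text typed by a human (vs. logs, tracebacks, etc.)?
        if human_text_clf.predict([line])[0] != 1:
            results.append((line, "machine"))
            continue
        # Stage 2: only human-written text is checked for product relevance.
        is_product = product_clf.predict([line])[0] == 1
        results.append((line, "product" if is_product else "human_other"))
    return results
```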

I think that sounds like a reasonable way to structure it, especially for annotation efficiency. Ideally you could also add some rules in stage 1 to filter out the most obviously machine-generated log lines before the model ever sees them.
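For instance, a handful of cheap regex rules can catch the most obvious machine-generated lines ahead of the stage-1 classifier. The patterns below are purely illustrative and would need tuning to whatever actually shows up in your tickets:

```python
import re

# Illustrative patterns for obviously machine-generated lines; adjust
# these to the log formats that actually appear in your JIRA tickets.
MACHINE_PATTERNS = [
    re.compile(r"^\d{4}-\d{2}-\d{2}[ T]\d{2}:\d{2}:\d{2}"),  # timestamped log lines
    re.compile(r"^\s+at\s+[\w$.]+\(.*\)$"),                  # Java stack frames
    re.compile(r"\b(DEBUG|INFO|WARN|ERROR|FATAL)\b.*:"),     # log-level prefixes
    re.compile(r"^Traceback \(most recent call last\):"),    # Python tracebacks
]

def looks_machine_generated(line: str) -> bool:
    return any(pattern.search(line) for pattern in MACHINE_PATTERNS)

def prefilter(lines):
    # Only lines that survive the rules go on to the stage-1 classifier.
    return [line for line in lines if not looks_machine_generated(line)]
```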

Ultimately it’s an empirical question, though: you could at some point try training a single model with two to four classes. The combined model might be just as accurate, while being easier to deploy and debug. On the other hand, if efficiency is a concern, you could keep model 1 fast and only apply a more expensive model 2 to the subset of lines it selects. For annotation, though, I do think your process sounds good.
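If you do try the single-model route, a flat three-class setup is easy to sketch. TF-IDF plus logistic regression is just one reasonable baseline here, and the label names and example texts are made up for illustration:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical label scheme that replaces the two binary models with one
# multi-class model: MACHINE, HUMAN_OTHER, HUMAN_PRODUCT.
texts = [
    "2023-05-01 12:00:03 ERROR Connection refused",
    "The export button does nothing when I click it",
    "Can we change the meeting to Thursday?",
]
labels = ["MACHINE", "HUMAN_PRODUCT", "HUMAN_OTHER"]

model = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2), min_df=1),
    LogisticRegression(max_iter=1000),
)
model.fit(texts, labels)  # in practice, fit on your annotated ticket lines
print(model.predict(["NullPointerException at com.example.Foo.bar"]))
```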