Entity Recognition vs. Text Classification

Hello! Prodigy has been a very useful tool for taking the first steps toward building some working models. I’m wondering if I could clarify a couple of potentially basic questions in the context of the insults classifier tutorial.

In this case, you chose to operate on the full text of Reddit posts, and to use textcat. You also used a training schema in which you looked at the full context of the post, and whether it was really an insult directed at someone, vs. talking about something insulting/offensive.

Let’s say the problem is slightly larger/more complex, and you want to note the presence or absence of an insult, and also the presence or absence of a compliment.

This leads me to two complementary questions:

  1. I know ner is typically used for recognizing things like people, organizations, locations, etc. Is there a case for using ner in Prodigy to extract the insult (e.g., “you are awful”) or the compliment (e.g., “you are lovely”) as an entity?

  2. In your example, you label/classify the entire post as an insult or not. Once you get to multiple labels, is it more sensible to do this at a more granular level if textcat is the right route? I’m imagining a post could have two labels (e.g., “You are truly awful. But you are very smart.”), in which case both labels could be applied. If this were a suitable problem for ner, it could potentially separately identify those phrases, but textcat might take longer to identify which is the insult and which is the compliment. So perhaps breaking the input down into sentences and classifying that way is the best way to train?

I know often the answer depends on the output, so in this case the ideal output would be a model that can take a chunk of text (either a sentence or as long as a paragraph) and identify whether insults or compliments are present, and ideally also which sentences they occur in.

Thanks for any guidance you can provide on this!


You can try, but I would mildly discourage this. The entity recognizer is driven by a state machine that reads the sentence left-to-right and decides when to open a new entity and when to close an existing one. This algorithm has the right sort of structural bias for picking up proper nouns, but probably the wrong sort of bias for arbitrary semantic categories like "INSULT".

The thing is, you'll never get consistency and clarity about where the insult starts and stops. Those boundary inconsistencies will make the category incredibly difficult for the model to learn. Another problem is that the most relevant words will often be verbs towards the middle of the span you're annotating, while the model is set up to look at the boundaries.

That's definitely a fair approach to try, if you can segment the text into sentences.
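As a rough illustration, if you go the sentence-level route, the loop could look something like the sketch below. It assumes you've already trained a spaCy pipeline whose text classifier has "INSULT" and "COMPLIMENT" labels (the model name here is made up):

```python
import spacy

# Hypothetical pipeline with a trained text classifier; the package name
# "en_insult_compliment" is just a placeholder for your own model.
nlp = spacy.load("en_insult_compliment")

text = "You are truly awful. But you are very smart."
doc = nlp(text)

for sent in doc.sents:
    # Classify each sentence on its own, so one post can carry both labels.
    sent_doc = nlp(sent.text)
    print(sent.text, sent_doc.cats)  # e.g. {"INSULT": 0.93, "COMPLIMENT": 0.02}
```

One caveat: `doc.sents` needs a component that sets sentence boundaries (the parser or a sentencizer), so you'd want one of those in the pipeline alongside the text classifier.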

Another idea for you: you might consider applying labels to single words, and then using the word labels as a basis for further logic. The downstream logic might be rule-based, or another statistical model.

For instance, you could start off by building a terminology list that marks whether a word can be the trigger word for an insult. You can cast a wide net in this phase and draw in words that are only sometimes insults when they're part of a longer phrase, even if that's a minority of the word's usage. We could call this the "potential triggers" list. Next, you could run ner.teach to mark whether a candidate is an actual trigger, which tells you there's an active insult in the sentence.
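To make that concrete, here's a small sketch of turning a "potential triggers" list into a patterns file you could pass to ner.teach with its --patterns option. The trigger words and label name here are placeholders, not a real terminology list:

```python
import json

# Placeholder "potential triggers" list; in practice this would come out of
# your terminology annotation and would be much longer.
potential_triggers = ["awful", "pathetic", "useless", "idiot"]

with open("insult_triggers.jsonl", "w", encoding="utf8") as f:
    for word in potential_triggers:
        # One token-based match pattern per line; "lower" makes the match
        # case-insensitive.
        line = {"label": "INSULT_TRIGGER", "pattern": [{"lower": word}]}
        f.write(json.dumps(line) + "\n")
```

Seeding ner.teach with a file like this means the suggestions you accept or reject start from your terminology list rather than from a cold model.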

At this point we'll have information about which word was used to make the decision that the sentence was an insult. You could then use the syntactic parse to determine the boundaries of the insult, by writing some rules. For instance, if the trigger is a noun, we probably want to get its leftward children as well, to make the longer phrase.
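As a sketch of what such a rule could look like in spaCy (the exact rules will depend on your data and the parses you get back):

```python
import spacy

nlp = spacy.load("en_core_web_sm")

def expand_trigger(trigger):
    """Rough boundary rule: if the trigger is a noun, pull in its leftward
    children (determiners, adjectives, etc.) to form the longer phrase."""
    if trigger.pos_ in ("NOUN", "PROPN"):
        left_children = [child.i for child in trigger.lefts]
        start = min(left_children) if left_children else trigger.i
        return trigger.doc[start : trigger.i + 1]
    # Otherwise just keep the trigger word itself.
    return trigger.doc[trigger.i : trigger.i + 1]

doc = nlp("You are a complete idiot.")
# Pretend "idiot" was marked as an actual trigger in the previous step.
trigger = [token for token in doc if token.text == "idiot"][0]
print(expand_trigger(trigger).text)  # "a complete idiot"
```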

The idea is to break the task down into subproblems, where each subproblem is very clearly defined, and determined by the local context. This makes the subproblems quick to annotate, and quick for a model to learn. You then compose the pieces with rules, to get suggestions for more abstract or detailed annotations. Once you have the system suggesting those annotations, you can accept/reject them, and build up a training set that might be large enough to try an end-to-end model.

This is extremely helpful. I appreciate your help in clarifying these questions!