Hello! Prodigy has been a very useful tool for taking the first steps toward some working models. I’m wondering if I could clarify a couple of potentially basic questions in the context of the insults classifier tutorial.
In this case, you chose to operate on the full text of Reddit posts, and to use `textcat`. You also used a training schema in which you looked at the full context of the post, and whether it was really an insult directed at someone, vs. talking about something insulting/offensive.
Let’s say the problem is slightly larger/more complex, and you want to note the presence or absence of an insult, and also the presence or absence of a compliment.
This leads me to two complementary questions:
- I know `ner` is typically used for recognizing things like people, organizations, locations, etc. Is there a case for using `ner` in Prodigy to extract the insult (e.g., “you are awful”) or the compliment (e.g., “you are lovely”) as an entity?
- In your example, you label/classify the entire post as an insult or not. Once you get to multiple labels, is it more sensible to do this at a more granular level if `textcat` is the right route? I’m imagining a post could have two labels (e.g., “You are truly awful. But you are very smart.”), in which case both labels could be applied. If this were a suitable problem for `ner`, it could potentially separately identify those phrases, but `textcat` might take longer to identify which is the insult and which is the compliment. So perhaps breaking the input down into sentences and classifying that way is the best way to train?
I know often the answer depends on the output, so in this case the ideal output would be a model that can take a chunk of text (either a sentence or as long as a paragraph) and identify whether insults or compliments are present, and ideally also which sentences they occur in.
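To make that ideal output concrete, here is a rough sketch of the shape I have in mind. It is plain Python and does not use Prodigy or spaCy: `toy_scores` is a hypothetical keyword-based stand-in for a trained multilabel text classifier, `split_sentences` is a naive regex splitter rather than a real sentence segmenter, and the label names `INSULT` and `COMPLIMENT` are just my own choices.

```python
import re
from typing import Callable, Dict, List

# Label names are assumptions made for this illustration.
LABELS = ("INSULT", "COMPLIMENT")


def split_sentences(text: str) -> List[str]:
    # Naive splitter: break on whitespace that follows ., ! or ?.
    # A real pipeline would use a proper sentence segmenter instead.
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def toy_scores(sentence: str) -> Dict[str, float]:
    # Hypothetical stand-in for a trained classifier: returns one
    # independent score per label (the labels are not mutually exclusive).
    lowered = sentence.lower()
    return {
        "INSULT": 1.0 if "awful" in lowered else 0.0,
        "COMPLIMENT": 1.0 if ("smart" in lowered or "lovely" in lowered) else 0.0,
    }


def label_text(
    text: str,
    score_fn: Callable[[str], Dict[str, float]] = toy_scores,
    threshold: float = 0.5,
) -> dict:
    # Score each sentence separately, then roll the sentence-level labels
    # up into a presence/absence flag per label for the whole text.
    per_sentence = []
    for sent in split_sentences(text):
        scores = score_fn(sent)
        labels = [lab for lab in LABELS if scores.get(lab, 0.0) >= threshold]
        per_sentence.append({"text": sent, "labels": labels})
    present = {
        lab: any(lab in s["labels"] for s in per_sentence) for lab in LABELS
    }
    return {"present": present, "sentences": per_sentence}


result = label_text("You are truly awful. But you are very smart.")
```

For the example post above, `result["present"]` flags both labels, and `result["sentences"]` shows which sentence carries which one. Whether the real scores should come from sentence-level `textcat` or from `ner` spans is exactly what I am unsure about.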
Thanks for any guidance you can provide on this!