Reclassifying text fragments with custom NER

Hi,

I have a use case where my own model predicts text fragments that carry an NER-like meaning, e.g.:

I enjoyed, surprisingly, doing this a lot.

The fragment enjoy this will be marked as an entity called “enjoyment phrase”.

But it is very rule-based and runs into lots of problems with contextual negations and similar cases. For example, it will catch a false positive:

I never enjoy this.

I’d like to use Prodigy to go through my labeled data and correct it to account for contextual cues like negations.

How do I do this? Can I load the pre-trained data and reclassify it somehow, so that Prodigy can discern my corrections from the original classifications and then propagate the corrections over the rest of the set? Or do I use text classification and try to learn a classifier outputting a “True positive” label? In the second approach, how do I pass the selected fragments to the model in the data?

Best,
Piotr

Well, this isn't really Named Entity Recognition (NER). The parts you're trying to recognise aren't contiguous phrases, and actually the syntactic structure can be pretty complicated. You might find it easier to use the dependency parse for writing these rules.

Here's how the sentence you gave would be parsed: displaCy Dependency Visualizer · Explosion. You can play with the API for this here: Linguistic Features · spaCy Usage Documentation.
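For instance, a negation rule over the dependency parse can be sketched like this. The parse below is hand-encoded to keep the example self-contained; in practice you'd run `nlp(text)` and read each token's `.dep_` and `.head` instead.

```python
# Sketch: detect a negated trigger verb using the dependency parse.
# The parse dicts are hand-encoded stand-ins for what spaCy's parser
# produces; with spaCy you'd read token.dep_ and token.head directly.

def is_negated(tokens, trigger_index):
    """True if any token attaches to the trigger with the 'neg' relation."""
    return any(t["dep"] == "neg" and t["head"] == trigger_index for t in tokens)

# "I never enjoy this." -- roughly the parse spaCy produces
parse = [
    {"text": "I",     "dep": "nsubj", "head": 2},
    {"text": "never", "dep": "neg",   "head": 2},
    {"text": "enjoy", "dep": "ROOT",  "head": 2},
    {"text": "this",  "dep": "dobj",  "head": 2},
    {"text": ".",     "dep": "punct", "head": 2},
]

print(is_negated(parse, 2))  # trigger "enjoy" is at index 2 -> True
```

The point is that the negation cue attaches directly to the trigger verb in the parse, so the rule stays simple even when extra words sit between them in the surface string.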

I'm worried that your classification scheme probably isn't very well defined. Like, what exactly will be an "enjoyment phrase"? Consider the following examples:

  • I enjoy that
  • That is enjoyed by me
  • I enjoy doing that
  • Doing that is enjoyed by me
  • That is enjoyable
  • It is enjoyable to do that
  • Doing that makes me happy
  • I'm happy when I'm doing that
  • I was happy, because I did that.
  • I did that. I became happy.
  • etc.

If you don't have a linguistically precise definition of what counts and what doesn't, you won't be able to annotate accurately --- let alone replicate those annotations in a machine learning model.

I would suggest collecting a set of trigger words (fun, enjoy, etc). If the words are ambiguous (i.e. some word is sometimes a trigger, sometimes not) you can use the NER model to predict a context-specific label. You'd collect annotations for that task.
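As a starting point, the trigger-word pass can be as simple as scanning for lexicon matches and emitting candidate spans. The trigger list and the ENJOYMENT label below are placeholders for your own scheme; the `{"text": ..., "spans": [...]}` shape follows Prodigy's task format, so the output could be fed to an annotation recipe for accept/reject decisions.

```python
# Sketch: generate candidate annotation tasks from a trigger-word lexicon.
# TRIGGERS and the ENJOYMENT label are hypothetical placeholders.

TRIGGERS = {"enjoy", "enjoyed", "enjoyable", "fun"}

def make_task(text, label="ENJOYMENT"):
    """Mark every trigger-word occurrence as a candidate span."""
    spans = []
    offset = 0
    for word in text.split():
        start = text.index(word, offset)
        end = start + len(word)
        offset = end
        if word.lower().strip(".,!?") in TRIGGERS:
            spans.append({"start": start, "end": end, "label": label})
    return {"text": text, "spans": spans}

task = make_task("I never enjoy this.")
print(task["spans"])  # [{'start': 8, 'end': 13, 'label': 'ENJOYMENT'}]
```

For genuinely ambiguous triggers, you'd then collect accept/reject annotations on these candidates and train a model to make the context-specific decision.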

I would advise against building negation into the definition of the trigger word. You should annotate that separately. Otherwise, the model that has to learn the trigger words will have a very difficult task: it's trying to learn two pieces of information jointly, even though the negation word might be arbitrarily far from the trigger word.
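Concretely, the suggestion is to keep the two decisions as separate annotations rather than folding negation into the span label. The label names here are hypothetical, but the dicts follow the same `"spans"` shape as above:

```python
# Two ways to annotate "I never enjoy this." The first folds negation into
# the span label, forcing one model to learn both facts jointly; the second
# keeps trigger and negation as separate annotations. Labels are hypothetical.

joint = {
    "text": "I never enjoy this.",
    "spans": [{"start": 8, "end": 13, "label": "NEGATED_ENJOYMENT"}],
}

separate = {
    "text": "I never enjoy this.",
    "spans": [
        {"start": 8, "end": 13, "label": "ENJOYMENT"},  # trigger span
        {"start": 2, "end": 7,  "label": "NEGATION"},   # negation cue, its own annotation
    ],
}
```

With the second scheme, the trigger model only has to learn the trigger vocabulary, and negation can be handled by a separate rule or model (e.g. the dependency-based check).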

If this is an important or commercially valuable project, I would suggest trying to find some annotators with linguistic experience (e.g. at least a few undergraduate courses in syntax) to help you make sure your annotation scheme makes sense.