Will NER improve Text Categorization?


I have a question that's perhaps about NLP in general more than Prodigy or spaCy specifically. I thought I'd try here since you folks seem so responsive and helpful, but please let me know if this isn't the right place to ask.

I was wondering: if I'm doing text categorization with spaCy, using textcat-multi for example, will the results improve if an NER component comes before it in the pipeline? My thinking is this: suppose a sentence like "Senior Javascript Developer" is categorized as, say, "A" (or any other category), and "Javascript" is then tagged as a "Programming Language" entity or similar. Would the textcat pick that up and use it to conclude that a sentence like "Python Engineer" is similar because of that entity? Assuming "Python" is also tagged as a "Programming Language" entity, of course.

My understanding is that the textcat component takes the tok2vec vectors and looks for similarity there, but will those vectors be similar in one or more dimensions if the entities found by NER are similar? Am I thinking about this the right way? And if it's at all possible, how would it work with spaCy and/or Prodigy?
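For context, here is roughly the kind of setup I have in mind: spaCy v3's listener pattern, where NER and textcat read from one shared tok2vec component, so both are trained against the same token representations. This is only an abbreviated sketch; the component names are my assumption and the other required config blocks (model architectures, training settings) are omitted:

```ini
[components.tok2vec]
factory = "tok2vec"

[components.ner]
factory = "ner"

# The NER model listens to the shared tok2vec component
[components.ner.model.tok2vec]
@architectures = "spacy.Tok2VecListener.v1"
upstream = "tok2vec"

[components.textcat_multilabel]
factory = "textcat_multilabel"

# The textcat model listens to the same shared embedding layer
[components.textcat_multilabel.model.tok2vec]
@architectures = "spacy.Tok2VecListener.v1"
upstream = "tok2vec"
```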

Thanks a bunch in advance, and do let me know if this isn't the right space to ask these questions!

Hi Valentijn,

spaCy questions are better asked on our GitHub Discussions board. The spaCy contributors also keep an eye on that forum, which is why I recommend going there.

In fact, your question seems partially answered there.


@valentijnnieman, just curious: were you able to get answers to your questions?

I'll likely have the same or a similar question in the near future.

@koaning thanks for pointing out the topic on GitHub. It's a bit hard for me to fully understand the answer at this point; I need to do more homework, as I'm just starting to dig into the topic.

Thank you.