Hi guys,
Quick question: I am training a new entity type (similar to ORG; it is actually the “currently employed org”, for staffing recruiters typing in notes like “Carl is currently at BoA” or “John is working at Home Depot”).
The training is going well, but the NER often stumbles on picking up the full “phrase” of the company name, mostly due to the recruiters' lack of capitalization. For instance, in “John is working at Home depot but really wants to move on”, NER does well with “Home Depot” but not so much with “Home depot”.
I would like NER to capture the full phrase “Home depot” (which of course I am labelling and correcting through ner.make-gold). But if I were forced to build a rule-based system, I have found a distinct pattern (through repetition of labelling): there is often a conjunction or other POS-transition type of word that ends the company phrase (like the “but” in my Home depot example above; some others: “then/during/…”), and that word almost never appears as part of the company name itself (how many companies do you know named “Home but”?).
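To make the rule concrete, here is a rough sketch of what I have in mind, in plain Python rather than spaCy. The word list, the function name `extend_to_boundary` and the `max_lookahead` cutoff are all placeholders I made up for illustration, and the whitespace tokenization is a simplification:

```python
# Assumed boundary words, taken from the examples above; a real list would be longer.
BOUNDARY_WORDS = {"but", "then", "during"}

def extend_to_boundary(tokens, start, end, max_lookahead=3):
    """Stretch the span [start, end) rightwards up to the next boundary word,
    but only if one occurs within max_lookahead tokens after the span."""
    limit = min(len(tokens), end + max_lookahead + 1)
    for i in range(end, limit):
        # Compare case-insensitively and ignore trailing punctuation.
        if tokens[i].lower().strip(".,;:!?") in BOUNDARY_WORDS:
            return start, i
    return start, end  # no boundary word nearby: leave the span unchanged

tokens = "John is working at Home depot but really wants to move on".split()
# Pretend the model only tagged "Home" (token 4) as the company.
start, end = extend_to_boundary(tokens, 4, 5)
print(" ".join(tokens[start:end]))
```

The lookahead cutoff is there so that a sentence without any boundary word (like “Carl is currently at BoA”) does not get its entity stretched to the end of the note.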
In short, just curious: does the spaCy modelling algo (which algo is it, BTW?) take POS into account, e.g. noun phrases or these conjunctions, when figuring out the correct boundaries? Is there any way to influence the model with rules (other than pre- or post-processing of the results)?
Will the spaCy model pick up this word-transition boundary correctly without me pushing it?
Thanks