spacy noun chunk detection


Is there a way to add our custom text phrases chunks as dictionary to spacy model to detect in a given text ?

I'm trying with spaCy's merge_noun_chunks and getting near matches, but I don't want to miss the chunks from a custom list.

hi @sharathreddym!

Thanks for your question.

If I understand your problem correctly, you have known phrases that you want to detect, right? Have you tried spaCy's PhraseMatcher? This is an alternative to the token-based Matcher.

Similar to the example in the documentation:

import spacy
from spacy.matcher import PhraseMatcher

nlp = spacy.load("en_core_web_sm")
matcher = PhraseMatcher(nlp.vocab)
phrases = ["Barack Obama", "Angela Merkel", "Washington, D.C."]
# Only run nlp.make_doc to speed things up
patterns = [nlp.make_doc(text) for text in phrases]
matcher.add("PhraseList", patterns)

doc = nlp("German Chancellor Angela Merkel and US President Barack Obama "
          "converse in the Oval Office inside the White House in Washington, D.C.")
matches = matcher(doc)
for match_id, start, end in matches:
    span = doc[start:end]
    print(span.text)
# Angela Merkel
# Barack Obama
# Washington, D.C.

There's also a related spaCy universe extension that does fuzzy PhraseMatching: phruzz_matcher.

If you are comfortable creating components, you may want to add custom components to your pipeline. That is, create a PhraseMatcher component that always catches any of the custom phrases (you could use the fuzzy PhraseMatcher here too for a bit more generalization), and an independent component that uses merge_noun_chunks (or doc.noun_chunks) to catch things your custom phrase list may miss. Then you may want a final component that reconciles the two -- for example, prefer custom phrases first, and only fall back to the noun-chunk results where no custom phrase was detected.

Also, FYI, this forum is for Prodigy-specific questions. In the future, if you have questions that are specific to spaCy (e.g., creating custom components), you can post them on the spaCy GitHub discussions forum instead. Here's a great FAQ to read over before posting there.