spacy noun chunk detection

sharathreddym · August 9, 2022, 4:14am

Hello,

Is there a way to add a custom list of phrases (chunks) to a spaCy model as a dictionary, so that they are detected in a given text?

I'm trying spaCy's merge_noun_chunks and getting near matches, but I don't want to miss any chunks from my custom list.

ryanwesslen · August 9, 2022, 3:36pm

Thanks for your question.

If I understand your problem correctly, you have known phrases that you want to detect, right? Have you tried spaCy's PhraseMatcher? It is an alternative to the token-based Matcher.

Similar to the example in the documentation:

import spacy
from spacy.matcher import PhraseMatcher

nlp = spacy.load("en_core_web_sm")
matcher = PhraseMatcher(nlp.vocab)
phrases = ["Barack Obama", "Angela Merkel", "Washington, D.C."]
# Only run nlp.make_doc to speed things up
patterns = [nlp.make_doc(text) for text in phrases]
matcher.add("PhraseList", patterns)

doc = nlp("German Chancellor Angela Merkel and US President Barack Obama "
          "converse in the Oval Office inside the White House in Washington, D.C.")
matches = matcher(doc)
for match_id, start, end in matches:
    span = doc[start:end]
    print(span.text)
# Angela Merkel
# Barack Obama
# Washington, D.C.

There's also a related spaCy universe extension that does fuzzy PhraseMatching: phruzz_matcher.

If you are comfortable writing custom pipeline components, you could combine both approaches. First, create a component that uses a PhraseMatcher so your custom phrases are always detected (the fuzzy PhraseMatcher would also work if you want a bit more generalization). Second, add a separate component that uses merge_noun_chunks to catch things your custom phrase list may miss. Finally, you may want a last component that reconciles the two: for example, use the custom phrases first, and fall back to merge_noun_chunks only where no custom phrase was detected.
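Here is a rough sketch of that "custom phrases first, noun chunks as fallback" idea, written as a single callable rather than three separate components. The class name PhraseFirstChunker and the phrase list are made up for illustration. It runs on a blank English pipeline so it works without a downloaded model, which means the noun-chunk fallback is skipped in this demo; with a parsed pipeline such as en_core_web_sm, that branch becomes active.

```python
import spacy
from spacy.matcher import PhraseMatcher
from spacy.tokens import Doc
from spacy.util import filter_spans


class PhraseFirstChunker:
    """Hypothetical helper: merge custom phrases first, then any
    noun chunks that do not overlap a custom phrase."""

    def __init__(self, nlp, phrases):
        self.matcher = PhraseMatcher(nlp.vocab)
        # make_doc only tokenizes, which is all PhraseMatcher patterns need
        self.matcher.add("CUSTOM_PHRASE", [nlp.make_doc(p) for p in phrases])

    def __call__(self, doc: Doc) -> Doc:
        # 1) Custom phrases take priority. filter_spans drops overlapping
        #    matches, preferring the longest span.
        custom = filter_spans([doc[start:end] for _, start, end in self.matcher(doc)])
        taken = {token.i for span in custom for token in span}

        # 2) Fall back to noun chunks, but only where no custom phrase was
        #    found. doc.noun_chunks needs a dependency parse, so skip this
        #    step when the pipeline has no parser.
        chunks = []
        if doc.has_annotation("DEP"):
            chunks = [
                chunk for chunk in doc.noun_chunks
                if not any(token.i in taken for token in chunk)
            ]

        # 3) Merge the selected (non-overlapping) spans into single tokens,
        #    similar to what merge_noun_chunks does for noun chunks.
        with doc.retokenize() as retokenizer:
            for span in custom + chunks:
                retokenizer.merge(span)
        return doc


# Assumption: a tokenizer-only pipeline, to keep the demo self-contained.
nlp = spacy.blank("en")
chunker = PhraseFirstChunker(nlp, ["Barack Obama", "Angela Merkel"])
doc = chunker(nlp("German Chancellor Angela Merkel and US President Barack Obama converse."))
print([token.text for token in doc])
```

To run this inside the pipeline itself, the class could be registered with spaCy's @Language.factory decorator and added with nlp.add_pipe after the parser. Keeping the reconciliation in one place avoids having to merge overlapping spans, which the retokenizer does not allow.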

Also, FYI, this forum is for Prodigy-specific questions. In the future, if you have questions that are specific to spaCy (e.g., creating custom components), you can post them on the spaCy GitHub discussion forum instead. There is a great FAQ there that is worth reading before you post.
