Hello,
Is there a way to add our own list of custom phrases (chunks) to a spaCy model as a dictionary, so that they are detected in a given text?
I'm trying spaCy's merge_noun_chunks and getting near matches, but I don't want to miss any chunks from my custom list.
hi @sharathreddym!
Thanks for your question.
If I understand your problem correctly, you have known phrases that you want to detect, right? Have you tried spaCy's PhraseMatcher? It's an alternative to the token-based Matcher.
Similar to the example in the documentation:
import spacy
from spacy.matcher import PhraseMatcher
nlp = spacy.load("en_core_web_sm")
matcher = PhraseMatcher(nlp.vocab)
phrases = ["Barack Obama", "Angela Merkel", "Washington, D.C."]
# Only run nlp.make_doc to speed things up
patterns = [nlp.make_doc(text) for text in phrases]
matcher.add("PhraseList", patterns)
doc = nlp("German Chancellor Angela Merkel and US President Barack Obama "
          "converse in the Oval Office inside the White House in Washington, D.C.")
matches = matcher(doc)
for match_id, start, end in matches:
    span = doc[start:end]
    print(span.text)
# Angela Merkel
# Barack Obama
# Washington, D.C.
There's also a related spaCy Universe extension that does fuzzy phrase matching: phruzz_matcher.
If you are comfortable creating components, you may want to add custom components to your pipeline. That is, create a PhraseMatcher component that always picks up your custom phrases (you could use the fuzzy PhraseMatcher too for a bit more generalization). Then add merge_noun_chunks as a separate component to catch things your custom phrase list may miss. Finally, you may want a last component that reconciles the two. For example, use the custom phrases first and, where no custom phrase is detected, fall back to the merge_noun_chunks output.
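Here is a rough sketch of that idea, folded into a single component for brevity rather than three separate ones. The component name custom_phrase_merger, the CUSTOM_PHRASES list, and the example sentence are made up for illustration, and the "custom phrases win, noun chunks fill the gaps" rule is just one possible way to reconcile the two sources:

```python
import spacy
from spacy.language import Language
from spacy.matcher import PhraseMatcher
from spacy.util import filter_spans

# Hypothetical phrase list; replace it with your own dictionary.
CUSTOM_PHRASES = ["Oval Office", "White House press corps"]


@Language.factory("custom_phrase_merger")  # hypothetical component name
def create_custom_phrase_merger(nlp, name):
    matcher = PhraseMatcher(nlp.vocab)
    matcher.add("CUSTOM_PHRASE", [nlp.make_doc(text) for text in CUSTOM_PHRASES])

    def merge_phrases(doc):
        # Custom phrases take priority; filter_spans drops overlapping
        # matches, keeping the longest one.
        custom = filter_spans(matcher(doc, as_spans=True))
        taken = {i for span in custom for i in range(span.start, span.end)}

        # Fall back to noun chunks only where no custom phrase matched,
        # and only if a parser has already set dependency labels.
        chunks = []
        if doc.has_annotation("DEP"):
            chunks = [
                chunk for chunk in doc.noun_chunks
                if not any(i in taken for i in range(chunk.start, chunk.end))
            ]

        # Merge every selected multi-token span into a single token.
        with doc.retokenize() as retokenizer:
            for span in custom + chunks:
                if len(span) > 1:
                    retokenizer.merge(span)
        return doc

    return merge_phrases


# A blank pipeline has no parser, so only the custom phrases get merged here.
# With spacy.load("en_core_web_sm") the noun-chunk fallback would apply too.
nlp = spacy.blank("en")
nlp.add_pipe("custom_phrase_merger")
doc = nlp("Reporters from the White House press corps waited outside the Oval Office.")
print([token.text for token in doc])
```

With a trained pipeline, nlp.add_pipe appends the component at the end by default, so it runs after the parser and doc.noun_chunks is available. Don't add the built-in merge_noun_chunks component alongside this one, since the sketch already does the noun-chunk merging itself.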
Also, FYI, this forum is for Prodigy-specific questions. In the future, if you have questions that are specific to spaCy (e.g., creating custom components), you can submit a question on the spaCy GitHub discussion forum instead. Here's a great FAQ to read over before posting on that forum.