(also published to GitHub)
I'm attempting to implement the code from this repository using the spaCy Matcher in place of regex.
The plan is to extend the patterns to include custom attributes and dependency relations.
However, I'm having problems with retokenizer.merge() when merging slices of existing noun_chunks.
The overall problem is to exclude modifier terms such as "other" and "some other". They are normally included within the span of a noun_chunk, but they need to be kept separate because such terms are predicates for particular Hearst Patterns.
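For illustration, here is a minimal sketch of what I mean (the en_core_web_sm model is just an assumption on my part, and the exact chunk boundaries depend on the parser):

```python
import spacy

nlp = spacy.load("en_core_web_sm")  # assumed model; any pipeline with a parser should do
doc = nlp("We are using docs, spans, tokens and some other spacy features, "
          "such as merge_entities, merge_noun_chunks and especially retokenizer")

# the modifier tokens sit inside the chunk span, e.g. one chunk typically
# comes out as "some other spacy features" rather than just "spacy features"
for chunk in doc.noun_chunks:
    print(chunk.text, chunk.start, chunk.end)
```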
The following code has been written to address this problem:
text = "We are using docs, spans, tokens and some other spacy features, such as
merge_entities,
merge_noun_chunks and especially retokenizer"
self.predicates = ["some", "some other", "such as", "especially"]
###### relevant patterns:
# hypernym = {"POS": {"IN": ["NOUN", "PROPN"]}}
# hyponym = {"POS": {"IN": ["NOUN", "PROPN"]}}
# punct = {"IS_PUNCT": True, "OP": "?"}
# {"label": "such_as", "pattern": [hypernym, punct, {"LEMMA": "such"}, {"LEMMA": "as"}, hyponym]}
# {"label": "especially", "pattern": [hypernym, punct, {"LEMMA": "especially"}, hyponym]}

# having created the doc object, this code iterates through the noun_chunks to
# remove modifier terms for the pattern matcher. For example, 'some other', which
# is merged as part of a noun_chunk, is a predicate term for the Hearst Pattern.
with doc.retokenize() as retokenizer:
    for chunk in doc.noun_chunks:
        attrs = {"tag": chunk.root.tag, "dep": chunk.root.dep}
        # iterate through all predicate terms
        for predicate in self.predicates:
            predicate_words = predicate.split()
            count = 0
            # iterate through the noun_chunk: if its first, second, etc. token
            # matches the corresponding word of the predicate phrase, add to count
            while (count < len(predicate_words)
                   and doc[chunk.start + count].lemma_ == predicate_words[count]):
                count += 1
            # create a new noun_chunk excluding the tokens detected as part of a
            # predicate phrase, e.g. "some other spaCy features" becomes
            # "spaCy features" and "especially retokenizer" becomes "retokenizer"
            retokenizer.merge(doc[chunk.start + count : chunk.end], attrs=attrs)
```
The problem is happening at the retokenizer.merge() stage. Slicing 'some' and 'other' off the noun_chunk 'some other spaCy features' returns the following error message:
[E102] Can't merge non-disjoint spans. 'spaCy' is already part of tokens to merge. If you want to find the longest non-overlapping spans, you can use the util.filter_spans helper:
Top-level Functions · spaCy API Documentation
While it is possible to create a custom attribute containing the filtered spans, I need the noun_chunk spans to be merged within the doc for the Matcher patterns to work.
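For reference, this is roughly how I intend to wire the patterns into the Matcher, a sketch in spaCy v3 syntax rather than my exact setup, reusing the hypernym/hyponym/punct specs commented above:

```python
from spacy.matcher import Matcher

matcher = Matcher(nlp.vocab)

hypernym = {"POS": {"IN": ["NOUN", "PROPN"]}}
hyponym = {"POS": {"IN": ["NOUN", "PROPN"]}}
punct = {"IS_PUNCT": True, "OP": "?"}

# each hypernym/hyponym slot is a single-token spec, which is why the multi-word
# noun chunks need to be merged into single tokens before matching
matcher.add("such_as", [[hypernym, punct, {"LEMMA": "such"}, {"LEMMA": "as"}, hyponym]])
matcher.add("especially", [[hypernym, punct, {"LEMMA": "especially"}, hyponym]])

matches = matcher(doc)
```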
Given that the point is to merge a slice of a span, would using filter_spans simply return the longest span, i.e. the one that still includes the tokens that are supposed to be excluded?
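My current understanding of filter_spans, and the reason for the question, is sketched below; it reuses the doc from the first snippet (before any merging) and assumes the parser produces the chunk shown there:

```python
from spacy.util import filter_spans

# hypothetical check: if both the full chunk and its trimmed slice are passed in,
# filter_spans keeps the longest non-overlapping span, i.e. the one that still
# contains "some other"
full_chunk = next(c for c in doc.noun_chunks if c.text.startswith("some other"))
trimmed = doc[full_chunk.start + 2 : full_chunk.end]
print(filter_spans([full_chunk, trimmed]))
```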
Do you have any ideas as to where I'm going wrong here?