Merging a noun_chunk slice for Hearst Pattern Detection

(also published to GitHub)

I'm attempting to implement the code from this repository, using spaCy's Matcher in place of regex:

The plan is to develop the patterns to include custom attributes and dependency relations.

However, I'm having problems with retokenizer.merge() when merging slices of existing noun_chunks.

The overall problem is to exclude modifier terms such as "other" and "some other". They are normally included within the span of a noun_chunk, but need to be kept separate because such terms are predicates for particular Hearst patterns.
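
To illustrate, here is a minimal sketch of the behaviour (assuming the en_core_web_sm model; the exact chunk boundaries depend on the model used):

```python
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("We are using docs, spans, tokens and some other spacy features, "
          "such as merge_entities, merge_noun_chunks and especially retokenizer")

# The modifier "some other" is absorbed into its noun_chunk, so it is not
# available as a separate predicate term for the Hearst patterns.
for chunk in doc.noun_chunks:
    print(chunk.text)
# e.g. "some other spacy features" comes out as a single chunk
```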

The following code has been written to address this problem:

text = "We are using docs, spans, tokens and some other spacy features, such as 
merge_entities, 
merge_noun_chunks and especially retokenizer"
self.predicates = ["some", "some other", "such as", "especially"]

###### relevant patterns:
# hypernym = {"POS" : {"IN": ["NOUN", "PROPN"]}} 
# hyponym = {"POS" : {"IN": ["NOUN", "PROPN"]}}
# punct = {"IS_PUNCT": True, "OP": "?"}
# {"label" : "such_as", "pattern" : [hypernym, punct, {"LEMMA": "such"}, {"LEMMA": 
"as"}, hyponym]}
# {"label" : "especially", "pattern" : [hypernym, punct, {"LEMMA" : "especially"}, 
hyponym]}


# having created the doc object this code iterates through the noun_chunks to 
# remove modifier terms for the pattern matcher. For example, 'some other', which
# are merged as part of a noun_chunk are predicate terms for the Hearst Pattern.

with doc.retokenize() as retokenizer:
        
    for chunk in doc.noun_chunks:

        attrs = {"tag": chunk.root.tag, "dep": chunk.root.dep}

        #iterate through all predicate terms.
        for predicate in self.predicates:
            count = 0

            # iterate through the noun_chunk. If its first, second etc token match those 
            # of a predicate word or phrase, then add to count.

            while count < len(predicate) and doc[chunk.start + count].lemma_ == predicate[count]:
                count += 1

            # Create a new noun_chunk based excluding the number of tokens detected 
            # as part of a predicate phrase.
            # for example "some other spaCy features" become "spaCy features"
            # "especially retokenizer" becomes "retokenizer"

            retokenizer.merge(doc[chunk.start + count : chunk.end], attrs = attrs)`

The problem is happening at the retokenizer.merge() stage. Slicing 'some' and 'other' from the noun_chunk 'some other spaCy features' returns the following error message:

[E102] Can't merge non-disjoint spans. 'spaCy' is already part of tokens to merge. If you want to find the longest non-overlapping spans, you can use the util.filter_spans helper:
Top-level Functions · spaCy API Documentation

While it is possible to create a custom attribute containing the filtered spans, I need the noun_chunk spans to be merged within the doc for the Matcher patterns to work.

Since the point is to merge a slice of a span, wouldn't using filter_spans just return the longest span, i.e. one that still includes the tokens to be excluded?
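
For reference, here is a minimal sketch of what I understand filter_spans to do with overlapping spans (token offsets assume the tokenization of the example sentence above):

```python
import spacy
from spacy.util import filter_spans

nlp = spacy.load("en_core_web_sm")
doc = nlp("We are using docs, spans, tokens and some other spacy features, "
          "such as merge_entities, merge_noun_chunks and especially retokenizer")

# filter_spans keeps the longest span in each overlapping group, so the full
# chunk would win over the trimmed slice.
spans = [doc[9:13], doc[11:13]]   # "some other spacy features" vs. "spacy features"
print(filter_spans(spans))        # -> [some other spacy features]
```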

Do you have any ideas as to where I'm going wrong here?

Solved: the problem was indeed the merging of a zero-length noun_chunk slice.

I have developed the following to prevent zero-length chunks:

```python
doc = nlp("We are using docs, spans, tokens and some other spacy features, such as merge entities, merge noun chunks and especially retokenizer")
doc = nlp("this is a kind of magic")  # second doc used to test the case that produced zero-length spans

# longer predicates come first so that e.g. ["some", "other"] is preferred over ["some"]
predicates = [["some", "other"], ["some"], ["such", "as"], ["especially"], ["a", "kind", "of"]]
# zero-length spans were being created by the ["a", "kind", "of"] predicate term

###### relevant patterns:
# hypernym = {"POS" : {"IN": ["NOUN", "PROPN"]}}
# hyponym = {"POS" : {"IN": ["NOUN", "PROPN"]}}
# punct = {"IS_PUNCT": True, "OP": "?"}
# {"label" : "such_as", "pattern" : [hypernym, punct, {"LEMMA": "such"}, {"LEMMA": "as"}, hyponym]}
# {"label" : "especially", "pattern" : [hypernym, punct, {"LEMMA" : "especially"}, hyponym]}
# {"label" : "a_kind_of", "pattern" : [hyponym, punct, {"LEMMA" : "a"}, {"LEMMA" : "kind"}, {"LEMMA" : "of"}, hypernym]}

def isPredicateMatch(chunk, predicates):
    # Recursive function which returns the noun_chunk slice left after stripping any
    # predicate terms that match the first tokens of the chunk. For example:
    # "some other spacy features" becomes "spacy features"
    # "especially retokenizer" becomes "retokenizer"
    # "a kind" is annotated by spaCy as a noun_chunk; since it also appears in the
    # predicate "a kind of" it was being reduced to a zero-length span, which the
    # length guards below prevent.

    def match(empty, count, chunk, predicates):
        # empty: the predicates list has been exhausted
        # count < len(predicates[0]): stay within the current predicate term
        # count < len(chunk) - 1: keep at least one token, so the slice is never empty
        # chunk[count].lemma_ == predicates[0][count]: chunk token equals the predicate term
        while (not empty
               and count < len(predicates[0])
               and count < len(chunk) - 1
               and chunk[count].lemma_ == predicates[0][count]):
            count += 1

        return empty, count

    def isMatch(chunk, predicates):
        empty, counter = match(predicates == [], 0, chunk, predicates)
        if empty or counter == len(predicates[0]):
            return chunk[counter:]
        else:
            return isMatch(chunk, predicates[1:])

    return isMatch(chunk, predicates)

with doc.retokenize() as retokenizer:

    for chunk in doc.noun_chunks:

        attrs = {"tag": chunk.root.tag, "dep": chunk.root.dep}

        retokenizer.merge(isPredicateMatch(chunk, predicates), attrs=attrs)
```
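
For completeness, here is a rough sketch of wiring the patterns above into the Matcher and running it over the retokenized doc, continuing from the snippet above (assuming spaCy v3's Matcher.add signature; which patterns actually fire depends on which of the two example sentences doc holds):

```python
from spacy.matcher import Matcher

hypernym = {"POS": {"IN": ["NOUN", "PROPN"]}}
hyponym = {"POS": {"IN": ["NOUN", "PROPN"]}}
punct = {"IS_PUNCT": True, "OP": "?"}

matcher = Matcher(nlp.vocab)
matcher.add("such_as", [[hypernym, punct, {"LEMMA": "such"}, {"LEMMA": "as"}, hyponym]])
matcher.add("especially", [[hypernym, punct, {"LEMMA": "especially"}, hyponym]])
matcher.add("a_kind_of", [[hyponym, punct, {"LEMMA": "a"}, {"LEMMA": "kind"}, {"LEMMA": "of"}, hypernym]])

# After retokenizing, each trimmed noun_chunk is one token (carrying the chunk root's
# tag and dep passed in via attrs), so hypernym/hyponym can match it as a single token.
for match_id, start, end in matcher(doc):
    print(nlp.vocab.strings[match_id], "->", doc[start:end].text)
```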