Merging a noun_chunk slice for Hearst Pattern Detection

(also published to GitHub)

I'm attempting to implement the code from this repository, using spaCy's Matcher in place of regex:

The plan is to develop the patterns to include custom attributes and dependency relations.

However, I'm having problems with retokenizer.merge() when merging slices of existing noun_chunks.

The overall problem is to exclude modifier terms such as "other" and "some other". They are normally included within the span of a noun_chunk, but need to be kept separate because such terms are predicates for particular Hearst patterns.
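
To illustrate, here is a minimal sketch of the behaviour (assuming the en_core_web_sm model; the exact chunk boundaries depend on the model used):

```python
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("We are using docs, spans, tokens and some other spacy features, "
          "such as merge_entities, merge_noun_chunks and especially retokenizer")

# The modifier "some other" is absorbed into its noun_chunk, so it is not
# available as a separate predicate term for the Hearst patterns.
for chunk in doc.noun_chunks:
    print(chunk.text)
# e.g. "some other spacy features" comes out as a single chunk
```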

The following code has been written to address this problem:

text = "We are using docs, spans, tokens and some other spacy features, such as 
merge_entities, 
merge_noun_chunks and especially retokenizer"
self.predicates = ["some", "some other", "such as", "especially"]

###### relevant patterns:
# hypernym = {"POS" : {"IN": ["NOUN", "PROPN"]}} 
# hyponym = {"POS" : {"IN": ["NOUN", "PROPN"]}}
# punct = {"IS_PUNCT": True, "OP": "?"}
# {"label" : "such_as", "pattern" : [hypernym, punct, {"LEMMA": "such"}, {"LEMMA": 
"as"}, hyponym]}
# {"label" : "especially", "pattern" : [hypernym, punct, {"LEMMA" : "especially"}, 
hyponym]}


# having created the doc object this code iterates through the noun_chunks to 
# remove modifier terms for the pattern matcher. For example, 'some other', which
# are merged as part of a noun_chunk are predicate terms for the Hearst Pattern.

with doc.retokenize() as retokenizer:
        
    for chunk in doc.noun_chunks:

        attrs = {"tag": chunk.root.tag, "dep": chunk.root.dep}

        #iterate through all predicate terms.
        for predicate in self.predicates:
            count = 0

            # iterate through the noun_chunk. If its first, second etc token match those 
            # of a predicate word or phrase, then add to count.

            while count < len(predicate) and doc[chunk.start + count].lemma_ == predicate[count]:
                count += 1

            # Create a new noun_chunk based excluding the number of tokens detected 
            # as part of a predicate phrase.
            # for example "some other spaCy features" become "spaCy features"
            # "especially retokenizer" becomes "retokenizer"

            retokenizer.merge(doc[chunk.start + count : chunk.end], attrs = attrs)`

The problem is happening at the retokenizer.merge() stage. Slicing 'some' and 'other' from the noun_chunk 'some other spaCy features' returns the following error message:

[E102] Can't merge non-disjoint spans. 'spaCy' is already part of tokens to merge. If you want to find the longest non-overlapping spans, you can use the util.filter_spans helper:
Top-level Functions · spaCy API Documentation

While it is possible to create a custom attribute containing the filtered spans, I need the noun_chunk spans to be merged within the doc for the Matcher patterns to work.

Since the point is to merge a slice of a span, wouldn't using filter_spans just return the longest span, i.e. one that still includes the tokens to be excluded?
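
For reference, here is a minimal sketch of what I understand filter_spans to do with overlapping spans (token offsets assume the tokenization of the example sentence above):

```python
import spacy
from spacy.util import filter_spans

nlp = spacy.load("en_core_web_sm")
doc = nlp("We are using docs, spans, tokens and some other spacy features, "
          "such as merge_entities, merge_noun_chunks and especially retokenizer")

# filter_spans keeps the longest span in each overlapping group, so the full
# chunk would win over the trimmed slice.
spans = [doc[9:13], doc[11:13]]   # "some other spacy features" vs. "spacy features"
print(filter_spans(spans))        # -> [some other spacy features]
```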

Do you have any ideas as to where I'm going wrong here?

Solved: the problem was indeed the merging of a zero-length noun_chunk slice.

I have developed the following to prevent zero-length chunks:

```python
doc = nlp("We are using docs, spans, tokens and some other spacy features, such as merge entities, merge noun chunks and especially retokenizer")
doc = nlp("this is a kind of magic")  # second doc used to test the case that produced zero-length spans

# longer predicates come first so that e.g. ["some", "other"] is preferred over ["some"]
predicates = [["some", "other"], ["some"], ["such", "as"], ["especially"], ["a", "kind", "of"]]
# zero-length spans were being created by the ["a", "kind", "of"] predicate term

###### relevant patterns:
# hypernym = {"POS" : {"IN": ["NOUN", "PROPN"]}}
# hyponym = {"POS" : {"IN": ["NOUN", "PROPN"]}}
# punct = {"IS_PUNCT": True, "OP": "?"}
# {"label" : "such_as", "pattern" : [hypernym, punct, {"LEMMA": "such"}, {"LEMMA": "as"}, hyponym]}
# {"label" : "especially", "pattern" : [hypernym, punct, {"LEMMA" : "especially"}, hyponym]}
# {"label" : "a_kind_of", "pattern" : [hyponym, punct, {"LEMMA" : "a"}, {"LEMMA" : "kind"}, {"LEMMA" : "of"}, hypernym]}

def isPredicateMatch(chunk, predicates):
    # Recursive function which returns the noun_chunk slice left after stripping any
    # predicate terms that match the first tokens of the chunk. For example:
    # "some other spacy features" becomes "spacy features"
    # "especially retokenizer" becomes "retokenizer"
    # "a kind" is annotated by spaCy as a noun_chunk; since it also appears in the
    # predicate "a kind of" it was being reduced to a zero-length span, which the
    # length guards below prevent.

    def match(empty, count, chunk, predicates):
        # empty: the predicates list has been exhausted
        # count < len(predicates[0]): stay within the current predicate term
        # count < len(chunk) - 1: keep at least one token, so the slice is never empty
        # chunk[count].lemma_ == predicates[0][count]: chunk token equals the predicate term
        while (not empty
               and count < len(predicates[0])
               and count < len(chunk) - 1
               and chunk[count].lemma_ == predicates[0][count]):
            count += 1

        return empty, count

    def isMatch(chunk, predicates):
        empty, counter = match(predicates == [], 0, chunk, predicates)
        if empty or counter == len(predicates[0]):
            return chunk[counter:]
        else:
            return isMatch(chunk, predicates[1:])

    return isMatch(chunk, predicates)

with doc.retokenize() as retokenizer:

    for chunk in doc.noun_chunks:

        attrs = {"tag": chunk.root.tag, "dep": chunk.root.dep}

        retokenizer.merge(isPredicateMatch(chunk, predicates), attrs=attrs)
```
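
For completeness, here is a rough sketch of wiring the patterns above into the Matcher and running it over the retokenized doc, continuing from the snippet above (assuming spaCy v3's Matcher.add signature; which patterns actually fire depends on which of the two example sentences doc holds):

```python
from spacy.matcher import Matcher

hypernym = {"POS": {"IN": ["NOUN", "PROPN"]}}
hyponym = {"POS": {"IN": ["NOUN", "PROPN"]}}
punct = {"IS_PUNCT": True, "OP": "?"}

matcher = Matcher(nlp.vocab)
matcher.add("such_as", [[hypernym, punct, {"LEMMA": "such"}, {"LEMMA": "as"}, hyponym]])
matcher.add("especially", [[hypernym, punct, {"LEMMA": "especially"}, hyponym]])
matcher.add("a_kind_of", [[hyponym, punct, {"LEMMA": "a"}, {"LEMMA": "kind"}, {"LEMMA": "of"}, hypernym]])

# After retokenizing, each trimmed noun_chunk is one token (carrying the chunk root's
# tag and dep passed in via attrs), so hypernym/hyponym can match it as a single token.
for match_id, start, end in matcher(doc):
    print(nlp.vocab.strings[match_id], "->", doc[start:end].text)
```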