When I try to run terms.train-vectors I get the following error message:
ValueError: [E102] Can't merge non-disjoint spans. 'the' is already part of tokens to merge.
If you want to find the longest non-overlapping spans, you can use the util.filter_spans helper:
https://spacy.io/api/top-level#util.filter_spans
Is there something else I am supposed to be doing?
Thanks for the report --- it seems that recipe needs to be updated in line with a more recent version of spaCy. Recent versions of spaCy detect conflicting spans if you try to merge multiple overlapping phrases, while previous versions of spaCy ignored those cases. You should be able to modify the recipe to resolve the conflicts. I think something like this should work:
def merge_entities_and_nouns(doc):
    assert doc.is_parsed
    with doc.retokenize() as retokenizer:
        seen_words = set()
        for ent in doc.ents:
            attrs = {"tag": ent.root.tag, "dep": ent.root.dep, "ent_type": ent.label}
            retokenizer.merge(ent, attrs=attrs)
            seen_words.update(w.i for w in ent)
        for np in doc.noun_chunks:
            if any(w.i in seen_words for w in np):
                continue
            attrs = {"tag": np.root.tag, "dep": np.root.dep}
            retokenizer.merge(np, attrs=attrs)
    return doc
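Another way to resolve the conflicts, since the error message points to util.filter_spans, is to collect both kinds of spans and keep only the longest non-overlapping ones before merging. This is just a rough sketch assuming a spaCy version that provides spacy.util.filter_spans, not code from the recipe itself:

from spacy.util import filter_spans

def merge_entities_and_nouns(doc):
    assert doc.is_parsed
    # remember which spans came from doc.ents so their entity label survives the merge
    ent_offsets = {(ent.start, ent.end) for ent in doc.ents}
    # keep only the longest non-overlapping spans out of entities + noun chunks
    spans = filter_spans(list(doc.ents) + list(doc.noun_chunks))
    with doc.retokenize() as retokenizer:
        for span in spans:
            attrs = {"tag": span.root.tag, "dep": span.root.dep}
            if (span.start, span.end) in ent_offsets:
                attrs["ent_type"] = span.label
            retokenizer.merge(span, attrs=attrs)
    return doc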
Then in your terms.train-vectors recipe, update as follows:
if merge_ents and merge_nps:
    nlp.add_pipe(merge_entities_and_nouns, name="merge_entities_and_nouns")
    log("RECIPE: Added pipeline component to merge entities and noun chunks")
elif merge_ents:
    nlp.add_pipe(merge_entities, name="merge_entities")
    log("RECIPE: Added pipeline component to merge entities")
elif merge_nps:
    nlp.add_pipe(merge_noun_chunks, name="merge_noun_chunks")
    log("RECIPE: Added pipeline component to merge noun chunks")
I think this should solve the problem and allow you to get the vectors trained. However, when you load back the model, spaCy will look for your merge_entities_and_nouns function. You can tell it where to find that by setting Language.factories["merge_entities_and_nouns"] = merge_entities_and_nouns in the script that will use the model. If you just want to use the vectors in terms.teach, an easy solution is to just call nlp.remove_pipe("merge_entities_and_nouns") before you save out the model in the terms.train-vectors recipe. This way the model you save out won't refer to that pipeline component --- which you don't need for the terms.teach recipe.
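For the second option, the end of the recipe could look roughly like this (nlp.to_disk and the output_dir name are just placeholders for whatever the recipe actually uses to save the model):

# remove the custom component before saving, so the saved model
# doesn't reference merge_entities_and_nouns
if "merge_entities_and_nouns" in nlp.pipe_names:
    nlp.remove_pipe("merge_entities_and_nouns")
nlp.to_disk(output_dir)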
I hope this helps you work around the problem until we release an updated version of Prodigy. Let me know if anything doesn't work.
I added the function to the spaCy pipeline and adjusted the terms.train-vectors recipe, but I'm still getting the same error.
I also tried just using the -MN flag and got the same error.
Here's some of the traceback:
    for doc in self.nlp.pipe((eg["text"] for eg in stream)):
  File "C:\Users\Danyal\Anaconda3\envs\d-learn\lib\site-packages\spacy\language.py", line 793, in pipe
    for doc in docs:
  File "C:\Users\Danyal\Anaconda3\envs\d-learn\lib\site-packages\spacy\language.py", line 997, in _pipe
    doc = func(doc, **kwargs)
  File "C:\Users\Danyal\Anaconda3\envs\d-learn\lib\site-packages\spacy\pipeline\functions.py", line 69, in merge_entities_and_nouns
    retokenizer.merge(np, attrs=attrs)
  File "_retokenize.pyx", line 60, in spacy.tokens._retokenize.Retokenizer.merge
ValueError: [E102] Can't merge non-disjoint spans. 'you' is already part of tokens to merge. If you want to find the longest non-overlapping spans, you can use the util.filter_spans helper:
https://spacy.io/api/top-level#util.filter_spans
def merge_entities_and_nouns(doc):
    assert doc.is_parsed
    with doc.retokenize() as retokenizer:
        seen_words = set()
        for ent in doc.ents:
            attrs = {"tag": ent.root.tag, "dep": ent.root.dep, "ent_type": ent.label}
            retokenizer.merge(ent, attrs=attrs)
            seen_words.update(w.i for w in ent)
        for np in doc.noun_chunks:
            if any(w.i in seen_words for w in np):
                continue
            attrs = {"tag": np.root.tag, "dep": np.root.dep}
            retokenizer.merge(np, attrs=attrs)
            seen_words.update(w.i for w in np)
    return doc
I think it was trying to merge overlapping noun chunks, so I added a line just above the return statement that adds the tokens of each merged noun chunk to the seen_words set.
@dshefman I think the problem in your code is how you're setting the factory: factories are functions that take the nlp object and any other keyword arguments and return the pipeline component function. This is useful if your pipeline component is a class that needs state, like the shared vocab. In your case, the component is just a function, so you don't want to actually execute it and pass it the nlp object (which results in the error when you try to assert nlp.is_parsed instead of doc.is_parsed). Try the following:
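A minimal sketch of that, assuming spaCy v2's Language.factories API (the factory name here is just illustrative):

from spacy.language import Language

def merge_entities_and_nouns_factory(nlp, **cfg):
    # the factory receives the nlp object when the model is loaded
    # and returns the component function itself
    return merge_entities_and_nouns

Language.factories["merge_entities_and_nouns"] = merge_entities_and_nouns_factory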
Thank you @ines. This was helpful.
I ran into another problem when attempting to use my custom model in the textcat.teach recipe. I received the following error when running the code below.
error message:
    for np in doc.noun_chunks:
  File "doc.pyx", line 586, in noun_chunks
ValueError: [E029] noun_chunks requires the dependency parse, which requires a statistical model to be installed and loaded. For more info, see the documentation: Models & Languages · spaCy Usage Documentation
Then, I noticed in line 65 of the textcat.teach recipe that the "parser" is disabled:
@dshefman Yes, your solution is fine --- it's one of those cases where your workflow is so custom that you need to adjust the defaults a bit. For most use cases, the base model used in textcat.teach doesn't need the NER and parser, so it's disabled to speed up the model (no need to predict stuff you don't use).
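If a workflow does need noun_chunks, the adjustment is just to keep the parser enabled when the base model is loaded --- roughly something like this, assuming the recipe loads the model with spacy.load and a disable list (the actual line in the recipe may differ):

import spacy

# sketch only: keep the parser so doc.noun_chunks works, and only disable
# what the recipe really doesn't need; "spacy_model" stands in for the
# recipe's model argument
nlp = spacy.load(spacy_model, disable=["ner"])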