Can't merge non-disjoint spans when using terms.train-vectors

Hey there,

When I try to run terms.train-vectors I get the following error message:

ValueError: [E102] Can't merge non-disjoint spans. 'the' is already part of tokens to merge. 
If you want to find the longest non-overlapping spans, you can use the util.filter_spans helper:
https://spacy.io/api/top-level#util.filter_spans

Is there something else I am supposed to be doing?

Thanks for the report --- it seems that recipe needs to be updated for more recent versions of spaCy. Recent versions detect conflicting spans if you try to merge multiple overlapping phrases, whereas previous versions silently ignored those cases. You should be able to modify the recipe to resolve the conflicts. I think something like this should work:

def merge_entities_and_nouns(doc):
    assert doc.is_parsed
    with doc.retokenize() as retokenizer:
        seen_words = set()
        # Merge named entities first and remember which token indices they cover.
        for ent in doc.ents:
            attrs = {"tag": ent.root.tag, "dep": ent.root.dep, "ent_type": ent.label}
            retokenizer.merge(ent, attrs=attrs)
            seen_words.update(w.i for w in ent)
        # Only merge noun chunks that don't overlap an already-merged entity.
        for np in doc.noun_chunks:
            if any(w.i in seen_words for w in np):
                continue
            attrs = {"tag": np.root.tag, "dep": np.root.dep}
            retokenizer.merge(np, attrs=attrs)
    return doc
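
As a possible alternative, the util.filter_spans helper that the error message points to can do the overlap resolution for you. Here's a rough sketch of that variant (assuming spaCy v2.1+, which ships filter_spans) --- note it keeps the longest span when an entity and a noun chunk conflict, rather than always preferring the entity like the function above:

from spacy.util import filter_spans

def merge_entities_and_nouns(doc):
    assert doc.is_parsed
    # Remember entity boundaries and labels so we can re-apply ent_type
    # to the spans that survive filtering.
    ent_labels = {(ent.start, ent.end): ent.label for ent in doc.ents}
    # Keep only the longest non-overlapping spans across entities and
    # noun chunks, so the retokenizer never sees a conflict.
    spans = filter_spans(list(doc.ents) + list(doc.noun_chunks))
    with doc.retokenize() as retokenizer:
        for span in spans:
            attrs = {"tag": span.root.tag, "dep": span.root.dep}
            if (span.start, span.end) in ent_labels:
                attrs["ent_type"] = ent_labels[(span.start, span.end)]
            retokenizer.merge(span, attrs=attrs)
    return doc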

Then in your terms.train-vectors recipe, update as follows:

if merge_ents and merge_nps:
    nlp.add_pipe(merge_entities_and_nouns, name="merge_entities_and_nouns")
    log("RECIPE: Added pipeline component to merge entities and noun chunks")
elif merge_ents:
    nlp.add_pipe(merge_entities, name="merge_entities")
    log("RECIPE: Added pipeline component to merge entities")
elif merge_nps:
    nlp.add_pipe(merge_noun_chunks, name="merge_noun_chunks")
    log("RECIPE: Added pipeline component to merge noun chunks")

I think this should solve the problem and let you get the vectors trained. However, when you load the model back, spaCy will look for your merge_entities_and_nouns function. You can tell it where to find that by registering a "merge_entities_and_nouns" entry in Language.factories in the script that will use the model.

If you just want to use the vectors in terms.teach, an easier solution is to call nlp.remove_pipe("merge_entities_and_nouns") before you save out the model in the terms.train-vectors recipe. That way the saved model won't refer to that pipeline component at all --- and you don't need it for the terms.teach recipe.
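
To make that concrete, here's a rough sketch of both options. It assumes spaCy v2, where a factory is called with the nlp object, so a small wrapper lambda is used; output_model is just a placeholder for wherever the recipe saves the model:

# Option 1: register the component so spacy.load() can reconstruct the pipeline.
from spacy.language import Language

Language.factories["merge_entities_and_nouns"] = (
    lambda nlp, **cfg: merge_entities_and_nouns
)

# Option 2: drop the component before saving, since terms.teach doesn't need it.
nlp.remove_pipe("merge_entities_and_nouns")
nlp.to_disk(output_model)  # output_model: placeholder for the recipe's save path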

I hope this helps you work around the problem until we release an updated version of Prodigy. Let me know if anything doesn't work :slight_smile:.

Thanks, @honnibal!

I added the function to the spaCy pipeline and adjusted the terms.train-vectors recipe, but I'm still getting the same error.

I also tried just using the -MN flag and got the same error.

Here's some of the traceback:

 for doc in self.nlp.pipe((eg["text"] for eg in stream)):
  File "C:\Users\Danyal\Anaconda3\envs\d-learn\lib\site-packages\spacy\language.py", line 793, in pipe
    for doc in docs:
  File "C:\Users\Danyal\Anaconda3\envs\d-learn\lib\site-packages\spacy\language.py", line 997, in _pipe
    doc = func(doc, **kwargs)
  File "C:\Users\Danyal\Anaconda3\envs\d-learn\lib\site-packages\spacy\pipeline\functions.py", line 69, in merge_entities_and_nouns
    retokenizer.merge(np, attrs=attrs)
  File "_retokenize.pyx", line 60, in spacy.tokens._retokenize.Retokenizer.merge
ValueError: [E102] Can't merge non-disjoint spans. 'you' is already part of tokens to merge. If you want to find the longest non-overlapping spans, you can use the util.filter_spans helper:
https://spacy.io/api/top-level#util.filter_spans