When I try to run terms.train-vectors I get the following error message:
ValueError: [E102] Can't merge non-disjoint spans. 'the' is already part of tokens to merge.
If you want to find the longest non-overlapping spans, you can use the util.filter_spans helper:
https://spacy.io/api/top-level#util.filter_spans
Is there something else I am supposed to be doing?
Thanks for the report --- it seems that recipe needs to be updated in line with a more recent version of spaCy. Recent versions of spaCy detect conflicting spans if you try to merge multiple overlapping phrases, while previous versions of spaCy ignored those cases. You should be able to modify the recipe to resolve the conflicts. I think something like this should work:
def merge_entities_and_nouns(doc):
    assert doc.is_parsed
    with doc.retokenize() as retokenizer:
        seen_words = set()
        for ent in doc.ents:
            attrs = {"tag": ent.root.tag, "dep": ent.root.dep, "ent_type": ent.label}
            retokenizer.merge(ent, attrs=attrs)
            seen_words.update(w.i for w in ent)
        for np in doc.noun_chunks:
            if any(w.i in seen_words for w in np):
                continue
            attrs = {"tag": np.root.tag, "dep": np.root.dep}
            retokenizer.merge(np, attrs=attrs)
    return doc
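Another way to resolve the conflicts, since the error message points to util.filter_spans, is to collect both kinds of spans and keep only the longest non-overlapping ones before merging. This is just a rough sketch assuming a spaCy version that provides spacy.util.filter_spans, not code from the recipe itself:

from spacy.util import filter_spans

def merge_entities_and_nouns(doc):
    assert doc.is_parsed
    # remember which spans came from doc.ents so their entity label survives the merge
    ent_offsets = {(ent.start, ent.end) for ent in doc.ents}
    # keep only the longest non-overlapping spans out of entities + noun chunks
    spans = filter_spans(list(doc.ents) + list(doc.noun_chunks))
    with doc.retokenize() as retokenizer:
        for span in spans:
            attrs = {"tag": span.root.tag, "dep": span.root.dep}
            if (span.start, span.end) in ent_offsets:
                attrs["ent_type"] = span.label
            retokenizer.merge(span, attrs=attrs)
    return doc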
Then in your terms.train-vectors recipe, update as follows:
if merge_ents and merge_nps:
    nlp.add_pipe(merge_entities_and_nouns, name="merge_entities_and_nouns")
    log("RECIPE: Added pipeline component to merge entities and noun chunks")
elif merge_ents:
    nlp.add_pipe(merge_entities, name="merge_entities")
    log("RECIPE: Added pipeline component to merge entities")
elif merge_nps:
    nlp.add_pipe(merge_noun_chunks, name="merge_noun_chunks")
    log("RECIPE: Added pipeline component to merge noun chunks")
I think this should solve the problem and allow you to get the vectors trained. However, when you load back the model, spaCy will look for your merge_entities_and_nouns function. You can tell it where to find that by setting Language.factories["merge_entities_and_nouns"] = merge_entities_and_nouns in the script that will use the model. If you just want to use the vectors in terms.teach, an easy solution is to just call nlp.remove_pipe("merge_entities_and_nouns") before you save out the model in the terms.train-vectors recipe. This way the model you save out won't refer to that pipeline component --- which you don't need for the terms.teach recipe.
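For the second option, the end of the recipe could look roughly like this (nlp.to_disk and the output_dir name are just placeholders for whatever the recipe actually uses to save the model):

# remove the custom component before saving, so the saved model
# doesn't reference merge_entities_and_nouns
if "merge_entities_and_nouns" in nlp.pipe_names:
    nlp.remove_pipe("merge_entities_and_nouns")
nlp.to_disk(output_dir)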
I hope this helps you work around the problem until we release an updated version of Prodigy. Let me know if anything doesn't work.
I added the function to the spaCy pipeline and adjusted the terms.train-vectors recipe, but I'm still getting the same error.
I also tried just using the -MN flag and got the same error.
Here's some of the traceback:
    for doc in self.nlp.pipe((eg["text"] for eg in stream)):
  File "C:\Users\Danyal\Anaconda3\envs\d-learn\lib\site-packages\spacy\language.py", line 793, in pipe
    for doc in docs:
  File "C:\Users\Danyal\Anaconda3\envs\d-learn\lib\site-packages\spacy\language.py", line 997, in _pipe
    doc = func(doc, **kwargs)
  File "C:\Users\Danyal\Anaconda3\envs\d-learn\lib\site-packages\spacy\pipeline\functions.py", line 69, in merge_entities_and_nouns
    retokenizer.merge(np, attrs=attrs)
  File "_retokenize.pyx", line 60, in spacy.tokens._retokenize.Retokenizer.merge
ValueError: [E102] Can't merge non-disjoint spans. 'you' is already part of tokens to merge. If you want to find the longest non-overlapping spans, you can use the util.filter_spans helper:
https://spacy.io/api/top-level#util.filter_spans
def merge_entities_and_nouns(doc):
    assert doc.is_parsed
    with doc.retokenize() as retokenizer:
        seen_words = set()
        for ent in doc.ents:
            attrs = {"tag": ent.root.tag, "dep": ent.root.dep, "ent_type": ent.label}
            retokenizer.merge(ent, attrs=attrs)
            seen_words.update(w.i for w in ent)
        for np in doc.noun_chunks:
            if any(w.i in seen_words for w in np):
                continue
            attrs = {"tag": np.root.tag, "dep": np.root.dep}
            retokenizer.merge(np, attrs=attrs)
            seen_words.update(w.i for w in np)
    return doc
I think it was trying to merge overlapping noun chunks, so I added a line just above the return statement that adds the tokens of each merged noun chunk to the seen_words set.
@dshefman I think the problem in your code is how you're setting the factory: factories are functions that take the nlp object and any other keyword arguments and return the pipeline component function. This is useful if your pipeline component is a class that needs state, like the shared vocab. In your case, the component is just a function, so you don't want to actually execute it and pass it the nlp object (which results in the error when you try to assert nlp.is_parsed instead of doc.is_parsed). Try the following:
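A minimal sketch of that, assuming spaCy v2's Language.factories API (the factory name here is just illustrative):

from spacy.language import Language

def merge_entities_and_nouns_factory(nlp, **cfg):
    # the factory receives the nlp object when the model is loaded
    # and returns the component function itself
    return merge_entities_and_nouns

Language.factories["merge_entities_and_nouns"] = merge_entities_and_nouns_factory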
Thank you @ines. This was helpful.
I ran into another problem when attempting to use my custom model in the textcat.teach recipe. I received the following error when running the code below.
error message:
    for np in doc.noun_chunks:
  File "doc.pyx", line 586, in noun_chunks
ValueError: [E029] noun_chunks requires the dependency parse, which requires a statistical model to be installed and loaded. For more info, see the documentation: Models & Languages · spaCy Usage Documentation
Then, I noticed in line 65 of the textcat.teach recipe that the "parser" is disabled:
@dshefman Yes, your solution is fine --- it's one of those cases where your workflow is so custom that you need to adjust the defaults a bit. For most use cases, the base model used in textcat.teach doesn't need the NER and parser, so it's disabled to speed up the model (no need to predict stuff you don't use).
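If a workflow does need noun_chunks, the adjustment is just to keep the parser enabled when the base model is loaded --- roughly something like this, assuming the recipe loads the model with spacy.load and a disable list (the actual line in the recipe may differ):

import spacy

# sketch only: keep the parser so doc.noun_chunks works, and only disable
# what the recipe really doesn't need; "spacy_model" stands in for the
# recipe's model argument
nlp = spacy.load(spacy_model, disable=["ner"])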