ignore strings for dependency parser

I am using special tags to mark the speaker of each piece of text, but the dependency parser treats them as ordinary tokens in the sentence, which messes up the parse. Is there a way to make it ignore them?

Just to clarify: your "text" contains strings that are speaker info and should not be treated as actual text? Do you have an (actual or abstract) example of the data you’re feeding in? I just want to make sure I understand the problem correctly 🙂

Yes, that is correct. It’s metadata about the text that I don’t want the model to use. For example:

"<a> Hello Name! I'll be happy to check this. May i have the model number please? <c> Item # xxxxx model # xxxx <a> Thank you. Just to be sure your zip code is xxxxx correct? <c> Correct"

After sentence splitting I get:

"<a>
Hello Name! 
I'll be happy to check this. 
May i have the model number please?
<c> Item # xxxxx model # xxxx <a> 
Thank you. 
Just to be sure your zip code is xxxxx correct? 
<c>
Correct"

In the end we would like to have a format like this:

"<a> Hello Name! 
<a> I'll be happy to check this. 
<a> May i have the model number please?
<c> Item # xxxxx model # xxxx
<a> Thank you. 
<a> Just to be sure your zip code is xxxxx correct? 
<c> Correct"

I found the docs on the custom sentence splitter, and that did the trick! Once we automatically split sentences at our special tags <a> and <c>, the rest of the job (filling in the tag on every sentence) went smoothly. I’m not sure why add_special_case was needed, though.

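    import spacy
    from spacy.symbols import ORTH

    text = u"<a> that  be difficult.you can inquire  and if you well. <c> thanks but no thanks. signing off <a> thanks for shopping"

    nlp = spacy.load("en_core_web_sm")
    # Register each tag as a single token; otherwise the tokenizer would
    # likely split "<a>" into "<", "a", ">" and the check below never matches.
    nlp.tokenizer.add_special_case(u"<c>", [{ORTH: u"<c>"}])
    nlp.tokenizer.add_special_case(u"<a>", [{ORTH: u"<a>"}])
    doc = nlp(text)
    print('Before:', [sent.text for sent in doc.sents])

    def set_custom_boundaries(doc):
        # Force a sentence start at every speaker tag.
        for token in doc[:-1]:
            if token.text in (u'<a>', u'<c>'):
                token.is_sent_start = True
        return doc

    # spaCy v2 API. In v3 the function has to be registered with
    # @Language.component and added to the pipeline by name.
    nlp.add_pipe(set_custom_boundaries, before='parser')
    doc = nlp(text)
    print('After:', [sent.text for sent in doc.sents])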
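
In case it helps anyone else, here is roughly what the "filling in the tags" step could look like once the sentences are split. This is only a sketch in plain Python, not part of spaCy: `fill_speaker_tags` and its `tags` parameter are names I made up for illustration, and it assumes each sentence either starts with a speaker tag or belongs to the most recent speaker.

```python
def fill_speaker_tags(sentences, tags=("<a>", "<c>")):
    """Prefix every sentence with the most recently seen speaker tag.

    Hypothetical helper: `sentences` is a list of strings, e.g.
    [sent.text for sent in doc.sents].
    """
    current = None  # last speaker tag seen so far
    tagged = []
    for sent in sentences:
        words = sent.split()
        # A sentence may begin with a tag, or consist of a tag only.
        if words and words[0] in tags:
            current = words[0]
            words = words[1:]
        if not words:
            continue  # nothing left after removing the tag
        body = " ".join(words)  # note: this also collapses repeated spaces
        tagged.append(current + " " + body if current else body)
    return tagged


sentences = [
    "<a> Hello Name!",
    "I'll be happy to check this.",
    "May i have the model number please?",
    "<c> Item # xxxxx model # xxxx",
    "<a> Thank you.",
    "Just to be sure your zip code is xxxxx correct?",
    "<c>",
    "Correct",
]
for line in fill_speaker_tags(sentences):
    print(line)
```

Sentences that appear before any tag are left unprefixed; whether that is the right behaviour depends on your data.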