ignore strings for dependency parser

mhigginslp · May 9, 2018, 4:56pm

I am using special tags to designate the speaker of the text, but the dependency parser treats them as part of the sentence and messes up the parse. Is there a way to ignore them?

ines · May 9, 2018, 5:34pm

Just to clarify: Your "text" contains strings that are the speaker info and should not be treated as actual text? Do you have an (actual or abstract) example of the data you’re feeding in? I just want to make sure I understand the problem correctly.

mhigginslp · May 9, 2018, 5:55pm

Yes, that is correct - it's metadata about the text that I don't want the model to use. For example:

"<a> Hello Name! I'll be happy to check this. May i have the model number please? <c> Item # xxxxx model # xxxx <a> Thank you. Just to be sure your zip code is xxxxx correct? <c> Correct"

After sentence splitting I get:

"<a>
Hello Name! 
I'll be happy to check this. 
May i have the model number please?
<c> Item # xxxxx model # xxxx <a> 
Thank you. 
Just to be sure your zip code is xxxxx correct? 
<c>
Correct"

In the end we would like to have a format like this:

"<a> Hello Name! 
<a> I'll be happy to check this. 
<a> May i have the model number please?
<c> Item # xxxxx model # xxxx
<a> Thank you. 
<a> Just to be sure your zip code is xxxxx correct? 
<c> Correct"

mhigginslp · May 9, 2018, 7:14pm

So I found the docs on custom sentence boundaries, and that did the trick! Once we automatically split sentences at our special tags <a> and <c>, the rest of the job of filling out the tags went smoothly. I’m not sure why the add_special_case was needed.

    import spacy
    from spacy.symbols import ORTH

    text = u"<a> that  be difficult.you can inquire  and if you well. <c> thanks but no thanks. signing off <a> thanks for shopping"

    nlp = spacy.load("en_core_web_sm")
    # Register the speaker tags as special cases so the tokenizer emits each one as a single token.
    nlp.tokenizer.add_special_case(u"<c>", [{ORTH: u"<c>"}])
    nlp.tokenizer.add_special_case(u"<a>", [{ORTH: u"<a>"}])
    doc = nlp(text)
    print('Before:', [sent.text for sent in doc.sents])

    def set_custom_boundaries(doc):
        # Start a new sentence at every speaker tag.
        for token in doc[:-1]:
            if token.text in (u'<a>', u'<c>'):
                token.is_sent_start = True
        return doc

    # Run before the parser so it respects the boundaries set here (spaCy v2 add_pipe call).
    nlp.add_pipe(set_custom_boundaries, before='parser')
    doc = nlp(text)
    print('After:', [sent.text for sent in doc.sents])
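
The step of filling out the tags is not shown in the thread. As a rough sketch of how it might look, assuming the sentences are already available as plain strings (for example `[sent.text for sent in doc.sents]`) and that only the tags `<a>` and `<c>` occur, the last tag seen can be carried forward onto each following sentence. The helper name `fill_speaker_tags` is made up for this illustration and is not part of spaCy.

```python
import re

# Assumption: the only speaker tags are <a> and <c>, as in the example data.
TAG = re.compile(r"<[ac]>")

def fill_speaker_tags(sentences):
    """Prefix every sentence with the most recently seen speaker tag."""
    current = None  # no speaker known until the first tag appears
    tagged = []
    for sent in sentences:
        sent = sent.strip()
        match = TAG.match(sent)
        if match:
            # The sentence opens with a tag: remember it and strip it off.
            current = match.group(0)
            sent = sent[match.end():].strip()
        if not sent:
            continue  # skip sentences that consisted of a tag only
        tagged.append(f"{current} {sent}" if current else sent)
    return tagged

sentences = [
    "<a> Hello Name!",
    "I'll be happy to check this.",
    "<c> Item # xxxxx model # xxxx",
    "<a> Thank you.",
    "<c> Correct",
]
for line in fill_speaker_tags(sentences):
    print(line)
```

This relies on each tag sitting at the start of its sentence, which is what the custom boundary component is meant to guarantee.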
