I'm looking to label dependencies in chat messages and want to improve the sentence breaks produced by the spaCy parser. I currently use a combination of my own sentence breaks and overrides of the parser. At this point I'd like to improve the parser itself so that it separates the phrases in a message better, but I haven't been able to find an example or any guidance on how to proceed.
- how does the dependency parser decide when to break a sentence?
- what considerations should I make if I build a custom dependency parser to get sentence breaks working?
I'm assuming you're using spaCy here.
The dependency parser in the default models is trained to predict sentence boundaries jointly with the rest of the parse. The parser uses a transition-based formulation, which means that it works as a state machine, and the learning problem is to predict which action to take given the current state. There are actions to push and pop words from a stack and a queue, to add arcs between words on the stack and the queue, and to insert sentence breaks. A similar approach is described here: https://www.aclweb.org/anthology/P16-1181/
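To make the state-machine idea concrete, here is a toy sketch in plain Python. It is not spaCy's actual transition system: the action names (`SHIFT`, `LEFT`, `RIGHT`, `BREAK`), the `State` class and the hand-written action sequence are all assumptions made up for illustration. In a real parser a trained model would choose each action from the current state.

```python
from dataclasses import dataclass, field


@dataclass
class State:
    """Toy parser state: word indices on a buffer (queue) and a stack."""
    buffer: list
    stack: list = field(default_factory=list)
    arcs: list = field(default_factory=list)              # (head, child) pairs
    sent_starts: set = field(default_factory=lambda: {0})  # word 0 always starts a sentence


def apply(state, action):
    """Apply one (hypothetical) transition to the state in place."""
    if action == "SHIFT":
        # Move the next word from the buffer onto the stack.
        state.stack.append(state.buffer.pop(0))
    elif action == "LEFT":
        # The next buffer word becomes the head of the top stack word.
        state.arcs.append((state.buffer[0], state.stack.pop()))
    elif action == "RIGHT":
        # The top stack word becomes the head of the next buffer word.
        child = state.buffer.pop(0)
        state.arcs.append((state.stack[-1], child))
        state.stack.append(child)
    elif action == "BREAK":
        # Start a new sentence at the next buffer word and close the old one.
        state.sent_starts.add(state.buffer[0])
        state.stack.clear()
    else:
        raise ValueError(f"unknown action: {action}")


def run(words, actions):
    """Run a fixed action sequence over the words and return the final state."""
    state = State(buffer=list(range(len(words))))
    for action in actions:
        apply(state, action)
    return state


words = ["got", "it", "call", "me"]
# Hand-written actions; a trained model would predict these one at a time.
final = run(words, ["SHIFT", "RIGHT", "BREAK", "SHIFT", "RIGHT"])
```

Because the break is just another action, the same model that decides on arcs also decides where sentences end, and clearing the stack on `BREAK` is what keeps arcs from crossing the boundary in this sketch.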
You don't need to understand the details of the sentence boundary detection (SBD) model, however. The way I would recommend improving the sentence breaks is probably to insert a component before the parser that sets some or all of the `token.is_sent_start` attributes. This attribute takes a ternary value in `(None, True, False)`, where `None` indicates the information is missing. The parser will respect previously set `True` and `False` values and come up with a parse structure that respects those boundaries (so no dependency arcs will cross a preset sentence boundary, and sentence breaks will not be inserted on words set to `False`).
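A minimal sketch of such a component, assuming the spaCy v3 API (`Language.component` and `nlp.add_pipe`). The component name `chat_sentence_starts` and the `BREAK_AFTER` rule are hypothetical; substitute whatever signal marks a phrase boundary in your chat data. A blank English pipeline is used here so the example runs without downloading a model; with a trained pipeline you would load that instead, and the component would be added before its parser.

```python
import spacy
from spacy.language import Language

# Hypothetical rule: always start a new sentence after these tokens.
BREAK_AFTER = {"...", "|"}


@Language.component("chat_sentence_starts")
def chat_sentence_starts(doc):
    """Preset sentence starts; every other token is left as None (undecided)."""
    for token in doc[:-1]:
        if token.text in BREAK_AFTER:
            doc[token.i + 1].is_sent_start = True
    return doc


# Assumption: a blank pipeline for demonstration. With a trained pipeline,
# use spacy.load(...) instead so that a parser is present.
nlp = spacy.blank("en")

# Place the component before the parser when the pipeline has one.
before = "parser" if "parser" in nlp.pipe_names else None
nlp.add_pipe("chat_sentence_starts", before=before)

doc = nlp("lol ok ... anyway call me later")
sentences = [sent.text for sent in doc.sents]
```

Leaving tokens at `None` rather than setting them to `False` matters: the parser is still free to add its own breaks at those positions, so the component only overrides the cases your rule is sure about.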
There are also a number of sentence segmenters by third parties in the spaCy universe. You could give those a try to see if they work well on your data.