Disable sentence boundary detection in spaCy parser

My text input to spaCy is already in one-sentence-per-line format, so I would like to switch off sentence boundary detection in the parser.

Is there a config setting that controls sentence boundary detection in the parser? If not, is there a workaround that lets the parser assign dependency tags without doing sentence boundary detection?

I would like to take advantage of the dependency tags generated by the parser, so I believe excluding the parser from my pipeline is not the way to go.

Thanks!

I think there are two options here.

  1. You could set up your own custom model, perhaps using pySBD for the sentence splitting, and save it to disk. You can then refer to this saved model in your ner recipes.
  2. You could write a custom recipe that takes care of the sentences in the loop. It might use something like:
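import srsly

examples = srsly.read_jsonl("path/to/file.jsonl")

def split_sentence(text):
    # Placeholder: use your own split_sentence implementation here.
    # This version treats every non-empty line as one sentence.
    return [line.strip() for line in text.splitlines() if line.strip()]

def sentence_stream(examples):
    # Yield one task per sentence, flattened across all examples
    for example in examples:
        for sentence in split_sentence(example["text"]):
            yield {"text": sentence}

stream = sentence_stream(examples)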

Let me know if this doesn't work or if I'm misinterpreting your problem.
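
For the first option, here is a minimal sketch of one way to keep spaCy from splitting a line into several sentences, assuming spaCy v3. A custom component marks every token after the first as not starting a sentence, and the parser respects boundaries that are already set when it runs. The component name one_sentence_per_doc is made up for this example, and a blank pipeline is used only so the snippet runs without downloading a trained model.

```python
import spacy
from spacy.language import Language


# "one_sentence_per_doc" is a hypothetical name chosen for this sketch.
@Language.component("one_sentence_per_doc")
def one_sentence_per_doc(doc):
    # Mark every token after the first as "not a sentence start",
    # so the whole Doc is treated as a single sentence.
    for token in doc[1:]:
        token.is_sent_start = False
    return doc


# Blank pipeline to keep the example self-contained. With a trained
# pipeline you would load it instead and register the component with
# nlp.add_pipe("one_sentence_per_doc", before="parser").
nlp = spacy.blank("en")
nlp.add_pipe("one_sentence_per_doc")

doc = nlp("First fragment. Second fragment on the same line.")
sentences = list(doc.sents)
print(len(sentences))
```

The resulting pipeline can be saved with nlp.to_disk and referenced from a recipe, as long as the code that registers the component is importable wherever the pipeline is loaded.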

I tried the first approach and it worked as expected. However, the text I am working with is not well formed (it is more like a bunch of sentence fragments, such as text extracted from the cells of a table), so dependency tags are not that useful.

So I am now using the blank model (blank:en) with the ner.manual recipe and am very satisfied with the results.
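
For reference, a small sketch (assuming spaCy v3) of why the blank model sidesteps the issue: a blank English pipeline contains only the tokenizer, so there is no parser or sentence segmenter and no sentence boundaries are ever assigned.

```python
import spacy

# A blank pipeline has a tokenizer but no trained components.
nlp = spacy.blank("en")
doc = nlp("Total revenue 2020")

# No components means nothing sets sentence boundaries on the Doc.
pipe_names = nlp.pipe_names
has_sentence_boundaries = doc.has_annotation("SENT_START")
print(pipe_names, has_sentence_boundaries)
```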

Thanks much for your help @koaning
