My text input to spaCy is already in one-sentence-per-line format, so I would like to switch off sentence boundary detection in the parser.
Is there a config setting that controls sentence boundary detection by the parser? If not, is there a workaround that lets the parser assign dependency tags without doing sentence boundary detection?
I would like to take advantage of the dependency tags generated by the parser, so I believe excluding the parser from my pipeline is not the way to go.
You could set up your own custom model with its own sentence splitting, maybe using pySBD, and save that to disk. You can then refer to this saved model in your ner recipes.
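For this first option, here is a rough sketch of what such a pipeline might look like, assuming spaCy v3. The component name one_sentence_per_doc is made up for illustration, and it uses a blank English pipeline rather than pySBD: it simply marks the first token as the only sentence start, so each input text is treated as exactly one sentence. In a trained pipeline you would add it before the parser (add_pipe accepts before="parser"), since the parser is documented to respect boundaries that are already set, but I have not verified that on your data.

```python
import spacy
from spacy.language import Language


# Hypothetical component name, chosen only for this sketch.
@Language.component("one_sentence_per_doc")
def one_sentence_per_doc(doc):
    # Mark the first token as a sentence start and every other token as
    # "not a sentence start", so the whole text is a single sentence.
    for token in doc:
        token.is_sent_start = token.i == 0
    return doc


# A blank pipeline has no parser; with a trained pipeline you would
# instead call nlp.add_pipe("one_sentence_per_doc", before="parser").
nlp = spacy.blank("en")
nlp.add_pipe("one_sentence_per_doc")

doc = nlp("Revenue up 4 percent. Costs flat")
sentences = [sent.text for sent in doc.sents]
print(sentences)
```

The pipeline can then be saved with nlp.to_disk(...). When it is loaded again, the decorated function has to be imported first so that spaCy can find the component by name.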
You could write a custom recipe that takes care of the sentences in the loop. It might use something like:
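import srsly

def split_sentence(text):
    # Placeholder: use your own implementation here.
    # This one assumes one sentence per line.
    return [line.strip() for line in text.splitlines() if line.strip()]

def sentence_stream(example):
    for sentence in split_sentence(example["text"]):
        yield {"text": sentence}

examples = srsly.read_jsonl("path/to/file.jsonl")
# Flatten, so the stream yields one task dict per sentence
stream = (task for ex in examples for task in sentence_stream(ex))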
Let me know if this doesn't work or if I'm misinterpreting your problem.
I tried the first approach and it worked as expected. However, the text I am working with is not well formed (it is more like a bunch of sentence fragments, such as text extracted from the cells of a table), so dependency tags are not that useful.
So I am now using the blank model (blank:en) with the ner.manual recipe, and I am very satisfied with the results.