Custom sentence boundary detection

Hello,
in an old post @honnibal said that I could use a NER model to detect the end of a sentence.
Basically, I would like to tag it with an EOS label and then create a custom component that sets is_sent_start to True. That part is quite easy.
Unfortunately I cannot use the sentencizer; I have to use a custom model to detect the boundaries.
I do that because my documents are too long: the OS kills the spaCy process after a few minutes, so I have to reduce the length.
The problem is that I now have the same problem training the model for the EOS label. I pass the whole document to train it, and the OS kills it again.
Can I arbitrarily truncate the documents and train the EOS label on the chunks?
For example, could I create the training corpus with the 10 tokens before the EOS and the 10 tokens after?
I read that spaCy uses the previous/next 4 tokens to predict the current token, so 10 tokens + EOS + 10 tokens should be enough, no?
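For reference, the component I have in mind is roughly this (an untested sketch; "my_eos_model" is a placeholder for a spaCy v2 model trained with my EOS entity label, and I'm assuming there is no parser in the pipeline, since is_sent_start can't be changed on a parsed doc):

import spacy

def eos_to_sent_start(doc):
    # mark the token right after each EOS entity as the start of a new sentence
    for ent in doc.ents:
        if ent.label_ == "EOS" and ent.end < len(doc):
            doc[ent.end].is_sent_start = True
    return doc

nlp = spacy.load("my_eos_model")              # placeholder model name
nlp.add_pipe(eos_to_sent_start, after="ner")  # v2-style add_pipe with a plain function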

Thanks

That would work, yes, although it might be a bit inefficient. Is there no way you can cut the text up into documents using some rules that don’t over-segment? For instance, maybe you can have a rule that inserts a sentence boundary if you have a period followed by two newlines followed by a capital letter. That might make your documents short enough to process without over-segmenting.
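As a rough, untested sketch of that kind of rule (the regex and names are just examples), you could pre-split the raw text before it ever reaches the pipeline:

import re

# split wherever a period is followed by a blank line and then a capital letter
BOUNDARY = re.compile(r"(?<=\.)\n\n(?=[A-Z])")

def split_text(text):
    return [chunk for chunk in BOUNDARY.split(text) if chunk.strip()]

# each chunk can then be processed as its own document, e.g.:
# docs = [nlp(chunk) for chunk in split_text(raw_text)]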

Otherwise, just use the sentence tokenizer in NLTK? You can use it to get the sentence boundaries, and then make a Doc object for the sentences.
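For example (a rough sketch; "resume.txt" and the en_core_web_sm model are just placeholders, and you need the NLTK Punkt data installed):

import spacy
from nltk.tokenize import sent_tokenize  # requires: nltk.download("punkt")

raw_text = open("resume.txt").read()      # placeholder input file
nlp = spacy.load("en_core_web_sm")        # any pipeline will do; this is just an example

sentences = sent_tokenize(raw_text)       # NLTK finds the sentence boundaries
docs = list(nlp.pipe(sentences))          # one spaCy Doc per NLTK sentence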

@honnibal Hmm, I don’t know if I can; I’m parsing resumes/CVs that have A LOT of punctuation, newlines, etc. By the way, if I do find some rules, I can just set is_sent_start to True using a custom component, right?

…out of curiosity, does the NER component reset its weights when it meets is_sent_start? Doesn’t it “share” weights with the following sentences?

If you’re parsing resumes/CVs, how are the documents so long that you run out of memory? You’re not concatenating multiple files together, are you? If so, then the answer should be pretty easy: just don’t do that.

And yes, you can set the is_sent_start attribute to True. The attribute does have some interaction with the NER, but only in that the NER is constrained not to predict entities that cross sentence boundaries.
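For instance, a rule-based component could look something like this (a minimal, untested sketch; the blank-line rule is just an example, and it assumes the component runs before anything that relies on sentence boundaries):

def newline_boundaries(doc):
    for token in doc[:-1]:
        # example rule: two or more newlines in a whitespace token start a new sentence
        if token.is_space and token.text.count("\n") >= 2:
            doc[token.i + 1].is_sent_start = True
    return doc

nlp.add_pipe(newline_boundaries, before="ner")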

@honnibal I think it’s because there is noise in the data: many tokens made of a single character. The reason is the text extraction via Tika. For example, I often read C U R R I C U L U M instead of the word “curriculum”; the first is 10 tokens long, the second only 1. It’s just an example, but working with extracted text gives me this kind of problem. I think the memory error with spaCy happens because there are so many tokens (for the reason explained above), so it’s the number of tokens that causes the problem, not (only) the length of the documents. No?

I know I should build something to normalise the data before processing it; in a perfect world I would do that, but it is not easy… I would have to reassemble the words, which doesn’t sound very easy.
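(To give an idea of what I mean, a very rough sketch like the one below could collapse the spaced-out letters back into words, but it would also mangle genuine single-letter sequences such as initials, so it needs care:)

import re

# collapse runs of single letters separated by single spaces,
# e.g. "C U R R I C U L U M" -> "CURRICULUM"
SPACED_WORD = re.compile(r"\b(?:[A-Za-z] ){2,}[A-Za-z]\b")

def collapse_spaced_letters(text):
    return SPACED_WORD.sub(lambda m: m.group(0).replace(" ", ""), text)

print(collapse_spaced_letters("C U R R I C U L U M VITAE"))  # -> "CURRICULUM VITAE"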

…oh, I forgot to answer about the concatenation. At the moment I do:

import random

from spacy.util import minibatch, compounding

# nlp, TRAIN_DATA and N_ITER are defined earlier in the script

for itn in range(N_ITER):
    random.shuffle(TRAIN_DATA)
    losses = {}
    # batch up the examples using spaCy's minibatch
    batches = minibatch(TRAIN_DATA, size=compounding(4.0, 32.0, 1.001))

    for batch in batches:
        texts, annotations = zip(*batch)
        nlp.update(
            texts,  # batch of texts
            annotations,  # batch of annotations
            drop=0.2,  # dropout - make it harder to memorise data
            losses=losses,
        )

Looks familiar? :slight_smile:
Maybe I should reduce the compounding?

@honnibal I have just checked: the longest document has 273,709 tokens. Too much?
I think the error occurs when the compounding reaches 32 documents per batch. Should I just decrease the stop parameter of compounding()?
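Concretely, something like this (8 is just a guess at a safer cap):

batches = minibatch(TRAIN_DATA, size=compounding(4.0, 8.0, 1.001))  # cap batches at 8 docs instead of 32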

That’s a lot of tokens! Can you just exclude those really long documents? There’s probably only a couple that mess things up.

And I would try to get the cleaning right. It might be difficult but it’ll be worth it.

@honnibal OK, so… I can use the whole document (instead of segmenting per sentence), but I have to remove the long documents from the training corpus. Right?
What is a reasonable number of tokens that spaCy can handle without problems? (I have 32 GB of RAM on my PC.)

@honnibal I have removed the documents with more than 50,000 tokens. I will let you know if the problem persists.
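(The filter is roughly this, using only the tokenizer so the full pipeline doesn’t run just to count tokens:)

MAX_TOKENS = 50000
TRAIN_DATA = [
    (text, annots) for text, annots in TRAIN_DATA
    if len(nlp.make_doc(text)) <= MAX_TOKENS  # make_doc runs just the tokenizer
]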

…after a few seconds it reached 13 GB of memory usage.

UPDATE: 18.5 GB