Custom sentence boundary detection

Hello,
in an old post @honnibal said that I could use a NER model to detect the end of a sentence.
Basically, I would like to tag it with an EOS label and then create a custom component that sets is_sent_start to True. That part is quite easy.
Unfortunately I cannot use the sentencizer; I have to use a custom model to detect the boundaries.
I do that because my documents are too long: the OS kills the spaCy process after a few minutes, so I have to reduce the length.
The problem is that I now have the same problem training the model for the EOS label. I pass the whole document to train it, and the OS kills it again.
Can I arbitrarily truncate the documents and train the EOS label on the chunks?
For example, could I create the training corpus with the 10 tokens before the EOS and the 10 tokens after?
I read that spaCy uses the previous/next 4 tokens to predict the current token, so 10 tokens + EOS + 10 tokens should be enough, no?
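For reference, the component I have in mind is roughly this (an untested sketch; "my_eos_model" is a placeholder for a spaCy v2 model trained with my EOS entity label, and I'm assuming there is no parser in the pipeline, since is_sent_start can't be changed on a parsed doc):

import spacy

def eos_to_sent_start(doc):
    # mark the token right after each EOS entity as the start of a new sentence
    for ent in doc.ents:
        if ent.label_ == "EOS" and ent.end < len(doc):
            doc[ent.end].is_sent_start = True
    return doc

nlp = spacy.load("my_eos_model")              # placeholder model name
nlp.add_pipe(eos_to_sent_start, after="ner")  # v2-style add_pipe with a plain function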

Thanks

That would work, yes, although it might be a bit inefficient. Is there no way you can cut the text up into documents using some rules that don’t over-segment? For instance, maybe you can have a rule that inserts a sentence boundary if you have a period followed by two newlines followed by a capital letter. That might make your documents short enough to process without over-segmenting.
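As a rough, untested sketch of that kind of rule (the regex and names are just examples), you could pre-split the raw text before it ever reaches the pipeline:

import re

# split wherever a period is followed by a blank line and then a capital letter
BOUNDARY = re.compile(r"(?<=\.)\n\n(?=[A-Z])")

def split_text(text):
    return [chunk for chunk in BOUNDARY.split(text) if chunk.strip()]

# each chunk can then be processed as its own document, e.g.:
# docs = [nlp(chunk) for chunk in split_text(raw_text)]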

Otherwise, just use the sentence tokenizer in NLTK? You can use it to get the sentence boundaries, and then make a Doc object for the sentences.
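For example (a rough sketch; "resume.txt" and the en_core_web_sm model are just placeholders, and you need the NLTK Punkt data installed):

import spacy
from nltk.tokenize import sent_tokenize  # requires: nltk.download("punkt")

raw_text = open("resume.txt").read()      # placeholder input file
nlp = spacy.load("en_core_web_sm")        # any pipeline will do; this is just an example

sentences = sent_tokenize(raw_text)       # NLTK finds the sentence boundaries
docs = list(nlp.pipe(sentences))          # one spaCy Doc per NLTK sentence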

@honnibal Hmm, I don’t know if I can; I’m parsing resumes/CVs that have A LOT of punctuation, newlines, etc. By the way, if I do find some rules, I can just set is_sent_start to True using a custom component, right?

…out of curiosity, does the NER component reset its weights when it meets is_sent_start? Doesn’t it “share” weights with the following sentences?

If you’re parsing resumes/CVs, how are the documents so long that you run out of memory? You’re not concatenating multiple files together, are you? If so, then the answer should be pretty easy: just don’t do that.

And yes, you can set the is_sent_start attribute to True. The attribute does have some interaction with the NER, but only in that the NER is constrained not to predict entities that cross sentence boundaries.
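For instance, a rule-based component could look something like this (a minimal, untested sketch; the blank-line rule is just an example, and it assumes the component runs before anything that relies on sentence boundaries):

def newline_boundaries(doc):
    for token in doc[:-1]:
        # example rule: two or more newlines in a whitespace token start a new sentence
        if token.is_space and token.text.count("\n") >= 2:
            doc[token.i + 1].is_sent_start = True
    return doc

nlp.add_pipe(newline_boundaries, before="ner")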

@honnibal I think it’s because there is noise in the data: many tokens made of a single character. The reason is the text extraction via Tika. For example, I often read C U R R I C U L U M instead of the word “curriculum”; the first is 10 tokens long, the second only 1. It’s just an example, but working with extracted text gives me this kind of problem. I think the memory error with spaCy happens because there are so many tokens (for the reason explained above), so it’s the number of tokens that causes the problem, not (only) the length of the documents. No?

I know I should build something to normalise the data before processing it; in a perfect world I would do that, but it is not easy… I would have to reassemble the words, which doesn’t sound very easy.
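(To give an idea of what I mean, a very rough sketch like the one below could collapse the spaced-out letters back into words, but it would also mangle genuine single-letter sequences such as initials, so it needs care:)

import re

# collapse runs of single letters separated by single spaces,
# e.g. "C U R R I C U L U M" -> "CURRICULUM"
SPACED_WORD = re.compile(r"\b(?:[A-Za-z] ){2,}[A-Za-z]\b")

def collapse_spaced_letters(text):
    return SPACED_WORD.sub(lambda m: m.group(0).replace(" ", ""), text)

print(collapse_spaced_letters("C U R R I C U L U M VITAE"))  # -> "CURRICULUM VITAE"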

…oh, I forgot to answer about the concatenation. At the moment I do:

import random

from spacy.util import minibatch, compounding

# nlp, TRAIN_DATA and N_ITER are defined earlier in the script

for itn in range(N_ITER):
    random.shuffle(TRAIN_DATA)
    losses = {}
    # batch up the examples using spaCy's minibatch
    batches = minibatch(TRAIN_DATA, size=compounding(4.0, 32.0, 1.001))

    for batch in batches:
        texts, annotations = zip(*batch)
        nlp.update(
            texts,  # batch of texts
            annotations,  # batch of annotations
            drop=0.2,  # dropout - make it harder to memorise data
            losses=losses,
        )

Looks familiar? :slight_smile:
Maybe I should reduce the compounding?

@honnibal I have just checked: the longest document has 273,709 tokens. Too much?
I think the error occurs when the compounding reaches 32 documents per batch. Should I just decrease the stop parameter of compounding()?
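Concretely, something like this (8 is just a guess at a safer cap):

batches = minibatch(TRAIN_DATA, size=compounding(4.0, 8.0, 1.001))  # cap batches at 8 docs instead of 32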

That’s a lot of tokens! Can you just exclude those really long documents? There’s probably only a couple that mess things up.

And I would try to get the cleaning right. It might be difficult but it’ll be worth it.

@honnibal OK, so… I can use the whole document (instead of segmenting per sentence), but I have to remove the long documents from the training corpus. Right?
What is a reasonable number of tokens that spaCy can handle without problems? (I have 32 GB of RAM on my PC.)

@honnibal I have removed the documents with more than 50,000 tokens. I will let you know if the problem persists.
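(The filter is roughly this, using only the tokenizer so the full pipeline doesn’t run just to count tokens:)

MAX_TOKENS = 50000
TRAIN_DATA = [
    (text, annots) for text, annots in TRAIN_DATA
    if len(nlp.make_doc(text)) <= MAX_TOKENS  # make_doc runs just the tokenizer
]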

…after a few seconds it reached 13 GB of memory usage.

UPDATE: 18.5 GB