Hello,
In an old post, @honnibal said that I could use a NER model to detect the end of a sentence.
Basically, I would like to tag the boundary token with an EOS label and then create a custom component that sets is_sent_start to True. That part is quite easy.
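Something like this is what I have in mind for the component (just a sketch; it assumes the NER predicts an entity with an EOS label on the boundary token, and that no parser runs before this component):

def eos_boundaries(doc):
    # Mark the token right after each predicted EOS entity as a sentence start.
    for ent in doc.ents:
        if ent.label_ == "EOS" and ent.end < len(doc):
            doc[ent.end].is_sent_start = True
    return doc

nlp.add_pipe(eos_boundaries, after="ner")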
Unfortunately I cannot use the sentencizer; I have to use a custom model to detect the boundaries.
I'm doing this because my documents are too long: the OS kills the spaCy process after a few minutes, so I have to reduce their length.
The problem is that I now hit the same issue when training the model for the EOS label: I pass the whole document to train it, and the OS kills the process again.
Can I arbitrarily truncate the document and train the EOS label on the chunks?
For example, could I create the training corpus with 10 tokens before the EOS and 10 tokens after?
I read that spaCy uses the previous/next 4 tokens to predict the current token, so 10 tokens + EOS + 10 tokens should be enough, no?
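To make the question concrete, this is roughly how I would build the chunks (a sketch; eos_indices is a hypothetical list of the boundary token indices that I already have in my annotations):

def make_chunks(doc, eos_indices, window=10):
    # Build (text, annotations) pairs from a window of tokens around each EOS token.
    examples = []
    for i in eos_indices:
        span = doc[max(0, i - window):min(len(doc), i + window + 1)]
        # character offsets of the EOS token relative to the chunk text
        start = doc[i].idx - span.start_char
        end = start + len(doc[i].text)
        examples.append((span.text, {"entities": [(start, end, "EOS")]}))
    return examples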
That would work, yes – although it might be a bit inefficient? Is there no way you can cut the text up into documents using some rules that don't over-segment? For instance, maybe you can have a rule that inserts a sentence boundary when a period is followed by two newlines followed by a capital letter. That might make your documents short enough to process, without over-segmenting.
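Something along these lines would do it (just a sketch of that one rule; you'll want to tune the pattern for your data):

import re

# Split where a period is followed by two newlines and then a capital letter.
BOUNDARY = re.compile(r"(?<=\.)\n\n(?=[A-Z])")

def split_into_docs(text):
    return [chunk for chunk in BOUNDARY.split(text) if chunk.strip()]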
Otherwise, just use the sentence tokenizer in NLTK? You can use it to get the sentence boundaries, and then make a Doc object for the sentences.
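For instance (a rough sketch, assuming you've downloaded NLTK's punkt data):

import nltk

def docs_from_sentences(nlp, text):
    # Let NLTK find the sentence boundaries, then process each sentence
    # as its own (short) Doc.
    sentences = nltk.sent_tokenize(text)
    return list(nlp.pipe(sentences))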
@honnibal Hmm, I don't know if I can; I am parsing resumes/CVs that have A LOT of punctuation, newlines, etc. By the way, if I find some rules, I can just set is_sent_start to True using a custom component, right?
If you’re parsing resumes/CVs, how are the documents so long that you run out of memory? You’re not concatenating multiple files together, are you? If so, then the answer should be pretty easy – just don’t do that.
And yes, you can set the is_sent_start attribute to True. The attribute does have some interaction with the NER, but only in that the NER is constrained not to predict entities that cross sentence boundaries.
@honnibal I think it's because there is noise in the data: many tokens consist of a single character. The cause is the text extraction via Tika. For example, I often get C U R R I C U L U M instead of the word “curriculum”; the first is 10 tokens long, the second only 1. It's just one example, but working with extracted text gives me this kind of problem. I think the memory error with spaCy happens because there are so many tokens (for the reason explained above); the number of tokens causes the problem, not (only) the length of the documents. No?
I know I should build something to normalise the data before processing it; in a perfect world I would, but it is not easy… I would have to reassemble the words, and that does not sound trivial.
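The simple cases could probably be caught with a crude heuristic like the one below (just a sketch, and it would also merge genuinely separate one-letter tokens), but there is a lot of other noise too:

import re

# Collapse runs of single letters separated by spaces,
# e.g. "C U R R I C U L U M" -> "CURRICULUM".
SPACED = re.compile(r"\b(?:[A-Za-z] ){2,}[A-Za-z]\b")

def collapse_spaced_words(text):
    return SPACED.sub(lambda m: m.group(0).replace(" ", ""), text)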
…oh, I forgot to answer about the concatenation. At the moment I do:
import random
from spacy.util import minibatch, compounding

for itn in range(N_ITER):
    random.shuffle(TRAIN_DATA)
    losses = {}
    # batch up the examples using spaCy's minibatch
    batches = minibatch(TRAIN_DATA, size=compounding(4.0, 32.0, 1.001))
    for batch in batches:
        texts, annotations = zip(*batch)
        nlp.update(
            texts,        # batch of texts
            annotations,  # batch of annotations
            drop=0.2,     # dropout - make it harder to memorise data
            losses=losses,
        )
Looks familiar?
Maybe I should reduce the compounding?
@honnibal I have just checked: the longest document has 273,709 tokens. Too much?
I think the error occurs when the compounding reaches 32 documents per batch. Should I just decrease the stop parameter of compounding()?
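I mean something like this (the 8.0 ceiling is just a guess):

# Cap the batch size at 8 documents instead of 32.
batches = minibatch(TRAIN_DATA, size=compounding(4.0, 8.0, 1.001))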
@honnibal OK, so… I can use the whole documents (instead of segmenting per sentence), but I must remove the long documents from the training corpus. Right?
What is a reasonable number of tokens that spaCy can handle without problems? (I have 32 GB of RAM on my PC.)
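In the meantime, this is roughly how I would filter them out (a sketch; the 100,000-token cutoff is an arbitrary placeholder, not a known limit):

# Drop training documents that exceed an arbitrary token budget.
MAX_TOKENS = 100_000
TRAIN_DATA = [
    (text, annotations)
    for text, annotations in TRAIN_DATA
    if len(nlp.make_doc(text)) <= MAX_TOKENS
]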