I think a BiLSTM model is probably a good idea, although you might have problems depending on how long your texts are.
You could try to true-case the text by, for each word, picking the form that has the highest probability: lower-case, title-case or upper-case. Something like this:
forms = [token.text.lower(), token.text.upper(), token.text[0].upper() + token.text[1:].lower()]
probs = [token.vocab[form].prob for form in forms]
prob, form = max(zip(probs, forms))
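Here's a runnable version of the same idea, with a toy log-probability table standing in for a real spaCy vocab (the table entries and the -20.0 out-of-vocabulary default are just stand-ins; in spaCy you'd read `token.vocab[form].prob` from a model with word frequencies):

```python
# Toy log-probability table standing in for token.vocab[form].prob.
# Real values would come from a spaCy vocab with frequency data.
LOG_PROBS = {"the": -3.0, "The": -6.5, "THE": -12.0, "nasa": -14.0, "NASA": -8.0}

def true_case(word):
    """Pick the casing variant with the highest log probability."""
    forms = [word.lower(), word.upper(), word[0].upper() + word[1:].lower()]
    probs = [LOG_PROBS.get(form, -20.0) for form in forms]  # -20.0: assumed OOV default
    prob, form = max(zip(probs, forms))
    return form

print(true_case("THE"))   # "the"
print(true_case("nasa"))  # "NASA"
```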
This will probably help you get better performance from the tagger and parser, which might help you normalize further.
Another strategy would be to run spaCy over normal text to get the predicted POS tags and dependencies, and then corrupt that text so that it looks like yours. Then train a model that has to predict those parses given the corrupted text. You could then run this new model over your documents as they are.
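A minimal sketch of the corruption step, assuming your un-normalized text mostly lacks punctuation and capitalisation. The rates and the punctuation set here are hypothetical — you'd tune them so the corrupted output matches your actual data:

```python
import random
import re

def corrupt(text, drop_punct=0.9, lowercase=0.9, seed=0):
    """Corrupt clean text so it resembles un-normalized input:
    randomly strip punctuation and lowercase the whole string."""
    rng = random.Random(seed)
    if rng.random() < drop_punct:
        text = re.sub(r"[.,;:!?]", "", text)
    if rng.random() < lowercase:
        text = text.lower()
    return text

clean = "The meeting is on Tuesday. Bring the report."
print(corrupt(clean))
```

You'd then pair each corrupted string with the parse predicted on the clean original, and use those pairs as training data.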
Sentence splitting will generally assume the text is in a reasonably normalized form, so the articles are unlikely to help you.
It’s probably better to just normalize the text: fix the capitalisation, predict the punctuation and add it back in, etc. If you do know that a word ends a sentence, you can set the following token’s is_sent_start attribute to True. This forces spaCy to predict a sentence boundary at that token, and prevents the NER model from predicting an entity that spans over it.
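To illustrate that marking logic in plain Python: the KNOWN_TERMINATORS set below is a hypothetical, domain-specific list of words you know always end a sentence. In a real pipeline you'd set `doc[i].is_sent_start = True` inside a component added to the pipeline before the parser:

```python
# Hypothetical: words we know always end a sentence in this data.
KNOWN_TERMINATORS = {"stop", "over"}

def mark_sent_starts(tokens):
    """Return indices where a sentence must start: wherever the previous
    token is a known terminator. With spaCy you'd instead set
    doc[i].is_sent_start = True in a component placed before the parser."""
    starts = [0]  # the first token always starts a sentence
    for i in range(1, len(tokens)):
        if tokens[i - 1].lower() in KNOWN_TERMINATORS:
            starts.append(i)
    return starts

tokens = "send the report over begin phase two stop".split()
print(mark_sent_starts(tokens))  # [0, 4]
```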
That idea crossed my mind, but I don’t know exactly how sentence splitting works by default. So far I understand that it happens after POS tagging, probably inside the “parser” pipe (is that right?). But what exactly is taken into account isn’t clear to me.
If I understand how it works, I can try to trick the default sentence splitter into splitting my text as if it were normal text.
From what I understand, there are 3 main “features” used for the splitting: shape, punctuation token (.) and dependencies. Am I missing something?
My current idea is to train the model to recognise punctuation better (along with the other POS tags) and then potentially lowercase the whole text. Depending on how the sentence splitting works, I might just retrain the POS tagger on my text using Prodigy.
After that I will implement some rule-based sentence splitting (there are some rules I can strictly define), and fall back to the default sentence splitting for the rest.
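A sketch of what that rule-based pass might look like. Both rules here are made up for illustration (a terminator word and a timestamp pattern); the idea is to apply the strict, hand-written rules first and leave whatever they don't cover to the statistical splitter:

```python
import re

# Hypothetical domain rules: "stop" ends a sentence, a timestamp starts one.
RULES = [
    re.compile(r"(?<=\bstop)\s+"),
    re.compile(r"\s+(?=\d{2}:\d{2})"),
]

def rule_split(text):
    """Apply strict, hand-written rules; segments the rules don't
    cover would fall through to the default statistical splitter."""
    pieces = [text]
    for rule in RULES:
        pieces = [seg for piece in pieces for seg in rule.split(piece) if seg]
    return pieces

print(rule_split("report received stop 14:30 begin phase two"))
```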
The decision is made jointly with the syntactic structure. So, it’s trying to figure out where the two trees are rooted, and where they’re not connected. That division point between the two trees becomes the sentence boundary. So, in terms of what’s taken into account: a lot of things. Each word gets a vector determined by a window of up to 4 words on either side, and then the parser maintains a stack and a partial parse to determine the next action to take at each word.
Those are the features used in the word vector calculation. But the parser model is basically a push-down automaton that maintains a state, and has actions it can use to manipulate the state, to ultimately output the tree. One of those actions is “insert sentence boundary”. The current state of the automaton is used to calculate the features to determine the next action.
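Here's a toy version of that loop, with just a SHIFT action and the sentence-boundary action — none of the arc actions that actually build the tree. The `next_action` oracle is a hand-written stand-in for the neural network that scores actions from the automaton's state:

```python
def run_automaton(words, next_action):
    """Toy transition loop: the state is (stack, buffer), and one of the
    available actions is an explicit sentence boundary. The real parser
    also has arc actions that build the dependency tree; omitted here."""
    stack, buffer, sentences = [], list(words), []
    while buffer or stack:
        action = next_action(stack, buffer)
        if action == "SHIFT":
            stack.append(buffer.pop(0))
        elif action == "SENT-BREAK":
            sentences.append(stack[:])
            stack.clear()
    return sentences

def next_action(stack, buffer):
    # Hand-written stand-in: spaCy scores actions with a neural net
    # over features of the stack, buffer, and partial parse.
    if stack and (stack[-1].endswith(".") or not buffer):
        return "SENT-BREAK"
    return "SHIFT"

print(run_automaton("I ran . You hid".split(), next_action))
```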
Well, it’s hard to say! I think it’ll be a fairly time-consuming process, and there’s a big risk of getting nothing useful out at the end. Like, you should probably expect to spend a few weeks trying out your current ideas, with maybe a 20-50% chance of success?