I have some digital PDFs from which I extract tables into pandas DataFrames. It works quite well, but sometimes the extracted text contains whitespace that shouldn't be there, e.g.:
Cos t of s a l es
Res ea rch a nd devel opment cos ts
Since I'm able to decipher where the whitespace should be, maybe I can create a spaCy component that can do so as well, but I'm not sure how. I could make it a textcat task for each possible merge of tokens, but the correct answer really depends on the whole phrase, not just the merged tokens. I'm not sure if this forum is the right place to ask, though. Could Stack Overflow be a better choice?
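
For comparison, here is a minimal sketch of a non-ML baseline that looks at the whole phrase rather than at individual merges: drop all whitespace, then re-split the string into the fewest words found in a vocabulary. This is plain Python, not a spaCy component, and it is only an assumption about what might work. The names `VOCAB` and `resegment` are hypothetical, and the toy vocabulary would have to be replaced by a real word list for the domain.

```python
# Hypothetical toy vocabulary (an assumption for illustration only);
# a real one would come from a domain word list or a frequency dictionary.
VOCAB = {"cost", "of", "sales", "research", "and", "development", "costs"}


def resegment(text, vocab=VOCAB):
    """Drop all whitespace, then split into the fewest vocabulary words.

    Returns the input unchanged if no full segmentation exists.
    """
    chars = "".join(text.split())   # remove every whitespace character
    lowered = chars.lower()         # vocabulary lookup is case-insensitive
    n = len(chars)

    # best[i] holds (word_count, words) for the prefix chars[:i],
    # or None if that prefix cannot be segmented.
    best = [None] * (n + 1)
    best[0] = (0, [])
    for end in range(1, n + 1):
        for start in range(end):
            if best[start] is None:
                continue
            if lowered[start:end] in vocab:
                candidate = (best[start][0] + 1,
                             best[start][1] + [chars[start:end]])
                if best[end] is None or candidate[0] < best[end][0]:
                    best[end] = candidate

    return " ".join(best[n][1]) if best[n] is not None else text


print(resegment("Cos t of s a l es"))
print(resegment("Res ea rch a nd devel opment cos ts"))
```

Because the sketch discards the original spaces entirely, it can pick the wrong split when several segmentations are possible; weighting candidates by word frequency, or using a trained model, would be needed to resolve those cases.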