Size of context window for NLP

Hi @alphie ,

Here's a related post on this forum: Changing the window size of a NER model

In general, for more direct control over the CNN architecture and other hyperparameters, it would probably be easiest to export your annotated dataset with data-to-spacy and continue your training experiments directly in spaCy.
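For example (the dataset name here is a placeholder, and the config data-to-spacy generates may need tweaking for your setup):

```shell
# Export annotations to a spaCy training corpus + config
prodigy data-to-spacy ./corpus --ner my_ner_dataset

# Then train directly with spaCy, editing corpus/config.cfg as needed
python -m spacy train ./corpus/config.cfg --paths.train ./corpus/train.spacy --paths.dev ./corpus/dev.spacy
```

From there you can edit `config.cfg` freely, e.g. the `window_size` of the tok2vec component.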

On another note, "encouraging natural reading order" sounds like a perfect job for an LLM. Maybe you could add a preprocessing step where an LLM "fixes" the OCR output? The disadvantage of this approach is that in production you'd need to apply the LLM as well, and its output won't be deterministic, but it should still produce input very similar to what the model saw during training.
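A minimal sketch of what that preprocessing step could look like, with the LLM call injected as a parameter so the same code runs at annotation time and in production (the prompt and helper names are just illustrative, not from any specific library):

```python
# Sketch: an OCR-"fixing" preprocessing step. `call_llm` is a placeholder
# for your actual LLM client; injecting it keeps the step testable.

OCR_FIX_PROMPT = (
    "Reorder and clean the following OCR output so it reads in "
    "natural reading order. Return only the corrected text.\n\n{text}"
)

def fix_ocr(text: str, call_llm) -> str:
    """Ask an LLM to restore natural reading order in raw OCR output."""
    return call_llm(OCR_FIX_PROMPT.format(text=text))

# Quick check with a stand-in "LLM" that just normalizes whitespace:
def dummy_llm(prompt: str) -> str:
    body = prompt.split("\n\n", 1)[1]
    return " ".join(body.split())

print(fix_ocr("Invoice   No.\n 42 ", dummy_llm))  # → Invoice No. 42
```

You'd run every document through `fix_ocr` both before annotation and at inference time, so training and production inputs stay consistent.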