Size of context window for NLP

Hi @alphie ,

Here's a related post on this forum: Changing the window size of a NER model

In general, for more direct control over the CNN architecture and other hyperparameters, it would probably be easiest to export your annotated dataset with data-to-spacy and continue your training experiments directly in spaCy.
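For example (the dataset name here is a placeholder, and the config data-to-spacy generates may need tweaking for your setup):

```shell
# Export annotations to a spaCy training corpus + config
prodigy data-to-spacy ./corpus --ner my_ner_dataset

# Then train directly with spaCy, editing corpus/config.cfg as needed
python -m spacy train ./corpus/config.cfg --paths.train ./corpus/train.spacy --paths.dev ./corpus/dev.spacy
```

From there you can edit `config.cfg` freely, e.g. the `window_size` of the tok2vec component.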

On another note, "encouraging natural reading order" sounds like a perfect job for an LLM. Maybe you could add a preprocessing step where an LLM "fixes" the OCR output? The disadvantage of this approach is that in production you'd need to apply the LLM as well, and its output won't be deterministic, but it should still produce input very similar to what the model saw during training.
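A minimal sketch of what that preprocessing step could look like, with the LLM call injected as a parameter so the same code runs at annotation time and in production (the prompt and helper names are just illustrative, not from any specific library):

```python
# Sketch: an OCR-"fixing" preprocessing step. `call_llm` is a placeholder
# for your actual LLM client; injecting it keeps the step testable.

OCR_FIX_PROMPT = (
    "Reorder and clean the following OCR output so it reads in "
    "natural reading order. Return only the corrected text.\n\n{text}"
)

def fix_ocr(text: str, call_llm) -> str:
    """Ask an LLM to restore natural reading order in raw OCR output."""
    return call_llm(OCR_FIX_PROMPT.format(text=text))

# Quick check with a stand-in "LLM" that just normalizes whitespace:
def dummy_llm(prompt: str) -> str:
    body = prompt.split("\n\n", 1)[1]
    return " ".join(body.split())

print(fix_ocr("Invoice   No.\n 42 ", dummy_llm))  # → Invoice No. 42
```

You'd run every document through `fix_ocr` both before annotation and at inference time, so training and production inputs stay consistent.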