Size of context window for NLP

We are using Natural Language Processing to look at text streams from PDFs and identify entities etc. From my reading, my understanding is that it relies on the context, which is basically a small number of neighbouring words (4?).
It's going quite well, but extracting text from PDFs can muddle up the word order, and sometimes useful neighbouring words end up further away. I have adjusted my extraction technique somewhat to encourage natural reading order, but this only works to a degree.
I am wondering whether a larger context window might help accuracy. Is there a way to change the number of neighbouring words the model looks at?

Hi @alphie,

Here's a related post on this forum: Changing the window size of a NER model

In general, to have more direct control over the CNN architecture etc., it would probably be easiest to export your annotated dataset with data-to-spacy and continue with training experiments directly in spaCy.
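For example, with spaCy's default tok2vec setup the context comes from the encoder's window_size and depth settings: each layer looks at window_size neighbours on each side, and stacking depth layers gives an effective receptive field of roughly window_size * depth tokens. Here's a rough, untested sketch of widening it programmatically, assuming a standard config.cfg generated by data-to-spacy that uses the default MaxoutWindowEncoder architecture (the section path may differ in a customised config):

# Rough sketch: widen the tok2vec context window in a spaCy training config.
# Assumes config.cfg uses the default spacy.MaxoutWindowEncoder architecture.
from thinc.api import Config

config = Config().from_disk("config.cfg")
encode = config["components"]["tok2vec"]["model"]["encode"]
encode["window_size"] = 2  # default is 1; effective context ~ window_size * depth
config.to_disk("config.cfg")

After that, retraining with python -m spacy train config.cfg picks up the wider window.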

On another note, "encouraging natural reading order" sounds like a perfect job for an LLM. Maybe you could add a preprocessing step where an LLM "fixes" the OCR output? The disadvantage of this approach is that in production you'd need to apply the LLM as well, and its output won't be deterministic, but I think it would still generate very plausible input, very similar to the input used in training.
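As a rough sketch of what that preprocessing step could look like (the openai client, model name and prompt here are purely illustrative, not something Prodigy requires):

# Illustrative sketch: ask an LLM to restore natural reading order in
# extracted PDF text before it reaches the NER pipeline.
# Assumes the openai package and an OPENAI_API_KEY in the environment.
from openai import OpenAI

client = OpenAI()

def fix_reading_order(raw_text: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative model choice
        temperature=0,  # reduces, but doesn't eliminate, non-determinism
        messages=[
            {"role": "system",
             "content": "Reorder this text into natural reading order. "
                        "Do not add, remove or reword anything."},
            {"role": "user", "content": raw_text},
        ],
    )
    return response.choices[0].message.content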

Thank you, these are helpful suggestions. Before I take the leap to data-to-spacy, I need to do some rationalisation and merging of datasets (which I know how to do). Is there a quick way to find out the number of entries in each dataset and the date it was created? I need to do a bit of thinking on which datasets are the ones to merge.

I know prodigy stats -l gives me the datasets, but it would be great to know how many entries are in each and, ideally, the date I created each dataset.

Many thanks in advance. Alphie

Hi @alphie,

You can call prodigy stats on a single dataset and that should give you all the info you need, I think. Here's example output:

============================== ✨  Dataset Stats ==============================

Dataset       test
Created       2024-10-14 19:02:20
Description   None
Author        None
Annotations   6
Accept        6
Reject        0
Ignore        0

To automate it, you can first call prodigy stats -l -nf to get the list of all datasets more easily from the JSON output, and then call stats on each dataset in a loop.
This one-liner is horrific, but it does the job (heads up: it uses the jq tool):

initial_output=$(python -m prodigy stats -l -nf);echo "Dataset-specific stats:"; echo "----------------------"; for dataset in $(echo "$initial_output" | jq -r '.datasets[]'); do echo "Stats for dataset: $dataset"; python -m prodigy stats "$dataset"; echo; done
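
If you'd rather stay in Python, Prodigy's database API can do the counting directly. A rough sketch, assuming a recent Prodigy version where get_dataset_examples is available (note it loads every example into memory, so it can be slow on very large datasets):

# Rough sketch: count annotations per dataset via Prodigy's database API.
from prodigy.components.db import connect

db = connect()  # uses the database configured in prodigy.json
for name in db.datasets:
    examples = db.get_dataset_examples(name)
    print(f"{name}: {len(examples)} annotations")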

Wow - that is quite a line of code! Thank you!