To put it very simply: the .jsonl file isn't read into memory all at once. It's loaded more like a Python generator, where items are pulled one by one. If the current item matches a pattern, it's added to the batch; once there's a batch of 10 items (or whatever you've configured), the batch is sent to the front end. I'm glossing over some details here, because Prodigy also checks whether the item has been labelled before, but this is the gist.
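To make that concrete, here's a minimal sketch of the generator-plus-batching idea. This is not Prodigy's actual code; the function names and the default batch size of 10 are just illustrative.

```python
import json
from itertools import islice


def stream_examples(path):
    """Lazily yield one example at a time from a .jsonl file."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            yield json.loads(line)


def batched(stream, size=10):
    """Group a lazy stream into batches of `size` items.

    The stream is only consumed as batches are requested, so the
    whole file never needs to fit in memory at once.
    """
    it = iter(stream)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch
```

Each call to `next()` on `batched(...)` pulls just enough lines from disk to fill one batch, which is why even very large .jsonl files start serving tasks almost immediately.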
One thing that's not totally clear yet: am I able to run the same ner.manual command over and over, and if the parameters are all the same, will it continue to build on the existing database that was previously started?
Yes, unless you're doing something fancy with custom recipes. This is because each data input and labelling task is hashed before it's considered as a candidate, and those hashes are compared against the annotations already in the database. This mechanism prevents duplicates from getting into the database. More details on the hashing can be found in the Prodigy docs here.
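Roughly, the idea is that there are two hashes per example: an input hash (the raw text) and a task hash (the text plus the labelling question being asked). The sketch below illustrates the principle only; Prodigy's real hashing considers more fields than this, and the helper names here are made up for the example.

```python
import hashlib
import json


def input_hash(example):
    # Hash only the raw input text, ignoring any labels.
    return hashlib.md5(example["text"].encode("utf-8")).hexdigest()


def task_hash(example):
    # Hash the input *plus* the labelling task, so the same text
    # can still be queued again under a different question.
    payload = json.dumps(
        {"text": example["text"], "label": example.get("label")},
        sort_keys=True,
    )
    return hashlib.md5(payload.encode("utf-8")).hexdigest()


def filter_seen(stream, seen_task_hashes):
    """Skip examples whose task hash has already been annotated."""
    for eg in stream:
        h = task_hash(eg)
        if h not in seen_task_hashes:
            seen_task_hashes.add(h)
            yield eg
```

Because the dedup check is hash-based, re-running the same command with the same parameters simply skips everything you've already answered and picks up where you left off.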
A final question: what kind of patterns are you using? Very complex regexes, or patterns with parts of speech? The reason I'm mentioning it is that you might be able to get a speedup by doing string matching only. I'll take the example from the Prodigy docs here to explain.
{"label": "FRUIT", "pattern": [{"lower": "apple"}]}
{"label": "FRUIT", "pattern": [{"lower": "goji"}, {"lower": "berry"}]}
{"label": "VEGETABLE", "pattern": [{"lower": "squash", "pos": "NOUN"}]}
{"label": "VEGETABLE", "pattern": "Lamb's lettuce"}
You'll notice that this final pattern isn't a list of token attributes but a plain string. These strings are fed internally to spaCy's PhraseMatcher, which is typically faster than the token-based Matcher.
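Outside of Prodigy, you can see the same two mechanisms directly in spaCy. A minimal sketch, assuming a blank English pipeline (the example texts are my own; the `attr="LOWER"` setting makes the PhraseMatcher case-insensitive, mirroring the `"lower"` token patterns above):

```python
import spacy
from spacy.matcher import Matcher, PhraseMatcher

nlp = spacy.blank("en")

# Token-based pattern: a list of per-token attribute dicts,
# like the "goji berry" pattern from the docs.
matcher = Matcher(nlp.vocab)
matcher.add("FRUIT", [[{"LOWER": "goji"}, {"LOWER": "berry"}]])

# Phrase-based pattern: a plain string, matched on lowercased text.
phrase_matcher = PhraseMatcher(nlp.vocab, attr="LOWER")
phrase_matcher.add("VEGETABLE", [nlp.make_doc("lamb's lettuce")])

doc = nlp("I bought a Goji Berry and some Lamb's lettuce.")
token_hits = [doc[start:end].text for _, start, end in matcher(doc)]
phrase_hits = [doc[start:end].text for _, start, end in phrase_matcher(doc)]
```

So if none of your patterns actually need token attributes like `pos`, switching them to plain strings hands the work to the PhraseMatcher, which can make a noticeable difference on large streams.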