ner.manual training pauses indefinitely every 10 saves?

Well, I tried to post about this issue a little while ago, but the system decided it was spam and seems to have removed or hidden it.

Anyway, I'm trying to train with ner.manual, using a set of 2.5M patterns to pre-label an 8MB .jsonl file containing "text" entries.

Every 10 "Accepts" that I do, I get the "Loading..." view, and it stays that way indefinitely. I usually hit Save, and then shift+reload, and wait a long while, and then finally it loads the next one to label.

Any ideas?

This is on an M1 Max Mac with 64GB of RAM, running the latest macOS and Python 3.9, with the latest versions of Prodigy and spaCy installed.

This forum runs Discourse, which comes with a spam detector. It's certainly not terrible, but I can confirm that it sometimes has trouble distinguishing spam from developer terminology. We try to correct it manually when we spot a mistake, but we're not always on time. Sorry about that.

My first thought about the "Loading..." screen is that Prodigy is going through each example in your .jsonl file trying to find a pattern match. Given that you have 2.5M patterns, it can take a while before it's able to process even a single example.

Is there anything you can share about the task? I'm mainly curious why you need 2.5M patterns.

A possible alternative is to run the patterns yourself offline. You could pre-filter the .jsonl file upfront so that you don't have to wait for the matching while you're labelling.
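For example, a rough sketch of what that offline pre-filtering could look like (the file names are placeholders, and it assumes your patterns are token patterns; plain string patterns would need a PhraseMatcher instead):

import json
import spacy
from spacy.matcher import Matcher

nlp = spacy.blank("en")
matcher = Matcher(nlp.vocab)

# Register every token pattern from the patterns file under its label
with open("patterns.jsonl") as f:
    for line in f:
        entry = json.loads(line)
        matcher.add(entry["label"], [entry["pattern"]])

# Keep only the examples that contain at least one pattern match
with open("input.jsonl") as src, open("filtered.jsonl", "w") as out:
    for line in src:
        example = json.loads(line)
        if matcher(nlp(example["text"])):
            out.write(line)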

Thanks for the reply!

So, just so I understand: each time it goes to load a new set of documents to label, it's running the entire list of patterns against those 10-20 docs before loading them up?

I could probably find ways to narrow the pattern list; this was just my first attempt at NER, so I was experimenting with an exhaustive list of multi-word patterns in a variety of orderings. It sounds like that isn't practical, so I'll need to rethink the approach.

One thing that's not totally clear yet: am I able to run the same ner.manual command over and over, and if the parameters are all the same, will it continue to build on the existing database that was previously started? (I.e., not overwrite and start from scratch?) The confusion was that I wasn't sure whether I should only use ner.manual once and then use ner.correct from then on.

To put it very simply: the .jsonl file isn't read into memory all at once. It's loaded more like a Python generator, where items are picked one by one. If the current item matches a pattern, it's added to the batch. Once there's a batch of 10 items (or whatever you've configured), the batch is sent to the front end. I'm glossing over some details here, because Prodigy also checks whether the item has been labelled before, but this is the gist.
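This isn't Prodigy's actual internals, but the idea looks roughly like this in code (the file name and batch size are just placeholders):

import json
from itertools import islice

def stream_examples(path):
    # Yield one example at a time instead of reading the whole file into memory
    with open(path) as f:
        for line in f:
            yield json.loads(line)

def batched(stream, size=10):
    # Lazily collect items into batches; each full batch is sent onward
    stream = iter(stream)
    while True:
        batch = list(islice(stream, size))
        if not batch:
            return
        yield batch

for batch in batched(stream_examples("input.jsonl")):
    ...  # roughly the point where Prodigy would send a batch to the front end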

One thing that's not totally clear yet: am I able to run the same ner.manual command over and over, and if the parameters are all the same, will it continue to build on the existing database that was previously started?

Yes, unless you're doing something fancy with custom recipes. This works because data inputs and labelling tasks are hashed and compared against the annotations already in the database before they're considered as candidates. This mechanism prevents duplicates from getting into the database. More details on the hashing can be found in the Prodigy docs here.
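To illustrate the idea only (this is a simplified sketch, not Prodigy's actual hashing, and the "spans" field is an assumption about where the annotations live):

import hashlib
import json

def input_hash(example):
    # Hash of the raw input text
    return hashlib.md5(example["text"].encode("utf8")).hexdigest()

def task_hash(example):
    # Hash of the input plus the annotations attached to it
    payload = example["text"] + json.dumps(example.get("spans", []), sort_keys=True)
    return hashlib.md5(payload.encode("utf8")).hexdigest()

def filter_seen(stream, seen):
    # `seen` stands in for the hashes already stored in the database
    for example in stream:
        key = (input_hash(example), task_hash(example))
        if key not in seen:
            seen.add(key)
            yield example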

A final question: what kind of patterns are you using? Very complex regexes, or patterns with parts of speech? The reason I mention it is that you might be able to get a speedup by only doing string matching. I'll take the example from the Prodigy docs here to explain.

{"label": "FRUIT", "pattern": [{"lower": "apple"}]}
{"label": "FRUIT", "pattern": [{"lower": "goji"}, {"lower": "berry"}]}
{"label": "VEGETABLE", "pattern": [{"lower": "squash", "pos": "NOUN"}]}
{"label": "VEGETABLE", "pattern": "Lamb's lettuce"}

You'll notice that the final pattern isn't a list of token attributes but a plain string. These string patterns are fed internally to spaCy's PhraseMatcher, which is typically faster than the token-based Matcher.
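As a small illustration of the difference, using the terms from the example above (a sketch, not what Prodigy runs internally):

import spacy
from spacy.matcher import Matcher, PhraseMatcher

nlp = spacy.blank("en")

# Token-based Matcher: each pattern is a list of token-attribute dicts
matcher = Matcher(nlp.vocab)
matcher.add("FRUIT", [[{"lower": "goji"}, {"lower": "berry"}]])

# PhraseMatcher: patterns are plain Doc objects matched as exact (lowercased) phrases,
# which scales much better to very large terminology lists
phrase_matcher = PhraseMatcher(nlp.vocab, attr="LOWER")
phrase_matcher.add("VEGETABLE", [nlp.make_doc("Lamb's lettuce")])

doc = nlp("I had goji berry juice and lamb's lettuce salad.")
print(matcher(doc))         # one (match_id, start, end) tuple for "goji berry"
print(phrase_matcher(doc))  # one (match_id, start, end) tuple for "lamb's lettuce"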

Thank you! Switching to phrase matching cut RAM usage to a fifth of what it was before, and there's no delay in loading new sets anymore!


Happy to hear it!