Nice, a 5.5MM company database? Sounds like a pretty helpful asset to have! We’re training with a pattern list in the low tens of thousands (we’re focussed geographically in this context) and it’s working fine. In principle the matcher gets slower as you add more patterns, so you might want to prioritise significant companies that are ‘making waves’ in the news; but I’d give it a go and see (I imagine you’re using SEC filings or similar for company names - in that case, there are likely to be a lot of shell companies). I imagine there may be a way of increasing performance with additional RAM/CPU if needed as well - you can always throw that at it to see.
One point on that - many companies are stupidly named (in the UK there’s a company called ‘Very’) - you may want to drop those from your match patterns so you don’t spend your life on false positives.
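A rough sketch of that filtering step, in case it helps. The stoplist and the helper names here are made up for illustration; in practice you'd probably build the stoplist from a word-frequency list rather than by hand.

```python
# Hypothetical stoplist of everyday words that are also company names.
AMBIGUOUS_WORDS = {"very", "next", "boots", "target"}


def is_safe_pattern(name, ambiguous=AMBIGUOUS_WORDS):
    """Keep a name unless it is empty or a single token that is a common word."""
    tokens = name.lower().split()
    if not tokens:
        return False
    return not (len(tokens) == 1 and tokens[0] in ambiguous)


def filter_company_names(names):
    """Return only the names that are reasonably safe to use as match patterns."""
    return [name for name in names if is_safe_pattern(name)]


companies = ["Very", "Acme Widgets", "Next", "Globex Corporation"]
kept = filter_company_names(companies)
print(kept)
```

Multi-word names are kept even if one of their words is on the stoplist, on the assumption that the extra tokens make a false positive much less likely.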
As I understand it (and @ines/@honnibal would have to confirm - I’m just an amateur chipping in) - spaCy models shouldn’t ever just learn the match patterns themselves, but rather things like the context vectors/POS tags of the surrounding tokens; as such you don’t need to worry about the patterns you start with causing overfitting. That said, after your first `ner.batch-train`, you’ll have a model that thinks too many things are companies - but that’s fixed during the `ner.teach` phase.
Regarding the spanning issue, you might also want to consider this: in our experience, company names in news articles rarely include the ‘Ltd’ or ‘LLC’ - so you might want to make match patterns with optional suffixes (otherwise you’ll miss samples that don’t end with a suffix). Where the text does contain a suffix and the matched span doesn’t capture it, you can answer ‘reject’.
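Here's a minimal sketch of generating patterns with an optional suffix. It assumes the patterns file follows spaCy's token-pattern format (one JSON object per line with `label` and `pattern`) and that your spaCy version supports the `IN` set operator and `"OP": "?"` in token patterns. The suffix list and the `make_pattern` helper are my own inventions, and splitting on whitespace is a simplification: names with punctuation may be tokenized differently by spaCy.

```python
import json

# Assumed suffix list (lowercase); extend it for the jurisdictions you care about.
SUFFIXES = ["ltd", "ltd.", "limited", "llc", "inc", "inc.", "plc"]


def make_pattern(name, label="ORG", suffixes=SUFFIXES):
    """Build a token pattern for `name` with an optional trailing legal suffix."""
    tokens = name.split()
    # Strip a suffix that is already part of the database name.
    if len(tokens) > 1 and tokens[-1].lower() in suffixes:
        tokens = tokens[:-1]
    pattern = [{"LOWER": token.lower()} for token in tokens]
    # "OP": "?" makes the suffix token optional (zero or one occurrence).
    pattern.append({"LOWER": {"IN": list(suffixes)}, "OP": "?"})
    return {"label": label, "pattern": pattern}


for company in ["Acme Widgets Ltd", "Globex"]:
    print(json.dumps(make_pattern(company)))
```

With an optional token the matcher can return both the bare name and the name plus suffix when the suffix is present, which is where the ‘reject’ answers come in.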
One final point - as news outlets come out with ‘data’ propositions, they’re doing lots of article tagging. You might be able to scrape the tags as well (if that’s what you’re doing, as I suspect) and use them to create synthetic datasets. Then you can pipe them in by using `db-in`. Just make sure you create negative samples as well! (On a side note, I’m fond of synthetic data augmentation - and if you’ve got a company database you can double your sample size by choosing company names at random to swap in for the existing ones - if ‘Google’ is a company in ‘Google has released a new product’, then ‘Acme’ is a company in ‘Acme has released a new product’.)
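The name-swapping idea could look something like the sketch below. It assumes each example is a dict with `text`, character-offset `spans` and an `answer`, written out as one JSON object per line, and that spans don't overlap. `swap_company` and `augment` are hypothetical helpers, not part of any library.

```python
import json
import random


def swap_company(example, span_index, new_name):
    """Replace one annotated span with `new_name` and shift the later spans."""
    text = example["text"]
    spans = example.get("spans", [])
    start, end = spans[span_index]["start"], spans[span_index]["end"]
    shift = len(new_name) - (end - start)
    new_spans = []
    for i, span in enumerate(spans):
        span = dict(span)  # copy so the original example is left untouched
        if i == span_index:
            span["end"] = start + len(new_name)
        elif span["start"] >= end:
            span["start"] += shift
            span["end"] += shift
        new_spans.append(span)
    new_text = text[:start] + new_name + text[end:]
    return {**example, "text": new_text, "spans": new_spans}


def augment(examples, company_names, rng):
    """Create one synthetic copy per example with its first span swapped at random."""
    return [
        swap_company(eg, 0, rng.choice(company_names))
        for eg in examples
        if eg.get("spans")
    ]


example = {
    "text": "Google has released a new product",
    "spans": [{"start": 0, "end": 6, "label": "ORG"}],
    "answer": "accept",
}
for eg in augment([example], ["Acme", "Globex"], random.Random(0)):
    print(json.dumps(eg))
```

Negative samples would go in the same file with `"answer": "reject"`, for instance a highlighted span that isn't a company.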