First off, I love the product. I’ve been using it for a month or so now, and it’s got tonnes of great features and a really slick UI!
I’m currently working on a project that combines web scraping with NLP: identifying people listed on websites, along with their job titles and bios, across ~20,000 disparate websites with different structures. We’ve had some good success with a rule-based approach (~70% F1 at each task), but I’m hoping that NER models could help increase recall for names and job titles that the rules don’t already know about.
One issue that I’m facing is that I don’t have framing sentences, and I only want to identify names where they don’t occur in biographies. For example, ‘John Smith - CEO’ should get tagged as a Name and a Role, but ‘John Smith joined us in September from Retailer Plc’ should be tagged as a biography. I can split at the HTML node level, but it’s not clear how I’d include or exclude nodes on that basis without applying a model. That made-up biography is shorter than ‘John Smith-Hughes, Head of Product Development and Strategy, Retail and Finance Division, London’, so length alone won’t separate the two; the difference is in the semantics rather than the syntax. However, I do have a pretty sizeable amount of source data from the initial project, including indexes of names and roles, that we can leverage.
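To make the node-level split concrete, here is a minimal sketch of what I mean, using only the Python standard library. The class and function names (`NodeTextExtractor`, `split_nodes`) are made up for illustration, and our real scraper is more involved; note that inline tags such as `<b>` would split one paragraph into several nodes in this simplified version.

```python
from html.parser import HTMLParser


class NodeTextExtractor(HTMLParser):
    """Collect the text of each text node, skipping <script> and <style>."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.nodes = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Collapse runs of whitespace and drop empty nodes.
        text = " ".join(data.split())
        if text and not self._skip_depth:
            self.nodes.append(text)


def split_nodes(html):
    """Return one string per non-empty text node in the HTML."""
    parser = NodeTextExtractor()
    parser.feed(html)
    parser.close()
    return parser.nodes


html = (
    "<div><h3>John Smith - CEO</h3>"
    "<p>John Smith joined us in September from Retailer Plc</p>"
    "<script>var x = 1;</script></div>"
)
nodes = split_nodes(html)
```

Each string in `nodes` is what I would want to treat as either a ‘ner target’ or a piece of prose.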
As a result, I’m struggling to decide on the best approach to start with when using Prodigy for this task. As we’ve got so much data on job titles and names, it seems that we should start with seeding, or some other form of synthesised initial data, to kick off the active learning (especially to teach the basic rules). There’s also the possibility of starting from the en_core_web_lg model and retraining the existing Person tag, but how would we then most effectively train it to ignore ‘framed’ tokens? A third option is to start with some sort of classifier that detects whether a node contains prose or is a ‘ner target’ node.
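For the seeding option, this is roughly what I have in mind: turning our existing indexes into a token-based patterns file in JSONL form. It is only a sketch under some assumptions: the `PERSON` and `ROLE` label names and the helper names are my own placeholders, and I split on whitespace, which will not always match how spaCy tokenises (hyphenated names in particular).

```python
import json


def to_pattern(label, phrase):
    """Build one token pattern that matches the phrase case-insensitively."""
    tokens = phrase.split()  # assumption: whitespace split, not spaCy's tokenizer
    return {"label": label, "pattern": [{"lower": t.lower()} for t in tokens]}


def build_patterns(names, roles):
    """Combine the name and role indexes into a single list of patterns."""
    patterns = [to_pattern("PERSON", name) for name in names]
    patterns += [to_pattern("ROLE", role) for role in roles]
    return patterns


def write_jsonl(patterns, path):
    """Write one JSON object per line."""
    with open(path, "w", encoding="utf-8") as f:
        for pattern in patterns:
            f.write(json.dumps(pattern) + "\n")


# Tiny stand-ins for the real indexes from the initial project.
patterns = build_patterns(["John Smith"], ["Head of Product Development"])
```

My understanding is that a file written with `write_jsonl(patterns, "patterns.jsonl")` could then be passed to `ner.teach` via `--patterns`, but I’m not sure whether that is the right starting point given the biography problem above.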
Is there any advice you’d give on the best approach for this project? Apologies for the ‘stream of consciousness’ - there are a lot of different approaches, and I’m confusing myself with them!