Web NER

Hi There,

First off, I want to say that I love the product - I’ve been using it for a month or so now, and it’s got tonnes of great features and a really slick UI!

I’m currently working on a project that combines web scraping with NLP: identifying people mentioned on websites, along with their job titles and bios, across ~20,000 disparate websites with different structures. We’ve had some good success with a rule-based approach (~70% F1 on each task), but I’m hoping NER models could help increase recall (for unknown job titles/names).

One issue I’m facing is that I don’t have framing sentences - I only want to identify names where they don’t occur in biographies. For example, ‘John Smith - CEO’ should get tagged as a Name and a Role, but ‘John Smith joined us in September from Retailer Plc’ should be tagged as a biography. I can split at the HTML node level, but it’s not clear how I’d include/exclude nodes on that basis without applying a model: that made-up biography is shorter than ‘John Smith-Hughes, Head of Product Development and Strategy, Retail and Finance Division, London’, so the difference is in the semantics rather than the syntax. On the plus side, I have pretty sizeable source data from the initial project that I can leverage, along with indexes of names and roles from that work.
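
For context, my node-level splitting looks roughly like this (a simplified sketch using BeautifulSoup; the tag list is just illustrative):

```python
from bs4 import BeautifulSoup

def node_chunks(html):
    """Yield one text chunk per content-bearing HTML node."""
    soup = BeautifulSoup(html, "html.parser")
    for node in soup.find_all(["p", "li", "td", "h1", "h2", "h3"]):
        text = node.get_text(" ", strip=True)
        if text:
            yield text

html = ("<div><p>John Smith - CEO</p>"
        "<p>John Smith joined us in September from Retailer Plc.</p></div>")
for chunk in node_chunks(html):
    print(chunk)
```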

As a result, I’m struggling to decide on the best approach to start with for using Prodigy on this task. Since we’ve got so much data on job titles/names, it seems we should start with seeding or some other form of synthesised initial data to kick off the active learning (especially to teach the basic rules). But there’s also the possibility of starting from the en_core_web_lg model and retraining the existing PERSON tag (though how would we most effectively train it to ignore ‘framed’ tokens?). And there’s the option of starting with some sort of classifier that detects whether a node contains prose or is an ‘NER target’ node.
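
For the seeding route, I’m assuming I could turn our indexes into a patterns file along these lines (a sketch - the file name and labels are placeholders):

```python
import json

names = ["John Smith", "Jane Doe"]              # from our name index
roles = ["CEO", "Head of Product Development"]  # from our job-title index

with open("patterns.jsonl", "w", encoding="utf8") as f:
    for name in names:
        # A string pattern is matched as an exact phrase.
        f.write(json.dumps({"label": "NAME", "pattern": name}) + "\n")
    for role in roles:
        # A token pattern generalises, e.g. case-insensitive matching.
        tokens = [{"lower": tok.lower()} for tok in role.split()]
        f.write(json.dumps({"label": "ROLE", "pattern": tokens}) + "\n")
```

and then pass that to ner.teach via --patterns patterns.jsonl, so pattern matches are surfaced alongside the model’s suggestions?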

Is there any advice you’d give on the best approach for this project? Apologies for the stream of consciousness - there are so many possible approaches that I’ve confused myself!

Thanks!

Hey,

Glad to hear you’re liking Prodigy!

I think situations like yours are actually quite a common use case: the data is “somewhat” structured, making it difficult to get the rules just right – but it’s also not really running text, making it difficult to see how the models the research community focuses on can best be applied.

My usual advice is to think about “factorizing” the problem into distinct chunks of information. The idea is that you want the new machine learning model you’re creating to learn the minimum number of bits from the maximum amount of evidence, while rule-based or generic components add the additional information required to get the job done.

Perhaps in your case you could have a text classification model that learns whether a chunk of text is a “biography”. Then you can filter out those cases and apply your rules to the more relevant examples?
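
As a rough sketch of the filtering step, assuming you’ve trained a single-label text classifier and saved it to disk (the model path and BIOGRAPHY label here are placeholders):

```python
import spacy

nlp = spacy.load("./bio_classifier")  # your trained textcat model

def non_biography_chunks(chunks, threshold=0.5):
    """Keep only the chunks the classifier doesn't score as biography."""
    for doc in nlp.pipe(chunks):
        if doc.cats.get("BIOGRAPHY", 0.0) < threshold:
            yield doc.text

chunks = [
    "John Smith - CEO",
    "John Smith joined us in September from Retailer Plc.",
]
for text in non_biography_chunks(chunks):
    print(text)  # ideally only the structured "name - role" chunk survives
```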

Another trick might be to learn to recognise segmentation markers, rather than the names and roles themselves. For instance, you might tag the dash in ‘John Smith - CEO’ as a separator. If there are fewer kinds of separators than there are names and titles, that category may be easier to learn. This is just a thought though – maybe it won’t be the easiest way to do things.
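
To make that concrete, here’s a minimal rule-based version of the separator idea (the separator inventory is just an assumption about your data):

```python
import spacy
from spacy.matcher import Matcher

nlp = spacy.blank("en")
matcher = Matcher(nlp.vocab)
# Match any single token that is one of these separator characters.
matcher.add("SEP", [[{"ORTH": {"IN": ["-", "–", "|", ","]}}]])

def split_fields(text):
    """Split a chunk into fields at recognised separator tokens."""
    doc = nlp(text)
    fields, start = [], 0
    for _, s, e in matcher(doc):
        if doc[start:s].text.strip():
            fields.append(doc[start:s].text.strip())
        start = e
    if doc[start:].text.strip():
        fields.append(doc[start:].text.strip())
    return fields

print(split_fields("John Smith - CEO"))  # ['John Smith', 'CEO']
```

A pure rule like this stumbles on hyphenated surnames like ‘John Smith-Hughes’ (the tokenizer splits the internal hyphen too), which is exactly the sort of case where a learned separator category could do better.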

That’s really helpful, thanks! I think I’d been attempting to factorize, but there were so many options and subtasks that I’d tangled myself up.

I actually like both of those ideas - maybe using rules to split (I split on the nodes anyway, so some of that is already done), and then classifying/running NER on the split components. That’s a neat idea and I’ll check it out!

Thanks!