I’m sure this is something that’s been asked before, but I’m having trouble finding guidance on this use case:
We’re looking to identify sub-types of entities. In our use case, we’re looking at a biography, e.g. “Dave worked at ACME Corp as CEO and now serves as CFO of Wallmart”, and identifying the current/prior companies and roles separately (so in this example, the correct spans would be “ACME Corp”: prior_company, “CEO”: prior_role, “Wallmart”: current_company, “CFO”: current_role).
Obviously these aren’t strictly entity types: they’re a combination of the company/role entity type and a tense. My question is, which recipe should we use/how should we organise the process?
Currently we’re training 4 new entity types (as above), and we’re getting good accuracy between the entity types (role vs company), but the NER model is getting confused between past and present tense. Is there a better way that you’d recommend of structuring the problem? (Or alternatively, do we need to retrain some vectors to ensure there’s a clear past/present distinction in the embeddings?)
Thanks!
Hi! First, if you haven't seen it yet, you might find @honnibal's talk on this topic helpful. It discusses strategies for breaking a larger NLP problem down into smaller tasks and designing label schemes, which all sounds very relevant to you.
If you're trying to predict categories like PRIOR_COMPANY and CURRENT_COMPANY, you might actually run into a similar problem as the "crime location" and "victim" example in the video: Whether "ACME Corp" is a prior company or current company isn't really inherent to the entity "ACME Corp" itself. It depends on very subtle signals and on the surrounding entities and their relationships to each other. As you've already noticed, this can be very difficult for the model to learn if you're treating it as a pure entity recognition task.
You might see better results if you take a step back and start with the generic categories that are easier to learn: "ACME Corp" is an ORG and "Dave" is a PERSON. Even out of the box, the pre-trained models will probably give you decent accuracy on this, and you can then use Prodigy to improve them further on your data until they're very solid. You could also introduce a new category, ROLE or JOB (for "CFO").
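Just to illustrate, here's roughly what that looks like with a stock model. The exact predictions will vary by model and version, and JOB would be your new custom label, so nothing predicts that out of the box yet:

```python
import spacy

# assumes the small English model is installed:
# python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")
doc = nlp("Dave worked at ACME Corp as CEO and now serves as CFO of Wallmart")

for ent in doc.ents:
    print(ent.text, ent.label_)
# e.g. "Dave" PERSON, "ACME Corp" ORG, "Wallmart" ORG
# (exact output depends on the model)
```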
Once you have a model that can accurately predict those general entities, you could try and resolve the relationships between them. In your examples, the syntax seems to hold most of the clues you need and that's something you can usually predict quite accurately. Here's an example of the sentence in the displaCy visualizer.
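If you want to reproduce that visualization locally, displaCy ships with spaCy – a minimal sketch:

```python
import spacy
from spacy import displacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Dave worked at ACME Corp as CEO and now serves as CFO of Wallmart")
# serves the dependency visualization on http://localhost:5000 –
# use displacy.render instead if you're in a Jupyter notebook
displacy.serve(doc, style="dep")
```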
The above visualization only shows the coarse-grained tags like VERB, i.e. token.pos_. See here for the full part-of-speech tags predicted as token.tag_. For example, token.tag_ == 'VBD' lets you check for verbs in past tense.
"Dave" is the nominal subject attached to a past-tense verb with the lemma "work". From that verb, you can resolve two prepositional phrases: "at ACME Corp" (preposition "at" plus ORG entity) and "as CEO" (preposition as plus JOB entity) – these are all things you can extract programmatically if you have the part-of-speech tags, the dependency parse and relevant entity types. You might have to write a few different rules to cover the possible constructions but it'll also give you a lot more fine-grained control.
You can also keep improving the tagger and parser on your specific data, focusing on the labels you care about the most (e.g. using Prodigy's dep.teach and pos.teach). For example, making sure verbs are correctly tagged as past tense or that subjects are correctly attached. (This is also fairly easy to annotate, even without an extensive linguistic background.)
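For example, something along these lines – the dataset and source names here are just placeholders:

```bash
prodigy pos.teach pos_data en_core_web_sm ./bios.jsonl --label VBD,VBZ
prodigy dep.teach dep_data en_core_web_sm ./bios.jsonl --label nsubj,prep,pobj
```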
Of course, it always depends on your data and problem, so you'd have to experiment. But I do think combining more general entity types with syntax-based rules could potentially help a lot with solving your problem. For more details and inspiration, here's another thread on the concept of combining predictions and rules to solve different information extraction problems:
That’s super helpful, and thanks for the quick response! So NEs depending on present-tense verbs must be current roles, and past tense must be prior roles? It sounds so simple that I’m embarrassed to have missed it!
No worries and definitely keep us updated on how you go. I'm actually very interested in this one because it'd also make a great example for a talk or blog post (if you don't mind us using the example, of course!).
I mean, language isn't always that simple and I'm sure there are plenty of exceptions and edge cases that you'll come across once you start writing that logic. But you can probably identify a number of trigger verbs and phrases that are very strong indicators – like lemma "work" + preposition "at", lemma "serve" + preposition "as", or lemma "be" + "employed", and so on. Then you can check the subtree for entity types like person, organisation, job etc. Or maybe you'll have constructions like "[PERSON], former [JOB] of [ORG], ...", which you can also extract with a rule-based system. You might find this docs section on using the dependency parse helpful.
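For that last construction, a token-based Matcher pattern could work. A sketch, using the v3-style matcher.add signature, and assuming your model (or an EntityRuler) already predicts the custom JOB label – the stock models won't, so this won't match out of the box:

```python
import spacy
from spacy.matcher import Matcher

nlp = spacy.load("en_core_web_sm")
matcher = Matcher(nlp.vocab)

# "[PERSON], former [JOB] of [ORG]" – JOB is a custom label, so this only
# matches once something in your pipeline predicts it
pattern = [
    {"ENT_TYPE": "PERSON", "OP": "+"},
    {"ORTH": ","},
    {"LOWER": "former"},
    {"ENT_TYPE": "JOB", "OP": "+"},
    {"LOWER": "of"},
    {"ENT_TYPE": "ORG", "OP": "+"},
]
matcher.add("FORMER_ROLE", [pattern])

doc = nlp("Dave, former CEO of ACME Corp, now works at Wallmart")
for match_id, start, end in matcher(doc):
    print(doc[start:end].text)
```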
Once you have something in place, you can also use Prodigy to evaluate your rules: run them over lots of text, resolve the output to your structured format and display the text plus the extracted info to the annotator. Then accept/reject, and at the end, calculate the accuracy. You can also check out the rejected examples to see where your rules went wrong and fix them. And you can re-run the evaluation periodically to find out whether a change to the rules improves the overall accuracy or not.
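A minimal sketch of what that could look like – extract_career here is just a stand-in for your rule-based logic:

```python
import json

texts = ["Dave worked at ACME Corp as CEO and now serves as CFO of Wallmart"]

def extract_career(text):
    # stand-in for your rule-based extractor: yields one dict per fact
    yield {"company": "ACME Corp", "role": "CEO", "tense": "prior"}

with open("eval_tasks.jsonl", "w", encoding="utf8") as f:
    for text in texts:
        for fact in extract_career(text):
            # one task per extracted fact – the annotator sees the text plus
            # the structured info in the meta and accepts or rejects it
            f.write(json.dumps({"text": text, "meta": fact}) + "\n")
```

You could then stream those tasks into the mark recipe, e.g. prodigy mark career_eval eval_tasks.jsonl --view-id text, and calculate the accuracy from the accept/reject decisions saved in the dataset.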
That’s really useful - so our pipeline becomes POS -> NER -> DEP -> Rules, where Rules is a set of logic (rather than ML) to identify the relevant verbs and their tags (for tense) and extract the dependent NEs? I can see what you mean about the PyData talk - pretty useful in this context!
Happy for you to use our experience if you’d like! Essentially we want to extract career histories from bios (as you might expect) within a recruitment context; the good thing about the project is we start with loads of scraped training data that’s very accurate for NER, but extending the pipeline is trickier and it’s really helpful to get the advice!