Extracting current and prior company affiliations from bios

Hi! First, if you haven't seen it yet, you might find @honnibal's talk on this topic helpful. It discusses strategies for breaking a larger NLP problem down into smaller tasks and designing label schemes, which all sounds very relevant to you.

If you're trying to predict categories like PRIOR_COMPANY and CURRENT_COMPANY, you might actually run into a problem similar to the "crime location" and "victim" example in the video: whether "ACME Corp" is a prior company or a current company isn't really inherent to the entity "ACME Corp" itself. It depends on very subtle signals and on the surrounding entities and their relationships to each other. As you've already noticed, this can be very difficult for the model to learn if you're treating it as a pure entity recognition task.

You might see better results if you take one step back and start with the generic categories that are easier to learn: "ACME Corp" is an ORG and "Dave" is a PERSON. Even out-of-the-box, the pre-trained models probably give you decent accuracy on this, and you can then use Prodigy to improve them further on your data until the predictions are very solid. You could also introduce a new category, ROLE or JOB (for "CFO").
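
If you want to try out a JOB label before you've annotated anything, one option is to prototype it with a few match patterns. This is just a rough sketch and swaps in spaCy's rule-based entity ruler for a trained model; it assumes spaCy v3, and the patterns and the example sentence are made up for illustration:

```python
import spacy

# Blank English pipeline: tokenizer only, no pre-trained model required
nlp = spacy.blank("en")

# Rule-based entity ruler with a few illustrative (hypothetical) JOB patterns
ruler = nlp.add_pipe("entity_ruler")
ruler.add_patterns([
    {"label": "JOB", "pattern": [{"LOWER": "ceo"}]},
    {"label": "JOB", "pattern": [{"LOWER": "cfo"}]},
    {"label": "JOB", "pattern": [{"LOWER": "chief"}, {"LOWER": "financial"}, {"LOWER": "officer"}]},
])

doc = nlp("Dave worked at ACME Corp as CEO")
job_entities = [(ent.text, ent.label_) for ent in doc.ents]
print(job_entities)
```

Patterns like these won't generalise the way a trained model does, but they can give you a quick feel for whether the label scheme makes sense on your data.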

Once you have a model that can accurately predict those general entities, you could try and resolve the relationships between them. In your examples, the syntax seems to hold most of the clues you need and that's something you can usually predict quite accurately. Here's an example of the sentence in the displaCy visualizer.

The above visualization only shows the coarse-grained tags like VERB, i.e. token.pos_. See here for the full part-of-speech tags predicted as token.tag_. For example, token.tag_ == 'VBD' lets you check for verbs in past tense.
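
To make the difference between the two attributes concrete, here's a small sketch. It assumes spaCy v3 and builds the Doc by hand with tags I've filled in myself for a made-up sentence, so it runs without a model; in practice, the tagger would predict these for you:

```python
from spacy.tokens import Doc
from spacy.vocab import Vocab

# Hand-annotated stand-in for tagger output (assumed example sentence)
words = ["Dave", "worked", "at", "ACME", "Corp", "as", "CEO"]
pos = ["PROPN", "VERB", "ADP", "PROPN", "PROPN", "ADP", "NOUN"]  # coarse-grained
tags = ["NNP", "VBD", "IN", "NNP", "NNP", "IN", "NN"]            # fine-grained
doc = Doc(Vocab(), words=words, pos=pos, tags=tags)

# token.pos_ is the coarse-grained tag, token.tag_ the fine-grained one
for token in doc:
    print(token.text, token.pos_, token.tag_)

# Fine-grained tags let you single out past-tense verbs
past_tense_verbs = [token.text for token in doc if token.tag_ == "VBD"]
print(past_tense_verbs)
```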

"Dave" is the nominal subject attached to a past-tense verb with the lemma "work". From that verb, you can resolve two prepositional phrases: "at ACME Corp" (preposition "at" plus ORG entity) and "as CEO" (preposition "as" plus JOB entity) – these are all things you can extract programmatically if you have the part-of-speech tags, the dependency parse and relevant entity types. You might have to write a few different rules to cover the possible constructions but it'll also give you a lot more fine-grained control.

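Here's roughly what one such rule could look like in code. Treat it as a sketch under a few assumptions: it uses spaCy v3, the sentence is a made-up one along the lines of "Dave worked at ACME Corp as CEO", and the Doc is annotated by hand to stand in for what the tagger, parser and entity recognizer would predict. The helper names and the prior/current logic are my own simplifications:

```python
from spacy.tokens import Doc
from spacy.vocab import Vocab


def build_example_doc():
    """Hand-annotated stand-in for a parsed and NER-tagged sentence."""
    words = ["Dave", "worked", "at", "ACME", "Corp", "as", "CEO"]
    return Doc(
        Vocab(),
        words=words,
        pos=["PROPN", "VERB", "ADP", "PROPN", "PROPN", "ADP", "NOUN"],
        tags=["NNP", "VBD", "IN", "NNP", "NNP", "IN", "NN"],
        lemmas=["Dave", "work", "at", "ACME", "Corp", "as", "CEO"],
        heads=[1, 1, 1, 4, 2, 1, 5],  # absolute index of each token's head
        deps=["nsubj", "ROOT", "prep", "compound", "pobj", "prep", "pobj"],
        ents=["B-PERSON", "O", "O", "B-ORG", "I-ORG", "O", "B-JOB"],
    )


def entity_text(token):
    """Return the text of the full entity span containing the token."""
    for ent in token.doc.ents:
        if ent.start <= token.i < ent.end:
            return ent.text
    return token.text


def extract_affiliations(doc):
    """Find 'PERSON work at ORG as JOB' constructions via the parse."""
    records = []
    for verb in doc:
        if verb.pos_ != "VERB" or verb.lemma_ != "work":
            continue
        subjects = [t for t in verb.children
                    if t.dep_ == "nsubj" and t.ent_type_ == "PERSON"]
        if not subjects:
            continue
        record = {
            "person": entity_text(subjects[0]),
            "org": None,
            "job": None,
            # Simplification: past tense (VBD) counts as a prior affiliation,
            # anything else as current
            "status": "prior" if verb.tag_ == "VBD" else "current",
        }
        # Walk the prepositional phrases attached to the verb
        for prep in verb.children:
            if prep.dep_ != "prep":
                continue
            for obj in prep.children:
                if obj.dep_ != "pobj":
                    continue
                if prep.lower_ == "at" and obj.ent_type_ == "ORG":
                    record["org"] = entity_text(obj)
                elif prep.lower_ == "as" and obj.ent_type_ == "JOB":
                    record["job"] = entity_text(obj)
        records.append(record)
    return records


affiliations = extract_affiliations(build_example_doc())
print(affiliations)
```

A single rule like this obviously won't cover constructions like "has worked at" or "joined ACME Corp in 2015", so you'd add more rules as you come across them in your data.
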
You can also keep improving the tagger and parser on your specific data, focusing on the labels you care about the most (e.g. using Prodigy's dep.teach and pos.teach). For example, making sure verbs are correctly tagged as past tense or that subjects are correctly attached. (This is also fairly easy to annotate, even without an extensive linguistic background.)

Of course, it always depends on your data and problem, so you'd have to experiment. But I do think combining more general entity types with syntax-based rules could potentially help a lot with solving your problem. For more details and inspiration, here's another thread on the concept of combining predictions and rules to solve different information extraction problems: