Fact extraction for earnings news

Hi and welcome! :smiley:

This is an interesting project and definitely sounds like something that can be solved by NLP. Large-scale information extraction (including things like "populate a database from free-form text") is a use case where NLP really shines and something that is already working very well across research and production. I think what it really comes down to is breaking the larger, abstract goal down into smaller machine learning tasks, and finding out what works best for each individual component.

If you haven't seen it already, you might find @honnibal's talk on how to approach NLP problems helpful. The examples around 11:38 are especially relevant, because they show a similar information extraction task and common pitfalls and solutions:

I'd recommend starting with generic categories and building on top of them to extract more specific information. spaCy's pre-trained English models can already predict many of the entity types that are relevant to your problem – e.g. ORG, MONEY, TIME and even things like PERCENT. It probably won't work perfectly on your data out-of-the-box, since earnings reports are pretty different from the general news and web corpus the models were trained on. But it's a good starting point and you can leverage this for your own custom model.
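
For example, to get a feel for what the pre-trained model already predicts on your texts, you could run something like this (the sentence is made up and the exact output will vary depending on the model and version):

import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Acme Corp reported net sales of $1.2 billion for the quarter, up 5% year-on-year.")
print([(ent.text, ent.label_) for ent in doc.ents])
# possible output, depending on the model:
# [('Acme Corp', 'ORG'), ('$1.2 billion', 'MONEY'), ('5%', 'PERCENT')]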

The ner.teach recipe in Prodigy is designed to help you fine-tune a pre-trained model by giving it feedback on its predictions. So you could start with the pre-trained English model, pick one label (e.g. ORG), load in your earnings reports and start clicking accept or reject. The suggestions you'll see are the ones the model is most uncertain about, e.g. the ones with a score closest to 0.5. (The idea here is that this gives you the examples that make the most difference for training). Once you're done, you can run ner.batch-train to see how the model improves with the new data.

Another labelling strategy could be to take an existing pre-trained model, get its predictions for a given set of entity types and correct them manually. This is much faster than labelling everything from scratch, because even if your model only gets 50% right, you'll only have to correct the rest. That workflow is built into Prodigy as the ner.make-gold recipe.

Of course it's difficult to give a definitive answer here – you'll have to run a few experiments and try things. But thinking of it as a pipeline of different components that build on top of each other definitely makes sense. Here's a possible strategy:

  1. Use the entity recognizer to extract the underlying generic concepts like company names and money. You can take advantage of pre-trained models and fine-tune them on your data, which should be significantly more efficient than doing everything from scratch.
  2. Add rules wherever rules work better than statistical models. For example, I could imagine that things like "Q2 2018" might be easier to extract by writing a few match patterns. spaCy's rule-based matcher lets you write patterns similar to regular expressions, but over tokens and their attributes (including lemmas, part-of-speech tags and dependencies) – see the sketch after this list.
  3. Use the dependency parser to extract relationships between the spans extracted in the previous steps. For example, there might be many mentions of amounts of money, but you only care about the ones that are about sales. By looking at the relationships between the entities and the surrounding words (subject/object, is it attached to a specific verb, etc.), you'll be able to get a better idea of whether it's information you care about or not. Again, you can take advantage of pre-trained models here.
  4. Maybe train a text classification component to assign labels to whole sentences or paragraphs. This might work well for reports that are less information-dense and/or long-winded. A text classifier can also be helpful to distinguish between relevant information and "noise" – for example, if you're dealing with reports that include both company finances and other stuff, you could train a classifier to predict whether the text is about company finances and filter out the rest before you analyse the data further.
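
To illustrate step 2, here's a minimal sketch of what match patterns for quarter mentions could look like (the patterns and example sentence are made up, and the Matcher.add signature below is the spaCy 2.x one):

import spacy
from spacy.matcher import Matcher

nlp = spacy.load("en_core_web_sm")
matcher = Matcher(nlp.vocab)

# patterns for "Q1 2018" ... "Q4 2018" and "first/second/third/fourth quarter of 2018"
patterns = [[{"ORTH": q}, {"SHAPE": "dddd"}] for q in ("Q1", "Q2", "Q3", "Q4")]
patterns += [[{"LOWER": w}, {"LOWER": "quarter"}, {"LOWER": "of"}, {"SHAPE": "dddd"}]
             for w in ("first", "second", "third", "fourth")]
matcher.add("QUARTER", None, *patterns)

doc = nlp("Operating income improved in Q2 2018 compared to the second quarter of 2017.")
for match_id, start, end in matcher(doc):
    print(doc[start:end].text)
# Q2 2018
# second quarter of 2017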

To elaborate a bit on combining statistical predictions and rules, and on taking advantage of the dependency parser and tagger (which are often very underappreciated IMO), here's a simplified sentence from one of your examples in our displaCy dependency visualizer:

And here's the same thing in code:

import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Sales totaled 864 million")
print([(token.text, token.pos_, token.dep_) for token in doc])
# [('Sales', 'NOUN', 'nsubj'), ('totaled', 'VERB', 'ROOT'),
#  ('864', 'NUM', 'compound'), ('million', 'NUM', 'dobj')]

If you know that the tokens "864 million" are a MONEY entity (e.g. because your entity recognizer predicted it), you can walk up the tree and check how it attaches to the rest of the sentence: in this case, the phrase is a direct object attached to a verb with the lemma "total" (to total) and the subject of the sentence is "sales". Once you start working with your data, you might find a lot of similar patterns that let you cover a large number of relevant sentences.
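
For instance, a rough sketch of that check in code could look like this (the span indices and the lemma/subject checks are just examples you'd adapt to your data):

import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Sales totaled 864 million")

amount = doc[2:4]                  # the span "864 million", e.g. predicted as MONEY
head = amount.root.head            # the token the phrase attaches to, here "totaled"
if head.pos_ == "VERB" and head.lemma_ == "total":
    subjects = [tok for tok in head.lefts if tok.dep_ == "nsubj"]
    if any(tok.lower_ == "sales" for tok in subjects):
        print("Looks like a sales figure:", amount.text)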

Again, if you just run the pre-trained model on your text out-of-the-box, you might find that the predictions aren't that great, because your texts are quite specific. But you can fine-tune those components as well, just like you do with the entity recognizer. In Prodigy, you can use the pos.teach and dep.teach recipes to fine-tune the part-of-speech tagger and dependency parser. I wouldn't bother too much with all the random obscure labels and would focus on the most important ones instead: nouns and verbs, and subject/object relationships. If the model gets those right, you'll be able to work with that and the rest will fall into place much more easily. NOUN and VERB are also pretty easy to annotate, even without a deep background in linguistics.

Btw, focusing on training the generic components also means that you'll end up with a pretty powerful base model with a part-of-speech tagger, dependency parser and named entity recognizer fine-tuned on company reports. Having that in your toolbox will be incredibly useful, even if you end up needing to analyse different things in the future. If you have the basics covered, you can always add different rules and components on top of them.

One thing that's important to consider is that NLP models can generally only look at a very narrow context window. This is especially true for named entity recognition and short text classification – but even for long text classification, a common strategy is to classify shorter fragments and average over the predictions.

So as a rule of thumb, we often recommend the following: If you are not able to make the annotation decision based on the local context (e.g. one paragraph), the model is very unlikely to be able to learn anything meaningful from that decision.

Of course that doesn't mean that your problem is impossible to solve – it might just need a slightly different approach that leverages what's possible/easy to predict and takes into account the limitations. I'm not an expert in earnings reports, but let's assume you have a lot of reports that kinda look like the Electrolux one. The information about the period / time frame is encoded in a headline and everything following that headline implicitly refers to the second quarter. So one approach could be this:

  1. Detect the headlines. Maybe you only need rules here, because headlines are formatted quite distinctly from the rest of the text. Or maybe it's better to train a text classifier to predict HEADLINE. This depends on the data and is something you need to experiment with.
  2. Associate text with headlines. This one is hopefully easy – we can assume that all text up to the next headline belongs to the previous headline.
  3. Detect whether a headline references a quarter and if so, normalize that to a structured representation. For example, you might want a function that does this: "second quarter of 2018" → {'q': 2, 'year': 2018}. Custom attributes in spaCy are a great way to attach this type of data to documents and paragraphs btw.
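
Here's a very simplified sketch of what step 3 plus a custom attribute could look like – the regular expression, the headline text and the extension name period are all just examples:

import re
import spacy
from spacy.tokens import Doc

nlp = spacy.load("en_core_web_sm")
Doc.set_extension("period", default=None)   # readable as doc._.period

QUARTERS = {"first": 1, "second": 2, "third": 3, "fourth": 4}
QUARTER_RE = re.compile(r"(first|second|third|fourth) quarter of (\d{4})", re.IGNORECASE)

def parse_period(text):
    # normalize e.g. "second quarter of 2018" to {'q': 2, 'year': 2018}
    match = QUARTER_RE.search(text)
    if not match:
        return None
    return {"q": QUARTERS[match.group(1).lower()], "year": int(match.group(2))}

doc = nlp("Strong organic sales growth in the second quarter of 2018")
doc._.period = parse_period(doc.text)
print(doc._.period)   # {'q': 2, 'year': 2018}

The same approach works for Span objects if you end up treating each paragraph as a Span and want to attach the period there instead.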

As I said, this is just an idea that came to mind, and I'm obviously not a domain expert :stuck_out_tongue: But I hope it illustrates the kind of thought process that could go into developing a strategy like this.

You'll still need to try it – but I hope Prodigy makes it easy to run these types of experiments and validate your ideas. This was also one of our main motivations when developing the tool: the "hard part" is figuring out what works, so you want to be running lots of small experiments on smaller datasets and do this as fast and as efficiently as possible.
