I am new to the community, but I really think spaCy and Prodigy look promising for the challenge I am facing. I will outline my project and then list some questions about it.
I want to be able to extract facts from an earnings report like example one and example two.
From those two examples I want to extract something like the sales figures, i.e.:
Example one: Electrolux Q2 Sales 31,354m SEK
Example two: Metso Q2 Sales 786m EUR
Note that in the second example, sales are given for both the first half-year and the second quarter. I only want to extract the quarterly facts.
My questions are:
Are spaCy and Prodigy the right tools to approach the challenge?
In a very broad way, what is the best approach/pipeline to solve it? I imagine applying preprocessing, then setting up training in Prodigy to train an NER model (to learn Sales). I also need to classify whether the paragraph refers to the quarter or the half-year period, but the challenge is that this information might not be given by the paragraph alone.
How would one prepare the data for labeling, and what is the best way to label in Prodigy? Would you label both sales and the period (like second quarter, Q2 etc.)? Or would you rather label all listed numbers as candidates for sales and then apply a classifier?
I know these are very broad questions, but I want to make sure that I don't follow the wrong path from the start. Any help is much appreciated, and I am very eager to discuss the best design/approach with anyone. Thank you.
This is an interesting project and definitely sounds like something that can be solved by NLP. Large-scale information extraction (including things like "populate a database from free-form text") is a use case where NLP really shines and something that is already working very well across research and production. I think what it really comes down to is breaking the larger, abstract goal down into smaller machine learning tasks, and finding out what works best for each individual component.
If you haven't seen it already, you might find @honnibal's talk on how to approach NLP problems helpful. The examples around 11:38 are especially relevant, because they show a similar information extraction task, and common pitfalls and solutions.
I'd recommend starting with generic categories and building on top of them to extract more specific information. spaCy's pre-trained English models can already predict many of the entity types that are relevant to your problem – e.g. ORG, MONEY, TIME and even things like PERCENT. It probably won't work perfectly on your data out-of-the-box, since earnings reports are pretty different from the general news and web corpus the models were trained on. But it's a good starting point and you can leverage this for your own custom model.
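For example, to get a quick feel for what the pre-trained model already predicts on your text, you can load it and print the entities it finds – just a sketch with a made-up sentence, and the exact predictions on your real reports will of course vary:

import spacy

nlp = spacy.load('en_core_web_sm')
doc = nlp("Electrolux reported net sales of SEK 31,354m for the second quarter.")
for ent in doc.ents:
    # e.g. ORG, MONEY, DATE – whatever the pre-trained model predicts here
    print(ent.text, ent.label_)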
The ner.teach recipe in Prodigy is designed to help you fine-tune a pre-trained model by giving it feedback on its predictions. So you could start with the pre-trained English model, pick one label (e.g. ORG), load in your earnings reports and start clicking accept or reject. The suggestions you'll see are the ones the model is most uncertain about, e.g. the ones with a score closest to 0.5. (The idea here is that this gives you the examples that make the most difference for training). Once you're done, you can run ner.batch-train to see how the model improves with the new data.
Another labelling strategy could be to take an existing pre-trained model, get its predictions for a given set of entity types and correct them manually. This is much faster than labelling everything from scratch, because even if your model only gets 50% right, you'll only have to correct the rest. That workflow is built into Prodigy as the ner.make-gold recipe.
Of course it's difficult to give a definitive answer here – you'll have to run a few experiments and try things. But thinking of it as a pipeline of different components that build on top of each other definitely makes sense. Here's a possible strategy:
Use the entity recognizer to extract the underlying generic concepts like company names and money. You can take advantage of pre-trained models and fine-tune them on your data, which should be significantly more efficient than doing everything from scratch.
Add rules wherever rules are better than statistical models. For example, I could imagine that things like "Q2 2018" might be easier to extract by writing a few match patterns. spaCy's rule-based matcher lets you write patterns similar to regular expressions, but using tokens and their attributes (including lemmas, part-of-speech tags and dependencies) – see the sketch after this list.
Use the dependency parser to extract relationships between the spans extracted in the previous steps. For example, there might be many mentions of amounts of money, but you only care about the ones that are about sales. By looking at the relationships between the entities and the surrounding words (subject/object, is it attached to a specific verb, etc.), you'll be able to get a better idea of whether it's information you care about or not. Again, you can take advantage of pre-trained models here.
Maybe train a text classification component to assign labels to whole sentences or paragraphs. This might work well for reports that are less information-dense and/or long-winded. A text classifier can also be helpful to distinguish between relevant information and "noise" – for example, if you're dealing with reports that include both company finances and other stuff, you could train a classifier to predict whether the text is about company finances and filter out the rest before you analyse the data further.
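To make the rules idea a bit more concrete, here's a minimal sketch of a match pattern for quarter references like "Q2 2018", using the current Matcher API – the exact token attributes you'd want depend on how quarters are actually written in your reports:

import spacy
from spacy.matcher import Matcher

nlp = spacy.load('en_core_web_sm')
matcher = Matcher(nlp.vocab)
# "Q1"–"Q4" followed by a four-digit year, e.g. "Q2 2018"
pattern = [{'TEXT': {'REGEX': '^Q[1-4]$'}}, {'SHAPE': 'dddd'}]
matcher.add('QUARTER', [pattern])

doc = nlp("Net sales in Q2 2018 totalled 864 million.")
for match_id, start, end in matcher(doc):
    print(doc[start:end].text)  # "Q2 2018"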
To elaborate a bit on combining statistical predictions and rules, and taking advantage of the dependency parser and tagger (which are often very underappreciated IMO), here's what a simplified sentence from one of your examples – something like "Sales totalled 864 million" – looks like in our displaCy dependency visualizer.
If you know that the tokens "864 million" are a MONEY entity (e.g. because your entity recognizer predicted it), you can walk up the tree and check how it attaches to the rest of the sentence: in this case, the phrase is a direct object attached to a verb with the lemma "total" (to total) and the subject of the sentence is "sales". Once you start working with your data, you might find a lot of similar patterns that let you cover a large number of relevant sentences.
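In code, that tree-walking logic could look roughly like this – a sketch that assumes the entity recognizer labels "864 million" as MONEY and the parser gets the attachments right, which you'd want to verify on your own data:

import spacy

nlp = spacy.load('en_core_web_sm')
doc = nlp("Sales totalled 864 million.")

for ent in doc.ents:
    if ent.label_ != 'MONEY':
        continue
    verb = ent.root.head  # walk up from the entity to the word it attaches to
    if verb.lemma_ == 'total':
        # check whether the verb's subject is "sales"
        subjects = [tok for tok in verb.lefts if tok.dep_ == 'nsubj']
        if any(tok.lower_ == 'sales' for tok in subjects):
            print('Sales figure:', ent.text)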
Again, if you just run the pre-trained model on your text out-of-the-box, you might find that the predictions aren't that great, because your texts are quite specific. But you can fine-tune those as well, just like you do with the entity recognizer. In Prodigy, you can use the pos.teach and dep.teach recipes to fine-tune the part-of-speech tagger and dependency parser. I wouldn't bother too much with all the random obscure labels and focus on the most important ones: nouns and verbs, and subject/object relationships. If the model gets those right, you'll be able to work with that and the rest will fall into place much more easily. NOUN and VERB are also pretty easy to annotate, even without a deep background in linguistics.
Btw, focusing on training the generic components also means that you'll end up with a pretty powerful base model with a part-of-speech tagger, dependency parser and named entity recognizer fine-tuned on company reports. Having that in your toolbox will be incredibly useful, even if you end up needing to analyse different things in the future. If you have the basics covered, you can always add different rules and components on top of them.
One thing that's important to consider is that NLP models can generally only look at a very narrow context window. This is especially true for named entity recognition and short text classification – but even for long text classification, a common strategy is to classify shorter fragments and average over the predictions.
So as a rule of thumb, we often recommend the following: If you are not able to make the annotation decision based on the local context (e.g. one paragraph), the model is very unlikely to be able to learn anything meaningful from that decision.
Of course that doesn't mean that your problem is impossible to solve – it might just need a slightly different approach that leverages what's possible/easy to predict and takes into account the limitations. I'm not an expert in earnings reports, but let's assume you have a lot of reports that kinda look like the Electrolux one. The information about the period / time frame is encoded in a headline and everything following that headline implicitly refers to the second quarter. So one approach could be this:
Detect the headlines. Maybe you only need rules here, because headlines are pretty distinct from the rest of the text. Or maybe it's better to train a text classifier to predict HEADLINE. This depends on the data and is something you need to experiment with.
Associate text with headlines. This one is hopefully easy – we can assume that all text up to the next headline belongs to the previous headline.
Detect whether a headline references a quarter and if so, normalize that to a structured representation. For example, you might want a function that does this: "second quarter of 2018" → {'q': 2, 'year': 2018}. Custom attributes in spaCy are a great way to attach this type of data to documents and paragraphs btw.
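For that last step, a plain Python helper might be all you need – here's a rough sketch (the function name and the patterns it handles are made up for illustration, and you'd want to extend them to cover your real headlines):

import re

def parse_quarter(headline_text):
    # return e.g. {'q': 2, 'year': 2018}, or None if no quarter is mentioned
    text = headline_text.lower()
    quarter = None
    for word, num in [('first', 1), ('second', 2), ('third', 3), ('fourth', 4)]:
        if word + ' quarter' in text:
            quarter = num
    abbrev = re.search(r'\bq([1-4])\b', text)
    if abbrev:
        quarter = int(abbrev.group(1))
    year = re.search(r'\b(19|20)\d{2}\b', text)
    if quarter and year:
        return {'q': quarter, 'year': int(year.group(0))}
    return None

print(parse_quarter('Second quarter of 2018'))  # {'q': 2, 'year': 2018}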
As I said, this is just an idea that came to mind, and I'm obviously not a domain expert. But I hope it illustrates the kind of thought process that could go into developing a strategy like this.
You'll still need to try it – but I hope Prodigy makes it easy to run these types of experiments and validate your ideas. This was also one of our main motivations when developing the tool: the "hard part" is figuring out what works, so you want to be running lots of small experiments on smaller datasets and do this as fast and as efficiently as possible.
Thank you so much for putting time and effort into such a perfectly outlined answer. I've been looking into some of your work (explosion.ai) this weekend and I really admire your philosophy and the quality you bring to the market (I enjoyed your keynote talk at PyData as well!).
Anyhow, I bought a licence yesterday and I can't wait to get started. There seems to be an issue with acquiring the product for some reason, so I dropped you an email at contact@explosion.ai.
Btw, regarding your second point in the last paragraph where you suggest associating text with headlines: is there a good way to do that in spaCy (maybe using custom attributes is a good fit for this as well), or should I just use a tuple or dict or whatever?
(And sorry about the order problem – looks like the accidental double payment messed up the system and it failed to associate the payment with the order. I sent you an email with the new order and download link.)
Yes, custom attributes could be good for that, too! In general, I'd recommend using the Doc objects as the "single source of truth" of your application that holds all the information about the document. spaCy's data structures (Doc, Span, Token) are optimised for performance and efficiency and preserve all information of the original text. Given any token, you'll always be able to recover its original position and its relationships with other tokens in the document. This is super powerful and something you don't want to lose in your application. (A mistake people sometimes make is converting information to strings and simpler data structures like lists or data frames way too early.)
For the headlines, you could maybe do something like this and assign the Span of the headline and its metadata as custom attributes on the token level. The token level might be nice here, because it's the smallest possible unit. For any token or span of tokens you create later on (e.g. via the entities in doc.ents or via rule-based matches), you'll then be able to retrieve the headline it refers to.
import spacy
from spacy.tokens import Token

# register the extensions token._.headline and token._.year
Token.set_extension('headline', default=None)
Token.set_extension('year', default=None)

nlp = spacy.load('en_core_web_sm')
doc = nlp("This is a headline. This is some text.")
headline = doc[0:5]  # this is a Span object including tokens 0-4
for token in doc[5:10]:  # the rest of the text
    token._.headline = headline  # assign headline to tokens
    # set structured data on the token – this could come from
    # a function you write that parses the headline text – in
    # "real life", you'd probably want to do this more elegantly
    token._.year = get_year_from_headline(headline)
For any token in the document – for example, a token in a MONEY entity – you'll now be able to check its ._.year attribute to find out whether it's associated with a year, based on its headline.
Good point on the Doc object. I could imagine that would be a common pitfall. The first input is actually an .xml file that contains the HTML code. Would you keep the whole thing in the Doc, or strip out some of the XML and the HTML first?
For ner.teach, I need to point to a data source. Should I transform the list of .xml files into a specific format, like a single .jsonl file? Or can I take care of it in a custom recipe instead? I imagine I should use something with a stream to follow best practice?
If you have control over your input data during training and runtime (e.g. if you can pre-process your text before analysing it with your runtime model), it probably makes sense to spend that extra time and write a preprocessing script that parses and cleans your XML, strips out the HTML markup and other metadata and normalises the text if necessary (unicode stuff etc.) This usually makes training easier, because you don't have to spend time teaching the model how to deal with broken and leftover markup.
If you haven't seen it yet, I'd recommend checking out textacy, which has a bunch of really useful preprocessing utilities for normalising whitespace, fixing mojibake, stripping out URLs and so on.
Yes, you could either do this at runtime using a custom loader, or in a separate pre-processing step that outputs JSON or JSONL. What you decide to do really depends on the data and your personal preference.
XML isn't always the most practical format, so it might be useful to start by converting everything into a format that's easier to work with. JSONL is pretty nice because it can be read in line by line, so you don't run into performance issues with really large files.
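Just to illustrate, a minimal pre-processing script could look something like this – it assumes the markup is HTML-like and that you only want the plain text, and the file and directory names are placeholders:

import json
from pathlib import Path
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    # collect only the text content, ignoring all tags
    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)

def extract_text(path):
    parser = TextExtractor()
    parser.feed(Path(path).read_text(encoding='utf8'))
    # collapse the whitespace left behind by the markup
    return ' '.join(' '.join(parser.parts).split())

with open('reports.jsonl', 'w', encoding='utf8') as f:
    for xml_file in Path('reports').glob('*.xml'):
        # one JSON object per line – the "text" key is what Prodigy expects
        record = {'text': extract_text(xml_file), 'meta': {'source': xml_file.name}}
        f.write(json.dumps(record) + '\n')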
Cool, I'll take care of the preprocessing. However, should I include a whole earnings report in each row of the JSONL? I've just noticed that you recommend only giving it small phrases, so I'm not entirely sure how to chunk my earnings reports into a training JSONL dataset. The reports also include some markup tables, and I suppose I should handle those without spaCy. I imagine a pipeline where I first do some "document layout analysis", then send some of the document to spaCy and the rest to a table parser.