Teaching a spaCy model to attend to the right n-gram

I have a task where I need to do some structured information extraction on exchange statements. So far I've been using a rule-based approach with the spaCy `Matcher` together with some sentence logic.

However, I'd prefer migrating to an ML approach where I teach a model to attend to the tokens of interest, but I'm not sure if that's possible in spaCy. I love the spaCy framework, so I'd prefer to stay within it.

Given an exchange statement like this, I'd like to extract something like OUTLOOK-EBIT_MARGIN, which would be "around the same level as last year". I imagine having a list of n-grams with hundreds of features each, training a classifier, and extracting the highest-scoring n-gram. As features I'd have x/y coordinates, bold, italic, etc. The text itself is not enough, since the information could be found in tables as well.
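To make the idea concrete, here is a minimal sketch of scoring n-gram candidates with layout features using a plain scikit-learn classifier. The feature layout (`[x, y, is_bold, is_italic]`), the toy training data, and the candidate spans are all invented for illustration; a real system would extract these from the PDF/HTML layout.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Each candidate n-gram -> hypothetical layout features [x, y, is_bold, is_italic]
candidates = [
    ("around the same level as last year", [0.1, 0.4, 0, 0]),
    ("EBIT margin", [0.1, 0.35, 1, 0]),
    ("contact us at", [0.8, 0.95, 0, 0]),
]
X = np.array([feats for _, feats in candidates])

# Toy training data: 1 = the span we want to extract, 0 = anything else.
X_train = np.array([
    [0.1, 0.4, 0, 0],
    [0.9, 0.9, 0, 0],
    [0.1, 0.3, 1, 0],
    [0.7, 0.8, 0, 1],
])
y_train = np.array([1, 0, 0, 0])

clf = LogisticRegression().fit(X_train, y_train)

# Score every candidate and keep the highest-scoring n-gram.
scores = clf.predict_proba(X)[:, 1]
best = candidates[int(np.argmax(scores))][0]
print(best)
```

The classifier itself is interchangeable; the point is that candidate generation plus a feature-based ranker can live outside spaCy and only the result needs to flow back into the pipeline.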

Let me know if this is something that can be done in spaCy or if you have some comments. Thank you.

I should add that the exchange statements can be very different and might not have nice headers like this one.

Unfortunately, you can't really do that easily in spaCy; you'll need to code it directly in PyTorch. Once you have the component, you could integrate it into a spaCy pipeline, but spaCy won't really help with getting the model to work.
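For the integration part, a custom component wrapping an external model looks roughly like this in spaCy v3. `my_torch_model`, the `outlook_extractor` name, and the `outlook` extension are all stand-ins for your own model and naming:

```python
import spacy
from spacy.language import Language
from spacy.tokens import Doc

# Custom attribute to hold the component's prediction.
Doc.set_extension("outlook", default=None, force=True)

def my_torch_model(text):
    # Placeholder for the real PyTorch forward pass.
    return "SAME" if "same level" in text else None

@Language.component("outlook_extractor")
def outlook_extractor(doc):
    doc._.outlook = my_torch_model(doc.text)
    return doc

nlp = spacy.blank("en")
nlp.add_pipe("outlook_extractor")
doc = nlp("EBIT margin is expected to be around the same level as last year.")
print(doc._.outlook)  # SAME
```

spaCy only sees a function from `Doc` to `Doc` here, so whatever happens inside the wrapper (PyTorch, ONNX, rules) is up to you.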

Is there a way you could make the task simpler? For instance, maybe you could have one classification scheme marking which metric the region of text is referring to, with EBIT_MARGIN as a category. Then you could have four classes, like UPWARD, DOWNWARD, SAME, NUMBER. If the prediction is NUMBER, you go and get the numeric value from the text. This saves you from doing any text recognition, so you don't need to worry about the n-grams at all.
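That scheme maps directly onto spaCy's `textcat` component. A minimal, untrained setup (just the wiring, no annotated examples) might look like this:

```python
import spacy

nlp = spacy.blank("en")
textcat = nlp.add_pipe("textcat")

# The four mutually exclusive direction classes from the suggestion above.
for label in ("UPWARD", "DOWNWARD", "SAME", "NUMBER"):
    textcat.add_label(label)

# Initialize with random weights; real use needs annotated training data.
nlp.initialize()

doc = nlp("EBIT margin will be around the same level as last year.")
print(sorted(doc.cats))  # ['DOWNWARD', 'NUMBER', 'SAME', 'UPWARD']
```

The scores in `doc.cats` are meaningless until the component is trained, but the structure shows how the four-class idea fits into a standard spaCy pipeline.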

Thank you for getting back and always being so helpful!

I figured you'd say something like that, and I'm actually just about to experiment with an approach like the one you outlined. Unfortunately, I can't extract the original wording with that approach, right? I'll discuss with my product owner whether it's still a good use case.

Since I'm working with financial texts, I'm considering training my own word vectors, i.e. my own sense2vec, but I noted you wrote this:

To train your own sense2vec vectors, you'll need the following:

  • A very large source of raw text (ideally more than you'd use for word2vec, since the senses make the vocabulary more sparse). We recommend at least 1 billion words.

I just checked, and I have ~150,000 HTML documents that have been split into ~2.7M raw text paragraphs, which hold ~100M "words" in total (some of the content is phone numbers, emails, etc.), so I suppose you wouldn't recommend going down that road?

100M words is a bit small, yes. But maybe you can find other publicly available texts you can add that are related?

Training a sense2vec model is relatively easy and not very expensive, so you can always try and see. The sense2vec package also includes several Prodigy recipes for evaluating vectors, so this should give you an idea whether your vectors are useful or not. (It gets pretty obvious quickly if the vectors are bad and not useful.)

You're right, I'll try it out and see how it goes (unless you just knew for a fact already :wink: ). Thanks for your answer!

Now I have a few questions, but maybe they should be asked in another thread (if so, let me know).

I just realised that about 1 in 100 documents is actually not in English, so I suppose I should train a model to filter those out first, or would you propose another solution?
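One option that avoids training anything: an off-the-shelf language identifier (e.g. the `langdetect` package or fastText's language ID model) is usually enough for this. The idea can be sketched with a cheap, pure-stdlib stopword heuristic; the stopword set and the 0.2 threshold here are invented for illustration:

```python
# Tiny set of common English function words (illustrative, not exhaustive).
ENGLISH_STOPWORDS = {"the", "is", "to", "of", "and", "a", "in", "be", "as", "than", "last", "year"}

def looks_english(text, threshold=0.2):
    """Treat text as English if enough tokens are common English stopwords."""
    tokens = [t.strip(".,").lower() for t in text.split()]
    if not tokens:
        return False
    hits = sum(t in ENGLISH_STOPWORDS for t in tokens)
    return hits / len(tokens) >= threshold

paragraphs = [
    "EBIT margin is expected to be around the same level as last year.",
    "Resultatet forventes at ligge på niveau med sidste år.",  # Danish
]
english_only = [p for p in paragraphs if looks_english(p)]
print(len(english_only))  # 1
```

A real language identifier is far more robust (short texts, mixed content like phone numbers), but either way this stage is just a filter in front of the rest of the pipeline.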

Do you have any best practices for creating a pipeline that uses multiple models, one after another? For instance:

  1. Filter out garbage (non-English text).
  2. Tag paragraphs (is it a headline, is it content, or ...).
  3. Tag content with context from the last headline. (It might be better to combine 2 and 3, though. At the moment I have ~5 major categories, each with 5-50 subcategories. I wonder if it's better to just have 150 categories and discard the subcategory idea.)
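The three steps above can be sketched as a plain list of stage functions applied in order; the stage bodies here are stubs standing in for real models, and the dict-based document format is an assumption:

```python
def filter_garbage(paragraphs):
    # Stub for step 1: a real stage would run a language/garbage classifier.
    return [p for p in paragraphs if not p.get("garbage")]

def tag_paragraph_type(paragraphs):
    # Stub for step 2: headline vs. content (all-caps as a toy rule).
    for p in paragraphs:
        p["type"] = "headline" if p["text"].isupper() else "content"
    return paragraphs

def attach_headline_context(paragraphs):
    # Step 3: carry the most recent headline forward as context.
    current = None
    for p in paragraphs:
        if p["type"] == "headline":
            current = p["text"]
        else:
            p["context"] = current
    return paragraphs

pipeline = [filter_garbage, tag_paragraph_type, attach_headline_context]

docs = [
    {"text": "OUTLOOK"},
    {"text": "EBIT margin around the same level as last year."},
]
for stage in pipeline:
    docs = stage(docs)
print(docs[1]["context"])  # OUTLOOK
```

Keeping each stage as a function over the same document structure makes it easy to reorder stages or merge steps 2 and 3 later.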

So far I've built my own Python package that utilises different models, but you might have seen this kind of approach a lot and have a framework for it?