NER for Financial Text

I am planning to annotate a bunch of financial documents for NER and relation extraction tasks. Each document has a list of bank transactions, so NER will have entities like BankName, BankAccount, TransactionAmount, AccountAction [deposited, withdrew], and AccountOwner. The relations will be like AccountOwner - AccountAction - TransactionAmount - BankAccount.

I would like to pre-annotate the text with a Transformer based Spacy pipeline. This is my first Prodigy project.

Question(1) What Spacy Transformer pipeline should I use for pre-annotation ?

Question(2) What prodigy recipe should I use to pre-annotate with the pipeline from Question(1)

I would appreciate any guidance.

Nick

Pre-annotation mainly works if the entities from the trained model also occur in your dataset. I can imagine that some of the entities like BankName might overlap with the ORG entity and that TransactionAmount might overlap with the MONEY entity. But I would imagine that things like account numbers, as well as the relationships, would require customisation.

If I were starting this I would honestly forget about transformers and would instead focus on just getting your first 100 annotations in manually. From here I might train a lightweight spaCy model first and try and use that to pre-fill any entities. This is partially because it'll be easier to start, but also because it's important to get exposed to data as quickly as possible as it helps you understand the problem better.

You should be able to start with the standard ner.manual recipe, then train with the train recipe and then use that model in the ner.correct or ner.teach recipe.

Then, when it comes to relation extraction, we did indeed just launch the experimental coref component (blogpost) but this is not yet integrated with Prodigy. We do have an example spaCy project (projects/experimental/coref at v3 in explosion/projects on GitHub) that shows you an example of a custom pipeline that trains coref though.

Feel free to ask for more clarification. It might be fun to touch base once in a while when you've reached a milestone :slightly_smiling_face:

Appreciate the advice @koaning . Thanks much for laying out the roadmap in detail. It helps a beginner like me.

I have a choice regarding the structure of the input text. It could be plain text ( lines, sentences, paragraphs) or sentencized text ( a long list of sentences, with one sentence per line. )

I read that ner.manual will tokenize and sentencize the input. I found a "mark" recipe in the Prodigy documentation that keeps the original text structure, so I am thinking of using this recipe.

I understand that if I use pre-sentencized text for creating annotations then the training and production text will have to be pre-sentencized the same way for things to work.

Questions:

(1) Is there any down-side to using the mark recipe ? Will the annotations created using the mark recipe be good for training Spacy NER pipelines ( with transformers and without transformers) ?

(2) I would like to try out some transformers from Huggingface in the future. Is there a way to export Prodigy annotations into a Huggingface transformer training format ?

(3) If the sentencized text has tokens that are not pre-built into spacy language models, can I add those tokens to the spacy vocabulary ? I think spacy has its own vocabulary ( I could be wrong).

Thanks much for the insights.

Regards,
Nick

I'll answer some of your questions inline below.

Question 1

Is there any down-side to using the mark recipe ?

The main one that I can come up with is that the prodigy mark recipe is very general. It can grab a loader and a view_id and just roll with it without having to write a custom recipe. However, if you know 100% for sure that you're interested in named entities then the ner.manual recipe has many more customization options.

In particular, you can leverage patterns! Which can be a very helpful feature. Especially in your case when you're working with account numbers. Some of these can be handled very well by regexes!
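To make the patterns idea concrete, here is a minimal sketch that builds a patterns file in the JSONL format that ner.manual accepts through its --patterns option. The token patterns use spaCy's Matcher syntax. The account number regex (a single token of 8 to 12 digits) is purely an assumption for illustration, so adjust it to whatever your documents actually contain.

```python
import json
import re

# Assumption: account numbers are single tokens made of 8 to 12 digits.
ACCOUNT_REGEX = r"^\d{8,12}$"

# Each entry pairs a label with a list of token descriptions
# written in spaCy's Matcher syntax.
patterns = [
    {"label": "BankAccount", "pattern": [{"TEXT": {"REGEX": ACCOUNT_REGEX}}]},
    {"label": "AccountAction", "pattern": [{"LOWER": {"IN": ["deposited", "withdrew"]}}]},
]


def patterns_to_jsonl(patterns):
    """Serialise the patterns as JSONL: one JSON object per line."""
    return "\n".join(json.dumps(pattern) for pattern in patterns)


jsonl_text = patterns_to_jsonl(patterns)

# Quick sanity check of the regex before annotating with it.
looks_like_account = bool(re.match(ACCOUNT_REGEX, "123456789"))
```

You would write jsonl_text to a file such as patterns.jsonl (a name I made up) and point ner.manual at it. Matches then show up pre-highlighted in the interface, and you only have to correct them.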

Question 2

Is there a way to export Prodigy annotations into a Huggingface transformer training format?

The tricky thing here is that Huggingface might be using a different tokeniser. That means that, theoretically, your annotations might not align when you annotate using a spaCy tokeniser.

That said, if you want to use transformers directly inside of spaCy then you shouldn't have any issue using the annotations from Prodigy. This forum reply explains the process in more detail:

It's only when you want to train Huggingface models without spaCy where you might want to be careful. Also within Huggingface you might come across models that use different tokenisers and you want to make sure that the tokens are compatible. There might be some community translation scripts for this though.
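To illustrate the alignment issue, here is a rough sketch that projects character-offset spans (the "start", "end" and "label" keys that Prodigy stores on each span) onto a different tokenisation. The whitespace tokeniser below is only a stand-in for whatever tokeniser your target model uses, and the helper names are made up for this example.

```python
def tokenize_with_offsets(text):
    """Stand-in tokeniser: whitespace split, returning (token, start, end)."""
    tokens = []
    position = 0
    for piece in text.split():
        start = text.index(piece, position)
        end = start + len(piece)
        tokens.append((piece, start, end))
        position = end
    return tokens


def spans_to_bio(spans, tokens):
    """Give each token a BIO tag based on the character span it falls inside."""
    tags = ["O"] * len(tokens)
    for span in spans:
        inside = [
            i for i, (_, start, end) in enumerate(tokens)
            if start >= span["start"] and end <= span["end"]
        ]
        for n, i in enumerate(inside):
            prefix = "B-" if n == 0 else "I-"
            tags[i] = prefix + span["label"]
    return tags


text = "Nick deposited $500 at Acme Bank"
spans = [
    {"start": 15, "end": 19, "label": "TransactionAmount"},
    {"start": 23, "end": 32, "label": "BankName"},
]
tokens = tokenize_with_offsets(text)
tags = spans_to_bio(spans, tokens)
```

A token that only partially overlaps a span stays "O" in this sketch. Those are exactly the misaligned cases you would want to count and inspect before training.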

Question 3

I think spacy has its own vocabulary ( I could be wrong).

spaCy doesn't have a vocabulary like you might be used to from libraries like scikit-learn. It uses some clever hashing tricks to represent tokens, but also does it in such a way that it can deal with new tokens at runtime. It does come with a vector table with word embeddings, and while this can be interpreted as a vocabulary, the hashing tricks in spaCy take care of new words at runtime during training.

This blogpost explains the hashing trick in more detail if you're interested.
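If it helps, the general idea behind the hashing trick can be shown with a toy sketch. This is not spaCy's actual implementation. The table size, the number of hashes and the use of hashlib are all assumptions made for illustration. The point is only that any string, including one never seen before, maps to a handful of rows in a fixed-size table.

```python
import hashlib

N_ROWS = 1000   # fixed table size, chosen arbitrarily for this sketch
N_HASHES = 4    # number of rows that get combined per token


def rows_for(token):
    """Map any string to N_HASHES row indices in a fixed-size table."""
    rows = []
    for seed in range(N_HASHES):
        digest = hashlib.blake2b(
            token.encode("utf8"), digest_size=8, salt=bytes([seed])
        ).digest()
        rows.append(int.from_bytes(digest, "big") % N_ROWS)
    return rows


# A made-up token still gets rows, so nothing needs to be added up front.
unseen_rows = rows_for("ACC-0042-XYZ")
```

Two different tokens can collide on one row, but they are unlikely to collide on all four, which is why combining several hashed rows works reasonably well in practice.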

This Twitter thread explains the "Vocab" in a bit more detail as well:

Thanks much @koaning for the detailed response. It is a big help.

I will start with ner.manual as it seems like the best place to start ner annotation for my task.

Regards,
Nick

Grand.

By all means, though, feel free to touch base after a while. As you progress, it might be nice to have a long thread on this forum. :smile:

Thanks much @koaning for your responses. With your guidance I have made some progress.

Now, I am using the ner.manual recipe with the en_core_web_sm model to annotate the text documents after converting input text into one sentence per line format by hand.

Some new questions -

(1) Can I create one sentence per line format of the input text files using Spacy / Prodigy ?

(2) The input text doc is made up of several sections. I would like to assign a "section_name" label to each sentence that is a part of the given section. Is that possible in Prodigy?

(3) Is there any way I can export the annotations created using the ner.manual recipe in word|POS|BILUO format ? There is a spacy utility (https://spacy.io/api/cli#convert) that works on a serialized DocBin to export the annotations in the word|POS|BILUO format. Could a process like: Prodigy annotations dataset --> DocBin --> word|POS|BILUO work ?

The purpose behind (3) is to use word|POS|BILUO format data as training data for (non spacy ) models.

Appreciate your patience.

Nick

Question 1

Can I create one sentence per line format of the input text files using Spacy / Prodigy ?

There are some recipes that can split the text on your behalf. The ner.correct and ner.teach recipes segment the input into sentences by default, and both carry an --unsegmented flag that turns this off.

More generally though, the most direct way of getting what you want is to adapt the examples.jsonl file that you pass to Prodigy. Nothing is stopping you from pre-processing this file as you see fit, which includes running spaCy beforehand to turn paragraphs into sentences. You'd use a script that looks something like below:

import spacy 
import srsly

# Best practice: use the same model as you use in Prodigy 
nlp = spacy.load("en_core_web_md") 

examples = srsly.read_jsonl("path/to/examples.jsonl") 

def to_sentences(examples):
    # nlp.pipe expects strings, so pass the "text" field of each example
    texts = (eg["text"] for eg in examples)
    for doc in nlp.pipe(texts):
        # loop over all the detected sentences
        for sent in doc.sents:
            yield {"text": sent.text}

# Create new generator with sentence examples 
new_examples = to_sentences(examples)

# Save them to disk 
srsly.write_jsonl("path/to/sentence_examples.jsonl", new_examples)

One small caveat: the sentence splitting heuristics in spaCy can make mistakes. On social media data in particular they can make errors when punctuation is all over the place.

Another caveat: it's hard to say upfront. But sometimes it's easier for the model to detect spans/entities when you keep neighboring sentences around. This'd mainly be a problem if you have very short sentences and you're trying to detect very long spans with somewhat fuzzy starts and ends. I don't think it'll be much of a problem for you, but I figured it good to at least mention that this might have a consequence for the trained model later.

Question 2

I would like to assign a "section_name" label to each sentence that is a part of the given section. Is that possible in Prodigy?

Sure, that sounds like you'd also want to train a text classifier and Prodigy has support for that. How many categories are you talking about though? Also, can these categories overlap?

You'd probably use an interface like textcat.manual to annotate some examples and then when it is time to train a model you can tell Prodigy to use the two datasets for the two tasks. That command would look something like:

python -m prodigy train out_dir --ner ner_dataset_name --textcat clf_dataset_name --lang en --base-model en_core_web_md

Question 3

Is there any way I can export the annotations created using ner.manual recipe in word|POS|BILUO format ?

As long as you can write a custom Python script that does what you want, you can output the data into any format you like! This isn't something I can give much advice on though, mainly because I'm only familiar with the spaCy and scikit-learn ecosystems when it comes to file formats.

If you're going down the custom scripting route, you'll probably want to use this example as a starting point. In particular, you'd probably write something like:

from prodigy.components.db import connect

def turn_into_custom_format(example):
    # you'd need to implement this yourself
    pass

db = connect()                                  # uses settings from prodigy.json
dataset = db.get_dataset("name_of_dataset")     # retrieve a dataset

new_dataset = (turn_into_custom_format(e) for e in dataset)

The db.get_dataset method returns a list of dictionaries that you can edit any way you see fit.
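As a hedged starting point for the conversion function, here is a sketch that turns one annotated example into word|TAG strings using the BILUO scheme. It assumes the example has a "tokens" list and "spans" carrying "token_start" and "token_end" (inclusive), which is what ner.manual saves. Annotations from ner.manual don't include part-of-speech tags, so the POS column is left out here. You would have to add it by running a tagger yourself. The sample example is hand-made and trimmed to the keys the function reads.

```python
def to_biluo_lines(example):
    """Turn one annotated example into 'word|TAG' strings (BILUO scheme)."""
    tags = ["O"] * len(example["tokens"])
    for span in example.get("spans", []):
        first, last = span["token_start"], span["token_end"]  # last is inclusive
        label = span["label"]
        if first == last:
            tags[first] = "U-" + label          # single-token entity
        else:
            tags[first] = "B-" + label          # beginning
            for i in range(first + 1, last):
                tags[i] = "I-" + label          # inside
            tags[last] = "L-" + label           # last
    return [f"{tok['text']}|{tag}" for tok, tag in zip(example["tokens"], tags)]


# Hand-made example; real records carry more keys (offsets, ids, answer).
example = {
    "text": "Nick deposited $500 at Acme Bank",
    "tokens": [{"text": t} for t in ["Nick", "deposited", "$", "500", "at", "Acme", "Bank"]],
    "spans": [
        {"token_start": 0, "token_end": 0, "label": "AccountOwner"},
        {"token_start": 1, "token_end": 1, "label": "AccountAction"},
        {"token_start": 2, "token_end": 3, "label": "TransactionAmount"},
        {"token_start": 5, "token_end": 6, "label": "BankName"},
    ],
}
biluo_lines = to_biluo_lines(example)
```

When looping over a real dataset you'd probably also want to skip examples whose "answer" is not "accept", so that rejected or ignored examples don't end up in the training data.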

@koaning thanks much for the detailed response. It is very helpful.

I have tried creating NER annotations in three different ways -

(1) Using ner.manual : I see the text and token spans and the NER annotation spans in the annotated dataset
(2) Using mark : I see the text and the NER entity spans in the annotated dataset
(3) Using ner.correct: I see the text, the tokens, the sentences and the NER entity spans in the annotated dataset.

My ultimate goal is to train a Spacy pipeline to extract my ( custom) NER entities that I am annotating using Prodigy.

I have some entities that are 6 to 10 tokens long. So I am also interested in trying out SpanCategorizer in the future. The entity spans do not overlap but are long.

Questions-
(1) When it comes to training the Spacy NER pipeline / Spacy SpanCategorizer pipeline with my hand annotated dataset, would any particular dataset provide better training than the others ?

In other words, will the dataset created with mark recipe be as effective for training a Spacy pipeline as the other two datasets ( one created with ner.manual and the other created with ner.correct ?)

(2) If all the three datasets ( mark, ner.manual, ner.teach) are equally effective in training custom NER extraction pipeline / SpanCategorizer pipeline then what is the advantage ( beyond say training spacy) in using ner.manual and ner.correct over mark ?

The more I explore and learn about the Prodigy and Spacy ecosystem, the more I am fascinated by what I see.

Nick

Answers.

When it comes to training the Spacy NER pipeline / Spacy SpanCategorizer pipeline with my hand annotated dataset, would any particular dataset provide better training than the others ?

You can combine datasets from different recipes! In general, with a bit of hand waving, more data is always better, assuming that all of the data you're adding is relevant to your problem.

Note: when you train a spaCy model via Prodigy, you can pass multiple datasets. The command would look something like:

prodigy train --ner dataset1,dataset2,dataset3 ...

If all the three datasets ( mark, ner.manual, ner.teach) are equally effective in training custom NER extraction pipeline / SpanCategorizer pipeline then what is the advantage ( beyond say training spacy) in using ner.manual and ner.correct over mark ?

At a high level, I think these are the major differences to consider:

  1. The mark recipe is meant as a general recipe where you can select the view_id and the loader without having to write your own custom recipe. If ner.manual already works for you, I'd pick that because it has more customisation options.
  2. The ner.manual recipe is pretty straightforward, but it requires manual labor. You'd need to move your cursor over the tokens to annotate them which can be time consuming.
  3. The ner.teach recipe gives a binary interface. That means that you can just say "yes/no" to an annotation, which might be a much faster way to annotate. You would need a method that can pre-annotate the examples for this to work though.

Thanks much for the insights @koaning.

Here is my final plan of action -

(1) Pre-process examples text to one sentence per line format.
(2) Annotate the preprocessed examples with ner.manual
(3) The section_name can be inferred from rules, so textcat ML model will not be required.
(4) Train and test NER pipeline ( transformer based; with pre-training and fine tuning)
(5) Convert ner.manual annotations data to spancat format. It should be easy.
(6) Train and test a spancat model ( transformer based; with pre-training and fine tuning) using dataset from step (5).

My task does not have overlapping entity spans but there are long ( 5 to 10 tokens in length) entities. So I believe spancat might do better with the long entities. Hence steps(5) and (6).

I have found many references in Spacy and Prodigy documentation about training a transformer model for NER. But it is not clear if that is pre-training or fine-tuning ( because of my ignorance of the spacy framework.)

Would it be possible to point me to a tutorial or example notebook for -
(1) pre-training a transformer based NER pipeline
(2)pre-training a transformer based spancat pipeline
(3)fine-tuning a transformer based NER pipeline
(4)fine-tuning a transformer based spancat pipeline

I think the pre-training steps (1) and (2) are one and the same, as the text data is identical, so only one round of pre-training should do for both NER and spancat.

Thanks much for the help.

Let's mention a few things here.

You can try out a transformer, but I recommend also trying a lightweight CNN model. If the lightweight model performs nearly as well it might be a better option for production because it's able to run much faster.

With regards to training with transformers, the simplest way to think about it is that you merely need to make a change to the config.cfg file that has all the parameters for the spaCy model that you'll train.

You might appreciate the answer I've given here:

Great point!

My updated plan now looks like -

(1) Pre-process examples text to one sentence per line format.
(2) Annotate the preprocessed examples with ner.manual
(3) The section_name can be inferred from rules, so textcat ML model will not be required.
(4) Train and test NER pipeline (with lightweight CNN models )
(5) Train and test a spancat model ( with lightweight CNN models)
(6) Try (4) and (5) with Transformers based models, if desired for whatever reason.
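For step (3), a rule-based pass could look roughly like the sketch below. It walks over the lines of a document, remembers the most recent heading, and attaches it to every following sentence under a "meta" key, which Prodigy displays alongside the example. The heading rule (a short capitalised line ending in a colon) and the section names are hypothetical, so swap in whatever actually marks a section in your documents.

```python
import re

# Hypothetical rule: a heading is a short capitalised line ending in a colon.
HEADING = re.compile(r"^(?P<name>[A-Z][A-Za-z ]{2,40}):$")


def add_section_names(lines):
    """Yield one task per sentence line, tagged with its section name."""
    section = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        match = HEADING.match(line)
        if match:
            # Remember the heading; it is not an example to annotate itself.
            section = match.group("name")
            continue
        yield {"text": line, "meta": {"section_name": section}}


document = [
    "Deposits:",
    "Nick deposited $500 at Acme Bank.",
    "",
    "Withdrawals:",
    "Nick withdrew $200 from Acme Bank.",
]
tasks = list(add_section_names(document))
```

The resulting tasks can be written to a JSONL file and used as the input for ner.manual, so the section name travels with each annotation without needing a text classifier.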

I now have a good understanding of Prodigy usage and I also have a good plan for constructing my NER application.

Thanks much @koaning for your patience and expert help that made the above possible.

No worries!

Keep me posted though!