NER for Financial Text

I'll answer some of your questions inline below.

Question 1

Is there any downside to using the mark recipe?

The main one I can think of is that the Prodigy mark recipe is very general: it can grab a loader and a view_id and just roll with it, without you having to write a custom recipe. However, if you know 100% for sure that you're interested in named entities, then the ner.manual recipe has many more customization options.

In particular, you can leverage patterns, which can be a very helpful feature, especially in your case since you're working with account numbers. Some of these can be handled very well by regexes!
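
To give a feel for what such a pattern looks like, here's a minimal sketch that matches a made-up account-number format (8 to 12 digits) with spaCy's Matcher. The label name and the regex are just assumptions for illustration:

```python
import spacy
from spacy.matcher import Matcher

nlp = spacy.blank("en")
matcher = Matcher(nlp.vocab)

# The REGEX operator matches against the text of a single token, so this
# assumes the account number survives tokenisation as one token.
matcher.add("ACCOUNT_NUMBER", [[{"TEXT": {"REGEX": r"^\d{8,12}$"}}]])

doc = nlp("Please move the funds out of account 12345678 today.")
for match_id, start, end in matcher(doc):
    print(doc[start:end].text)  # -> 12345678
```

The same token pattern, written as one line in a JSONL file, e.g. `{"label": "ACCOUNT_NUMBER", "pattern": [{"TEXT": {"REGEX": "^\\d{8,12}$"}}]}`, can be passed to ner.manual via the `--patterns` flag so that candidate matches come pre-highlighted in the annotation UI.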

Question 2

Is there a way to export Prodigy annotations into a Huggingface transformer training format?

The tricky thing here is that Huggingface might be using a different tokeniser. That means that your annotations, made with a spaCy tokeniser, might not align with the tokens a Huggingface model expects.

That said, if you want to use transformers directly inside of spaCy, then you shouldn't have any issues using the annotations from Prodigy. This forum reply explains the process in more detail.
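
As a rough sketch of what happens under the hood: Prodigy's data-to-spacy command turns your annotations into a DocBin that spacy train can consume. The hand-rolled version below (the example annotation and file name are made up) shows the same idea:

```python
import spacy
from spacy.tokens import DocBin

nlp = spacy.blank("en")
db = DocBin()

# Prodigy-style annotations: raw text plus character-offset spans.
examples = [
    {
        "text": "Transfer funds from account 12345678.",
        "spans": [{"start": 28, "end": 36, "label": "ACCOUNT_NUMBER"}],
    },
]

for eg in examples:
    doc = nlp(eg["text"])
    # char_span returns None if the offsets don't line up with token
    # boundaries, which is exactly the misalignment issue mentioned above.
    ents = [
        doc.char_span(s["start"], s["end"], label=s["label"])
        for s in eg["spans"]
    ]
    doc.ents = [e for e in ents if e is not None]
    db.add(doc)

db.to_disk("train.spacy")
```

From there, switching to a transformer backbone is a matter of the training config; the annotations themselves don't need to change.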

It's only when you want to train Huggingface models without spaCy that you might want to be careful. Also, within Huggingface you might come across models that use different tokenisers, and you want to make sure that the tokens are compatible. There might be some community translation scripts for this, though.
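
If you do go down that road, the usual trick is to re-align Prodigy's character-offset spans against the Huggingface tokeniser's offset mapping. Here's a rough, unofficial sketch of that idea; the model name and BIO label scheme are just examples:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

example = {  # one annotated task, as Prodigy stores it
    "text": "Transfer funds from account 12345678.",
    "spans": [{"start": 28, "end": 36, "label": "ACCOUNT_NUMBER"}],
}

enc = tokenizer(example["text"], return_offsets_mapping=True)
labels = []
for start, end in enc["offset_mapping"]:
    if start == end:  # special tokens like [CLS]/[SEP]
        labels.append("O")
        continue
    label = "O"
    for span in example["spans"]:
        if start >= span["start"] and end <= span["end"]:
            # B- for the first subword inside the span, I- otherwise
            prefix = "B" if start == span["start"] else "I"
            label = f"{prefix}-{span['label']}"
    labels.append(label)

print(list(zip(tokenizer.convert_ids_to_tokens(enc["input_ids"]), labels)))
```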

Question 3

I think spaCy has its own vocabulary (I could be wrong).

spaCy doesn't have a fixed vocabulary like you might be used to from libraries such as scikit-learn. It uses some clever hashing tricks to represent tokens, in such a way that it can deal with new tokens at runtime. It does ship with a vector table of word embeddings, and while that table can be interpreted as a vocabulary, it's the hashing trick that takes care of new words during training and inference.
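
A tiny demo of that behaviour: every token text is mapped to a 64-bit hash, so even strings spaCy has never seen before get a representation without any vocabulary lookup failing. The example sentence is made up:

```python
import spacy

nlp = spacy.blank("en")
doc = nlp("Wire the funds to account 12345678.")

for token in doc:
    # token.orth is the hash of the token text; the StringStore can map
    # it back to the original string on demand.
    print(token.text, token.orth, nlp.vocab.strings[token.orth])
```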

This blogpost explains the hashing trick in more detail if you're interested.

This Twitter thread explains the "Vocab" in a bit more detail as well:

https://twitter.com/spacy_io/status/1511717758563667980