I'll answer some of your questions inline below.
Question 1
Is there any down-side to using the mark recipe ?
The main one I can think of is that Prodigy's mark recipe is very general. It takes a loader and a view_id and just rolls with them, so you don't have to write a custom recipe. The trade-off is that it offers nothing task-specific. If you know for sure that you're interested in named entities, the ner.manual recipe has many more customization options.
In particular, you can leverage patterns, which pre-highlight likely entities so that you only have to accept or correct them. That can be very helpful in your case, since you're working with account numbers: many of these can be matched very well by regexes!
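To make the regex idea concrete, here is a minimal sketch of a patterns file for account numbers. The ACCOUNT_NUMBER label and the 8 to 12 digit regex are assumptions for illustration, so swap in whatever matches your own data. The pattern uses spaCy's token-level REGEX operator, which means it matches one token at a time; account numbers that contain dashes or spaces get split into several tokens and would need a multi-token pattern instead.

```python
import json
import re

# Assumption: account numbers are single tokens of 8 to 12 digits.
ACCOUNT_REGEX = r"^\d{8,12}$"

# Assumption: ACCOUNT_NUMBER is a hypothetical label name.
patterns = [
    {"label": "ACCOUNT_NUMBER", "pattern": [{"TEXT": {"REGEX": ACCOUNT_REGEX}}]},
]


def patterns_to_jsonl(patterns):
    """Serialise the patterns as JSONL: one JSON object per line."""
    return "\n".join(json.dumps(p) for p in patterns) + "\n"


def matches_account(token_text):
    """Quick sanity check of the regex against a single token's text."""
    return re.search(ACCOUNT_REGEX, token_text) is not None


jsonl = patterns_to_jsonl(patterns)
# Save it with something like:
# open("account_patterns.jsonl", "w", encoding="utf8").write(jsonl)
```

Once saved, the file can be passed to the recipe via its --patterns argument, along the lines of: prodigy ner.manual your_dataset blank:en ./your_data.jsonl --label ACCOUNT_NUMBER --patterns ./account_patterns.jsonl (the dataset and file names here are placeholders).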
Question 2
Is there a way to export Prodigy annotations into a Huggingface transformer training format?
The tricky thing here is that Huggingface might be using a different tokeniser. That means that, theoretically, annotations you make on top of a spaCy tokeniser might not line up with the tokens that the Huggingface model sees.
That said, if you want to use transformers directly inside of spaCy then you shouldn't have any issue using the annotations from Prodigy. This forum reply explains the process in more detail:
You only need to be careful when you want to train Huggingface models without spaCy. Even within Huggingface, different models use different tokenisers, so you'll want to make sure that the tokens are compatible with your annotations. There might be some community translation scripts for this though.
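If you do end up converting by hand, one common approach is to go through character offsets rather than tokens. Prodigy stores each span with "start" and "end" character offsets, and Huggingface's fast tokenisers can return per-token character offsets as well (via return_offsets_mapping=True). The sketch below is not an official export script: it assumes you already have the token offsets and assigns BIO tags by checking character overlap. The whitespace_offsets helper is only a stand-in tokeniser to keep the example self-contained.

```python
import re


def spans_to_bio(token_offsets, spans):
    """Assign a BIO tag to every token based on character overlap.

    token_offsets: list of (start, end) character offsets, one per token.
    spans: list of dicts with "start", "end" and "label" keys.
    """
    tags = ["O"] * len(token_offsets)
    for span in spans:
        inside = False
        for i, (tok_start, tok_end) in enumerate(token_offsets):
            if tok_start == tok_end:
                # Special tokens often report empty offsets; leave them as "O".
                continue
            if tok_start < span["end"] and tok_end > span["start"]:
                tags[i] = ("I-" if inside else "B-") + span["label"]
                inside = True
    return tags


def whitespace_offsets(text):
    """Stand-in tokeniser: character offsets of whitespace-separated chunks."""
    return [(m.start(), m.end()) for m in re.finditer(r"\S+", text)]


text = "Please charge account 12345678 today"
# Hypothetical annotation with a made-up label.
spans = [{"start": 22, "end": 30, "label": "ACCOUNT_NUMBER"}]
tags = spans_to_bio(whitespace_offsets(text), spans)
```

Because the tags are derived from overlap, a tokeniser that splits "12345678" into several subword pieces simply yields one B- tag followed by I- tags. Do check the cases where a token only partially overlaps a span, since those are the ones where the two tokenisers genuinely disagree.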
Question 3
I think spacy has its own vocabulary ( I could be wrong).
spaCy doesn't have a fixed vocabulary like you might be used to from libraries like scikit-learn. It uses some clever hashing tricks to represent tokens, and it does so in such a way that it can also deal with tokens it has never seen before. It does come with a vector table with word embeddings, and while this can be interpreted as a vocabulary, the hashing tricks take care of any new words that show up at runtime.
This blogpost explains the hashing trick in more detail if you're interested.
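As a toy illustration of the general idea (not spaCy's actual implementation, which uses its own hash function and layers), the sketch below maps any string to a few rows of a fixed-size table and sums them. The table size, the number of seeds and the use of md5 are all assumptions picked to keep the example small and deterministic.

```python
import hashlib
import random

N_ROWS = 1000  # fixed table size, independent of how many words exist
N_SEEDS = 4    # each token is hashed several times to soften collisions
DIM = 8        # width of each embedding row

# Randomly initialised table; in a real model these rows would be learned.
_rng = random.Random(0)
TABLE = [[_rng.uniform(-1, 1) for _ in range(DIM)] for _ in range(N_ROWS)]


def hash_rows(token):
    """Map a token to N_SEEDS row indices; works for any string."""
    rows = []
    for seed in range(N_SEEDS):
        digest = hashlib.md5(f"{seed}:{token}".encode("utf8")).digest()
        rows.append(int.from_bytes(digest[:8], "little") % N_ROWS)
    return rows


def embed(token):
    """Sum the selected rows to get the token's vector."""
    vector = [0.0] * DIM
    for row in hash_rows(token):
        for j in range(DIM):
            vector[j] += TABLE[row][j]
    return vector


# A made-up word still gets a vector, no vocabulary lookup required.
unseen_vector = embed("zyxglorp")
```

The point is that there is no lookup that can fail: a brand new token always lands on some rows of the table, which is why no explicit vocabulary is needed.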
This Twitter thread explains the "Vocab" in a bit more detail as well: