Entity Linking Epoch and resources required

Mike · August 17, 2021, 10:28am

Hi,

I'm currently gathering data for work on a named entity linking application for my course project.

I've collected entities for ~70k people from Wikidata and downloaded the summaries from their Wikipedia pages. I also started to collect paragraphs where their names were mentioned in pages that linked to their main Wikipedia page, but I've given up on that for now as the data will be too big for prototyping. I'm going to script out the matches for training via Prodigy.

I plan on taking the trained model and then running it against a corpus, where I hope to take the matches (Wikidata IDs) and put them in Prodigy so users can tag the entities, and then retrain the model once there is enough extra training data.

I'm currently in the data collection phase. What I would like to know, given ~70k entities:

What is the best train/test split to use?
How many epochs should I be looking at?
How long will training take?
Can the model output an accuracy for its predictions when running against unseen data?

I appreciate that there isn't a definite answer to some of those questions. I'm just looking to get an idea of what resources I need for this project. I have a laptop with 16GB of RAM and an i5, I can use Google Colab, and my work machine has 32GB of RAM and an i7.

Thanks

ines · August 17, 2021, 11:44pm

Hi! You're right that it's difficult to give definitive answers here, because it all depends on many factors, including the data, the type of model you're training, the end goal of your application, and so on.

I recently posted this on a different thread and it might function as a good rule of thumb:

Help with textcat workflow

This is definitely a good question and it really depends on the data and the label distribution – if you have lots of labels, including some that are rare, you usually want a larger evaluation set to make sure you have all labels covered. If the set is too small, your results will also become harder to interpret: if you're only evaluating on a small number of examples, even one or two individual predictions can easily account for a few percent of difference in accuracy.
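To make the point about small evaluation sets concrete, here's a quick back-of-the-envelope sketch. The function name and the set sizes are just made up for illustration: it shows how many percentage points a single flipped prediction is worth at different evaluation set sizes.

```python
def accuracy_step(eval_size: int) -> float:
    """Percentage points of accuracy that one single prediction is worth."""
    if eval_size <= 0:
        raise ValueError("eval_size must be positive")
    return 100.0 / eval_size


# Illustrative evaluation set sizes (not taken from any real project).
for n in (50, 300, 1000):
    print(f"{n} eval examples: one prediction = {accuracy_step(n):.2f} points")
```

With only 50 examples, two predictions going the other way already shift the score by 4 points, which can easily be mistaken for a real improvement.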

In the beginning, aiming for an evaluation set of about the same size as your training set might be a good approach. So you could train on 300 examples and evaluate on 300. Once you're satisfied with your evaluation set, you can then keep it stable and train on 800, 1000, 1200 examples using the same 300 evaluation examples.
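A minimal sketch of that idea in plain Python, assuming your examples are just a list: the helper name `split_stable_eval`, the seed and the placeholder examples are hypothetical. The evaluation set is carved out once and never changes, while the training set grows.

```python
import random


def split_stable_eval(examples, eval_size, seed=0):
    """Shuffle once with a fixed seed and hold out a fixed evaluation set."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    # The first `eval_size` items are held out, the rest is the training pool.
    return shuffled[eval_size:], shuffled[:eval_size]


# Placeholder data standing in for annotated examples.
examples = [f"example-{i}" for i in range(1500)]
train_pool, eval_set = split_stable_eval(examples, eval_size=300)

for train_size in (300, 800, 1000, 1200):
    train_set = train_pool[:train_size]
    # The evaluation set must never overlap with the training data.
    assert not set(train_set) & set(eval_set)
    print(f"train on {len(train_set)}, evaluate on {len(eval_set)}")
```

Because the seed is fixed, re-running the split gives you the same evaluation examples, so results stay comparable between runs.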

If you're running spaCy's train (or Prodigy's train, which just calls into spaCy under the hood), the training will run until the accuracy stops improving. So spaCy is able to take care of this decision for you and you can just run the training until it stops.
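The general idea of "train until the accuracy stops improving" can be sketched like this. To be clear, this is a toy illustration of patience-based early stopping and not spaCy's actual implementation: the function name, the `patience` value and the scores are all made up.

```python
def train_until_plateau(scores, patience=3):
    """Stop once `patience` evaluations in a row fail to beat the best score.

    `scores` stands in for the evaluation score after each training round.
    Returns (best_score, rounds_run).
    """
    best = float("-inf")
    since_best = 0
    rounds = 0
    for score in scores:
        rounds += 1
        if score > best:
            best = score
            since_best = 0
        else:
            since_best += 1
            if since_best >= patience:
                break
    return best, rounds


# Made-up evaluation scores, one per round.
scores = [0.52, 0.61, 0.66, 0.65, 0.66, 0.64, 0.63, 0.70]
best, rounds = train_until_plateau(scores, patience=3)
print(f"stopped after {rounds} rounds, best score {best}")
```

The trade-off is visible in the toy data: a low patience stops early and saves time, but can miss a later improvement.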

This depends on your data, the type of model and your machine. spaCy's CNN models should train fine on CPU on your local machine, and you'd be looking at anything from a couple of minutes to maybe an hour if you're training a component with a small-ish dataset (based on my experience with an i5 and 16GB RAM).

During your development and prototype phase, the most important signal you're looking for is "is my model learning something?" (and not squeezing out a final percent of accuracy, which may not even necessarily generalise much better). You'll typically be able to get a sense for this pretty quickly, without having to train for too long.

If you want to train a transformer-based pipeline later on, you should train it on GPU, but I wouldn't worry about it at this point. Initialising the model with better embeddings might give you a boost in accuracy of maybe a few percent, but the data is what's most important: if your model isn't learning well because your dataset is too small, imbalanced or inconsistent, using better embeddings won't magically fix that and you're always better off working on your data and making it better. If your results on your local CPU look great, then you might be able to get even better results using transformer embeddings or similar.

If you want to calculate the accuracy on unseen data, you'll need to compare its predictions with the correct answers you know, e.g. new unseen evaluation examples you've annotated with Prodigy. You can use spacy evaluate if you just want to output the accuracy of an already trained model on a given evaluation set.

That said, you typically want to use one dedicated and stable evaluation set that's representative of the data the model will see at runtime, so you can properly compare the results between training runs. If your evaluation set changes, the results won't necessarily tell you anything meaningful.
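Comparing predictions with the correct answers, as described above, boils down to something like the following sketch. It assumes you already have one predicted ID and one annotated ID per mention; the function name and the IDs are illustrative placeholders, not real model output.

```python
def linking_accuracy(predicted, gold):
    """Fraction of mentions where the predicted ID matches the annotated one."""
    if len(predicted) != len(gold):
        raise ValueError("predicted and gold must have the same length")
    if not gold:
        return 0.0
    correct = sum(p == g for p, g in zip(predicted, gold))
    return correct / len(gold)


# Placeholder IDs: `gold` stands for your annotations on unseen examples,
# `predicted` for what the model returned on the same mentions.
gold = ["Q42", "Q937", "Q7251", "Q1339"]
predicted = ["Q42", "Q937", "Q5", "Q1339"]
print(f"accuracy: {linking_accuracy(predicted, gold):.2f}")
```

As long as `gold` always comes from the same evaluation set, the number stays comparable from one training run to the next.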

Mike · August 18, 2021, 1:33pm

Thank you @ines
