Entity linking: epochs and resources required

Hi! You're right that it's difficult to give definitive answers here, because it all depends on many factors: the data, the type of model you're training, the end goal of your application, and so on.

I recently posted this on a different thread and it might serve as a good rule of thumb:

If you're running spaCy's `train` command (or Prodigy's `train`, which calls into spaCy under the hood), training will run until the accuracy on the development set stops improving. So spaCy can make this decision for you, and you can just let the training run until it stops :slight_smile:
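For spaCy v3, a run might look like the sketch below; the paths and the config file are placeholders, and the stopping behaviour is controlled by the `patience` setting in the `[training]` block of your config:

```
# A sketch, assuming spaCy v3 and a config.cfg created with `spacy init config`;
# all paths are placeholders.
python -m spacy train config.cfg --output ./output \
    --paths.train ./train.spacy --paths.dev ./dev.spacy
```

In the config, `patience` is the number of steps without improvement before training stops early, while `max_steps` and `max_epochs` set hard upper bounds (0 means no limit).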

This depends on your data, the type of model and your machine. spaCy's CNN models should train fine on CPU on your local machine: for a component with a small-ish dataset, you'd be looking at anywhere from a couple of minutes to maybe an hour (based on my experience with an i5 and 16GB RAM).

During the development and prototyping phase, the most important signal you're looking for is: "Is my model learning something?" (rather than squeezing out a final percent of accuracy, which won't necessarily generalise better anyway). You'll typically get a sense of this pretty quickly, without having to train for very long.
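If you're using Prodigy, one quick way to get that signal is the `train-curve` recipe, which trains on increasing portions of your data: if accuracy keeps going up as more data is added, the model is learning and more annotations will likely help. A sketch, where `my_dataset` is a placeholder name:

```
# A sketch: train on increasing fractions of the dataset (e.g. 25%, 50%, 75%, 100%)
# and compare the scores to see whether more data keeps helping.
prodigy train-curve --ner my_dataset --show-plot
```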

If you want to train a transformer-based pipeline later on, you should train it on GPU, but I wouldn't worry about that at this point. Initialising the model with better embeddings might buy you a few percent of accuracy, but the data is what matters most: if your model isn't learning well because your dataset is too small, imbalanced or inconsistent, better embeddings won't magically fix that, and you're always better off improving your data. If your results on your local CPU look promising, you can then try to push them further with transformer embeddings or similar.
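If you do get to that point, the switch is mostly a config change plus a GPU flag. A sketch, assuming spaCy v3 with `spacy-transformers` installed and a CUDA-capable GPU (the pipeline name and paths are placeholders):

```
# A sketch: generate a transformer-based config, then train on GPU 0.
python -m spacy init config config.cfg --lang en --pipeline ner --gpu
python -m spacy train config.cfg --output ./output --gpu-id 0 \
    --paths.train ./train.spacy --paths.dev ./dev.spacy
```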

If you want to calculate the accuracy on unseen data, you'll need to compare the model's predictions against correct answers you already know, e.g. new, unseen evaluation examples you've annotated with Prodigy. If you just want to output the accuracy of an already trained model on a given evaluation set, you can use `spacy evaluate`.
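For example, if you've exported your evaluation examples to a `.spacy` file, something like this works (paths are placeholders):

```
# A sketch: score a trained pipeline on a held-out evaluation file
# and write the metrics to a JSON file.
python -m spacy evaluate ./output/model-best ./dev.spacy --output metrics.json
```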

That said, you typically want one dedicated, stable evaluation set that's representative of the data the model will see at runtime, so you can properly compare results between training runs. If your evaluation data changes between runs, the results won't tell you anything meaningful.
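One way to get such a stable set with Prodigy is to export your annotations once with a fixed split and reuse the same files for every run. A sketch, assuming Prodigy v1.11+ and a placeholder dataset name:

```
# A sketch: export annotations to a spaCy corpus with a fixed train/dev split,
# then point every training run at the same ./corpus train and dev files.
prodigy data-to-spacy ./corpus --ner my_dataset --eval-split 0.2
```

Even better is to annotate a dedicated evaluation set and pass it explicitly, e.g. `--ner my_dataset,eval:my_eval_dataset`, so the split never changes between runs.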
