I'm currently gathering data for work on a named entity linking application for my course project.
I've collected entities for ~70k people from Wikidata and downloaded the summaries from their Wikipedia pages. I also started to collect paragraphs where their names were mentioned in pages that linked to their main Wikipedia page, but I've given up on that for now as the data will be too big for prototyping. I'm going to script out the matches for training via Prodigy.
I plan on taking the trained model and then running it against a corpus, where I hope to take the matches (Wikidata IDs) and put them in Prodigy to have users tag the entities and have a model run when there is enough extra training data.
I'm currently in the data collection phase. What I would like to know is, given ~70k entities:
What is the best train/test split to use?
How many epochs should I be looking at?
How long will training take?
Can the model output an accuracy for its predictions when running against unseen data?
I appreciate that there isn't a definite answer to some of those questions; I'm just looking to get an idea of what resources I need for this project. I have a laptop with 16GB of RAM and an i5, I can use Google Colab, and my work machine has 32GB of RAM and an i7.
Hi! You're right that it's difficult to give definitive answers here, because it all depends on many factors, including the data, the type of model you're training, the end goal of your application, and so on.
I recently posted this on a different thread and it might function as a good rule of thumb:
If you're running spaCy's train (or Prodigy's train, which just calls into spaCy under the hood), the training will run until the accuracy stops improving. So spaCy is able to take care of this decision for you and you can just run the training until it stops.
This depends on your data, the type of model and your machine. spaCy's CNN models should train fine on CPU on your local machine, and you'd be looking at anything from a couple of minutes to maybe an hour if you're training a component with a small-ish dataset (based on my experience with an i5 and 16GB RAM).
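To make "runs until the accuracy stops improving" a bit more concrete, here is a minimal sketch of patience-based early stopping in plain Python. This is only an illustration of the general idea, not spaCy's actual training loop; the function name `stop_with_patience`, the `patience` value and the scores are all made up for the example.

```python
def stop_with_patience(scores, patience=3):
    """Return (best_step, best_score) for a sequence of evaluation scores.

    Stops as soon as `patience` evaluations have passed without beating
    the best score seen so far. Hypothetical helper for illustration only.
    """
    best_step, best_score = -1, float("-inf")
    for step, score in enumerate(scores):
        if score > best_score:
            # New best: remember it and keep going.
            best_step, best_score = step, score
        elif step - best_step >= patience:
            # No improvement for `patience` evaluations: stop.
            break
    return best_step, best_score


# Made-up evaluation scores, one per evaluation step.
scores = [0.50, 0.62, 0.70, 0.69, 0.70, 0.68, 0.90]
print(stop_with_patience(scores, patience=3))  # -> (2, 0.7)
```

Note that the last score (0.90) is never looked at, because the loop gives up before reaching it. That is the trade-off with any stopping rule like this: a larger patience trains for longer but is less likely to stop too early.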
During your development and prototype phase, the most important signal you're looking for is "is my model learning something"? (and not squeezing out a final percent of accuracy, which may not even necessarily generalise much better). You'll typically be able to get a sense for this pretty quickly, without having to train for too long.
If you want to train a transformer-based pipeline later on, you should train it on GPU, but I wouldn't worry about it at this point. Initialising the model with better embeddings might give you a boost in accuracy of maybe a few percent, but the data is what's most important: if your model isn't learning well because your dataset is too small, imbalanced or inconsistent, using better embeddings won't magically fix that and you're always better off working on your data and making it better. If your results on your local CPU look great, then you might be able to get even better results using transformer embeddings or similar.
If you want to calculate the accuracy on unseen data, you'll need to compare its predictions with the correct answers you know, e.g. new unseen evaluation examples you've annotated with Prodigy. You can use spacy evaluate if you just want to output the accuracy of an already trained model on a given evaluation set.
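As a rough sketch of what "comparing predictions with the correct answers" means for entity linking, the snippet below computes plain accuracy over a list of predicted and gold IDs. The `accuracy` function and the example IDs are hypothetical placeholders, and this is not what `spacy evaluate` does internally; it only illustrates the comparison.

```python
def accuracy(predicted, gold):
    """Fraction of predictions that exactly match the gold answers.

    Both arguments are equal-length lists of IDs, one per mention.
    Hypothetical helper for illustration only.
    """
    if len(predicted) != len(gold):
        raise ValueError("predicted and gold must have the same length")
    if not gold:
        return 0.0  # nothing to evaluate
    correct = sum(p == g for p, g in zip(predicted, gold))
    return correct / len(gold)


# Placeholder IDs: what the model predicted vs. what annotators marked as correct.
predicted = ["Q42", "Q1", "Q3"]
gold = ["Q42", "Q2", "Q3"]
print(accuracy(predicted, gold))
```

The important part is where `gold` comes from: it has to be annotated examples the model never saw during training, otherwise the number will look better than it really is.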
That said, you typically want to use one dedicated and stable evaluation set that's representative of the data the model will see at runtime, so you can properly compare the results between training runs. If your evaluation set changes between runs, the results won't necessarily tell you anything meaningful.
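One possible way to keep the evaluation set stable while you keep collecting data is to assign each example to a split based on a hash of its ID, so the same example always lands in the same split. This is just a sketch: the `assign_split` function is hypothetical, the 20% `eval_fraction` is an assumption for the example rather than a recommendation, and you may want to key on something other than the entity ID depending on how your examples are structured.

```python
import hashlib


def assign_split(example_id, eval_fraction=0.2):
    """Deterministically map an ID to "train" or "eval".

    The same ID always gets the same split, regardless of how much
    other data is added later. Hypothetical helper for illustration only.
    """
    digest = hashlib.sha256(example_id.encode("utf-8")).hexdigest()
    # Turn the first 8 hex digits into a number in [0, 1).
    bucket = int(digest[:8], 16) / 0x100000000
    return "eval" if bucket < eval_fraction else "train"


# Placeholder IDs standing in for collected examples.
ids = [f"Q{i}" for i in range(1, 11)]
for example_id in ids:
    print(example_id, assign_split(example_id))
```

Because the assignment only depends on the ID, adding new examples later never moves an existing example from the evaluation set into the training set or the other way round.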