How about running spacy evaluate on BioBERT's test set, obtaining the precision/recall/F-score, and comparing them with the results reported in the paper? It could also be done the other way around: run the trained model on a test set, collect the predictions, and score those predictions with the seqeval repo.
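To make the second route concrete, here is a minimal standard-library sketch of entity-level scoring over BIO tags, which is the kind of metric seqeval reports. It is not seqeval itself: the helper names (`extract_spans`, `score`), the labels, and the tiny gold/predicted sequences are all made up for illustration, and stray `I-` tags are simply ignored here, which may differ from seqeval's default handling. In practice you would pass the same list-of-lists to `precision_score`, `recall_score` and `f1_score` from `seqeval.metrics`.

```python
from typing import List, Set, Tuple

# (entity type, start index, end index exclusive); hypothetical helper type
Span = Tuple[str, int, int]


def extract_spans(tags: List[str]) -> Set[Span]:
    """Collect entity spans from one BIO-tagged sentence."""
    spans: Set[Span] = set()
    start, label = None, None
    # The trailing "O" is a sentinel that closes a span ending at the last token.
    for i, tag in enumerate(tags + ["O"]):
        continues = tag.startswith("I-") and tag[2:] == label
        if start is not None and not continues:
            spans.add((label, start, i))
            start, label = None, None
        if tag.startswith("B-"):
            start, label = i, tag[2:]
        # Assumption: an I- tag with no matching open span is ignored.
    return spans


def score(gold: List[List[str]], pred: List[List[str]]) -> Tuple[float, float, float]:
    """Micro-averaged precision, recall and F1 over exact span matches."""
    gold_spans, pred_spans = set(), set()
    for idx, (g, p) in enumerate(zip(gold, pred)):
        # Prefix each span with the sentence index so sentences stay separate.
        gold_spans |= {(idx, *s) for s in extract_spans(g)}
        pred_spans |= {(idx, *s) for s in extract_spans(p)}
    tp = len(gold_spans & pred_spans)
    precision = tp / len(pred_spans) if pred_spans else 0.0
    recall = tp / len(gold_spans) if gold_spans else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Illustrative data only; real tags would come from the test set and the model.
gold = [["B-Disease", "I-Disease", "O", "B-Chemical"], ["O", "B-Chemical", "O"]]
pred = [["B-Disease", "I-Disease", "O", "O"], ["O", "B-Chemical", "I-Chemical"]]

p, r, f = score(gold, pred)
print(f"P={p:.3f} R={r:.3f} F1={f:.3f}")  # P=0.500 R=0.333 F1=0.400
```

Only exact matches on both type and boundaries count as correct, so the over-long Chemical span in the second sentence is scored as an error. Check that the paper's numbers use the same entity-level convention before comparing.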
How can I import my annotated databases into Hugging Face datasets?