How do I get the model's predictions on the evaluation dataset, so I can find the failing examples, add more similar data, and retrain the model?
Hi! You should be able to extract that by just running your model over the data: load the model you trained (./parser_model) in spaCy in a script, notebook, etc., run your texts from the dataset through it, and get the predicted dependency labels and heads. Then compare those to the labels and heads in the data. If they're different, the model's prediction was wrong, and you can output the example and see if you can spot patterns, like certain text types or constructions.
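For example, here's a minimal sketch of that comparison. It assumes your evaluation data is stored as (text, annotations) pairs with "heads" as absolute token indices and "deps" as dependency labels, and that the model's tokenization matches the gold annotations — adjust to however your data is actually stored:

```python
import spacy

nlp = spacy.load("./parser_model")   # the model you trained

# Assumed format: (text, {"heads": [...], "deps": [...]}) pairs
eval_data = [
    ("She ate the pizza", {"heads": [1, 1, 3, 1], "deps": ["nsubj", "ROOT", "det", "dobj"]}),
    # ... the rest of your evaluation examples
]

for text, gold in eval_data:
    doc = nlp(text)
    # zip assumes the model produces the same tokens as the gold annotations
    for token, gold_head, gold_dep in zip(doc, gold["heads"], gold["deps"]):
        if token.head.i != gold_head or token.dep_ != gold_dep:
            print(f"Mismatch in: {text!r}")
            print(f"  {token.text!r}: predicted {token.dep_} -> {token.head.text!r}, "
                  f"expected {gold_dep} -> {doc[gold_head].text!r}")
```

Collecting those mismatched examples in a list (instead of just printing them) makes it easier to count how often each construction or text type shows up.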
Yes, that's correct. When you re-train, though, you should start from the base model, en_core_web_lg, with all of your annotations, and train from scratch. That's cleaner than training on top of the already-trained artifact.
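As a rough sketch, a re-training loop with the spaCy v2-style API could look like the following. The data format is just assumed to match the example above, and whether you keep the pretrained parser weights as your starting point (nlp.resume_training()) or re-initialize them (nlp.begin_training()) is up to you:

```python
import random
import spacy
from spacy.util import minibatch

# All of your annotations, including the new examples for the failure cases
train_data = [
    ("She ate the pizza", {"heads": [1, 1, 3, 1], "deps": ["nsubj", "ROOT", "det", "dobj"]}),
    # ...
]

nlp = spacy.load("en_core_web_lg")   # go back to the base model each time
optimizer = nlp.resume_training()    # keep the pretrained weights as the starting point

# Only update the parser; leave the other pipeline components alone
other_pipes = [p for p in nlp.pipe_names if p != "parser"]
with nlp.disable_pipes(*other_pipes):
    for i in range(20):
        random.shuffle(train_data)
        losses = {}
        for batch in minibatch(train_data, size=8):
            texts, annotations = zip(*batch)
            nlp.update(texts, annotations, sgd=optimizer, losses=losses)
        print(i, losses)

nlp.to_disk("./parser_model")
```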
If you're not doing this already, it might also be a good idea to use a dedicated, separate evaluation set now (instead of just holding back a random 20% of the training examples each run). If you're always evaluating on the same examples, you'll actually be able to properly compare the results between runs and get a better idea of whether your model is improving. (Just make sure you're not accidentally adding any of the mis-predicted evaluation examples to the training data.)
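One low-tech way to do that is to make the split once with a fixed seed, save the evaluation set to disk and reuse that same file for every run. A sketch, assuming the same (text, annotations) format as above and made-up file names:

```python
import json
import random

examples = [
    ("She ate the pizza", {"heads": [1, 1, 3, 1], "deps": ["nsubj", "ROOT", "det", "dobj"]}),
    # ... all of your annotated examples
]

random.seed(0)               # fixed seed so the split is reproducible
random.shuffle(examples)
split = int(len(examples) * 0.8)
train_data, dev_data = examples[:split], examples[split:]

# Save both sets once and always evaluate on the same dev_data.json,
# so scores stay comparable between runs.
with open("train_data.json", "w", encoding="utf8") as f:
    json.dump(train_data, f)
with open("dev_data.json", "w", encoding="utf8") as f:
    json.dump(dev_data, f)

# Quick sanity check that no evaluation text sneaks into the training data
assert not {text for text, _ in train_data} & {text for text, _ in dev_data}
```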