Getting the list of mispredictions on the evaluation dataset

Hi there,

I am trying to fine-tune the dependency parser for 'noisy user-generated' data. I labeled ~500 sentences/fragments with:

!python -m prodigy dep.correct train_dependency en_core_web_lg ./train_data_dependency.txt

Then I retrain en_core_web_lg as follows:

!python -m prodigy train parser new_set45 en_core_web_lg --output ./parser_model --eval-split 0.2 -n 10

How do I get the model's predictions on the evaluation dataset, so that I can add more data similar to the examples that fail there and retrain the model?

I am using Prodigy 1.10.4.


Hi! You should be able to extract that by just running your model over the data: load the model you trained (./parser_model) with spaCy in a script or notebook, load the texts from your dataset and get the predicted dependency labels and heads. Then compare those to the labels and heads in the annotated data. If they differ, the model's prediction was wrong, so you can output the example and see if you can spot patterns, like certain text types or constructions.
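
A minimal sketch of that comparison, in case it helps. It assumes you have exported the annotations to a JSONL file first (for example with `prodigy db-out`), and that each task stores its arcs under a "relations" key with "head", "child" and "label" token indices. That key layout is an assumption about what dep.correct saves, so check one exported line and adjust `gold_arcs` if yours looks different. The function names and the file path in the usage note are made up for illustration.

```python
import json


def gold_arcs(example):
    """Map child token index -> (head index, label) from one annotated task.

    Assumption: arcs are stored under "relations" with "head", "child"
    and "label" keys. Adjust this if your exported JSONL differs.
    """
    return {
        rel["child"]: (rel["head"], rel["label"])
        for rel in example.get("relations", [])
    }


def predicted_arcs(doc):
    """Map token index -> (head index, label) from a parsed spaCy Doc."""
    return {token.i: (token.head.i, token.dep_) for token in doc}


def find_mismatches(gold, predicted):
    """Return (child, gold_arc, predicted_arc) for every annotated token
    whose predicted head or label differs from the annotation."""
    return [
        (child, arc, predicted.get(child))
        for child, arc in sorted(gold.items())
        if predicted.get(child) != arc
    ]


def report_mispredictions(model_path, jsonl_path):
    """Print every accepted example that has at least one wrong arc."""
    import spacy  # imported here so the helpers above work without spaCy

    nlp = spacy.load(model_path)
    with open(jsonl_path, encoding="utf8") as f:
        examples = [json.loads(line) for line in f if line.strip()]
    for eg in examples:
        if eg.get("answer") != "accept":
            continue  # skip rejected or ignored annotations
        doc = nlp(eg["text"])
        errors = find_mismatches(gold_arcs(eg), predicted_arcs(doc))
        if errors:
            print(eg["text"])
            for child, gold, pred in errors:
                # Guard against the model tokenizing differently from the data
                word = doc[child].text if child < len(doc) else "?"
                print(f"  {word}: gold={gold} predicted={pred}")
```

You would call it as `report_mispredictions("./parser_model", "./annotations.jsonl")`. Only annotated tokens are compared, so tokens without a gold arc are ignored. Also note that with --eval-split the held-out 20% is chosen at training time, so running over the whole dataset will include examples the model was trained on.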

Hi Ines,

Thank you very much for your response. After I find mis-parsed sentences, I add data similar to them to the training data and label it:

!python -m prodigy dep.correct train_dependency ./parser_model ./train_data_dependency.txt

and then retrain the ./parser_model as follows:

!python -m prodigy train parser new_set45 ./parser_model --output ./parser_model --eval-split 0.2 -n 10

Is that right?


Yes, that's correct, with one change: when you retrain, you should start with the base model, en_core_web_lg, and all of your annotations, and train from scratch. That's cleaner than continuing from the trained artifact.

If you're not doing this already, it might also be a good idea now to use a dedicated and separate evaluation set (instead of just holding back 20% of the training examples for evaluation). If you're always evaluating on the same examples, you'll actually be able to properly compare the results between runs and get a better idea of whether your model is improving. (Just make sure you're not accidentally adding any of the mis-predicted evaluation examples to the training data :sweat_smile:)
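
One way to set that up, sketched under a few assumptions: the examples are dicts with a "text" key (as in an exported JSONL file), and the split is decided by hashing the text, so the same sentence always lands on the same side no matter how often you rerun it. The helper names are made up for illustration. Once the two sets are saved as separate datasets, the evaluation one can be passed to the train recipe. I believe the option is --eval-id in this version, but please confirm with `prodigy train --help`.

```python
import hashlib


def text_key(example):
    """Stable identifier for an example, based on its normalised text."""
    normalised = " ".join(example["text"].split()).lower()
    return hashlib.sha1(normalised.encode("utf8")).hexdigest()


def split_once(examples, eval_fraction=0.2):
    """Assign each example to train or eval by hashing its text.

    The assignment depends only on the text, so new examples can be added
    later without moving existing ones between the two sets.
    """
    threshold = round(eval_fraction * 100)
    train, evaluation = [], []
    for eg in examples:
        bucket = int(text_key(eg), 16) % 100
        (evaluation if bucket < threshold else train).append(eg)
    return train, evaluation


def leaked_texts(train, evaluation):
    """Return training texts that also occur in the evaluation set."""
    eval_keys = {text_key(eg) for eg in evaluation}
    return [eg["text"] for eg in train if text_key(eg) in eval_keys]
```

Running `leaked_texts` before each training run is a cheap check that none of the evaluation examples have slipped into the training data. The split sizes are only approximately 80/20, since they depend on the hash values.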

Thank you very much Ines!