Based on your screenshot, it looks like you only have 10 examples in total. Is this intentional? There's usually not much to learn from 8 training examples, and not much to evaluate with 2 (if you get both wrong, the accuracy is 0). In that sense, the numbers we see during training make sense given the number of samples you have.
My suggestion is to try it out on a relatively larger sample of data. You can go little by little, starting in the order of hundreds and working up from there.
According to the logs, you still only have 24 evaluation examples, so it's difficult to draw meaningful conclusions from such a small evaluation set.
If your data requires your custom tokenizer and the model may not predict accurately without it, that could be something to look into. If the base model is only intended to provide the tokenizer and has no trained components to update, can you add the tokenizer via the `--config` instead and leave out `--base-model`?
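For example (a rough sketch, not your exact setup, and `"my_custom_tokenizer"` is just a placeholder name), you could register the tokenizer with spaCy's `tokenizers` registry so the config can reference it by name:

```python
# Sketch: register a custom tokenizer so the training config can reference it
# by name, without needing a --base-model just to carry the tokenizer.
import spacy
from spacy.tokenizer import Tokenizer

@spacy.registry.tokenizers("my_custom_tokenizer")  # placeholder registry name
def create_my_custom_tokenizer():
    def create_tokenizer(nlp):
        # Build and return your tokenizer here; this plain Tokenizer is just
        # a stand-in for your custom logic.
        return Tokenizer(nlp.vocab)
    return create_tokenizer
```

Then in the config you'd point `[nlp.tokenizer]` at it with `@tokenizers = "my_custom_tokenizer"`, and make sure the file with the registration is importable when training runs (e.g. via the `--code` argument if you're using `spacy train`).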