How to evaluate the model accuracy with test data (not part of training)

hi @e101sg,

Where are you getting the 86 number from?

Your output had these numbers:

============================== Generating data ==============================
Components: ner
Merging training and evaluation data for 1 components
  - [ner] Training: 590 | Evaluation: 84 (from datasets)
Training: 327 | Evaluation: 44
Labels: ner (7)
✔ Saved 327 training examples
spacy_model/train.spacy
✔ Saved 44 evaluation examples
spacy_model/dev.spacy

This shows an input evaluation number of 84. Where is the 86 coming from?

I suspect the missing examples are in the evaluation_dataset_gold dataset. I'm wondering if you rejected 2 of them, which would explain why it's 84 rather than 86.

Can you run this command?

python -m prodigy stats -l evaluation_dataset_gold

What do you get? It should show the number of annotations by ACCEPT, REJECT and IGNORE.
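If it's easier to poke around in a script, you can pull the same counts out of the Prodigy database directly. This is just a minimal sketch using your dataset name from above; it assumes the default database connection from your prodigy.json (and depending on your Prodigy version, db.get_dataset_examples may be the preferred call instead of db.get_dataset):

from collections import Counter

from prodigy.components.db import connect

# Connect to the Prodigy database configured in prodigy.json (SQLite by default)
db = connect()

# Load all annotations saved in the evaluation dataset
examples = db.get_dataset("evaluation_dataset_gold")

# Each saved example carries an "answer" key: "accept", "reject" or "ignore"
counts = Counter(eg.get("answer") for eg in examples)
print(f"Total: {len(examples)}")
print(f"Accept: {counts['accept']} | Reject: {counts['reject']} | Ignore: {counts['ignore']}")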

Unfortunately, I still think the drop from 84 to 44 is due to duplicates. Are you using overlapping annotations? Maybe you could try "exclude_by": "input" (see below), but I don't think that would change anything.

It may also be related to this:

For example, data-to-spacy will group all annotations with the same input hash together, so you'll get one example annotated with all categories, entities, POS tags or whatever else you labelled.

I don't see any big problems, or at least none relating to your dedupes.

This is how deduplication is done. With the default setting ("exclude_by": "task"), deduplication happens per task (an individual input plus the specific question asked about it). Alternatively, "exclude_by": "input" means deduplication is based strictly on the input. In other words, deduplication is done either by the task_hash (i.e., "exclude_by": "task") or by the input_hash (i.e., "exclude_by": "input").
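If you want to check whether the 84 -> 44 drop really is deduplication by input, you can count the unique hashes in your dataset yourself. Again, this is only a sketch assuming the default database and the dataset name from this thread; Prodigy stores _input_hash and _task_hash on every saved example:

from prodigy.components.db import connect

db = connect()
examples = db.get_dataset("evaluation_dataset_gold")

# "exclude_by": "task" dedupes on _task_hash (input + the specific question asked),
# while "exclude_by": "input" dedupes on _input_hash (the raw input text only).
# Using .get() in case any older examples were saved without hashes.
task_hashes = {eg.get("_task_hash") for eg in examples}
input_hashes = {eg.get("_input_hash") for eg in examples}

print(f"Examples: {len(examples)}")
print(f"Unique task hashes:  {len(task_hashes)}")
print(f"Unique input hashes: {len(input_hashes)}")

If the number of unique input hashes comes out around 44, that would confirm data-to-spacy is merging multiple annotations of the same input into a single evaluation example, as described in the quote above.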

Can you read through this:

Namely this post:
