hi @e101sg,
Where are you getting the 86 number from?
Your output had these numbers:
```
============================== Generating data ==============================
Components: ner
Merging training and evaluation data for 1 components
- [ner] Training: 590 | Evaluation: 84 (from datasets)
Training: 327 | Evaluation: 44
Labels: ner (7)
✔ Saved 327 training examples
spacy_model/train.spacy
✔ Saved 44 evaluation examples
spacy_model/dev.spacy
```
This shows an input evaluation count of 84. Where is the 86 coming from?
I suspect they are in the dataset `evaluation_dataset_gold`. I'm wondering if you rejected 2 of the examples, hence why it's 84 and not 86.
Can you run:

```
python -m prodigy stats -l evaluation_dataset_gold
```
What do you get? It should show the number of annotations by ACCEPT, REJECT and IGNORE.
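If you'd rather check programmatically, here's a minimal sketch, assuming you've exported the dataset to JSONL with `prodigy db-out`: each exported task carries an `"answer"` field that is `"accept"`, `"reject"` or `"ignore"`, so counting them is one pass with a `Counter`.

```python
import json
from collections import Counter

def count_answers(jsonl_lines):
    # Each line of a `prodigy db-out` export is one JSON task
    # with an "answer" field.
    return Counter(json.loads(line)["answer"] for line in jsonl_lines)

# Toy lines standing in for a real export:
lines = [
    '{"text": "a", "answer": "accept"}',
    '{"text": "b", "answer": "reject"}',
    '{"text": "c", "answer": "accept"}',
]
print(count_answers(lines))  # Counter({'accept': 2, 'reject': 1})
```

If the rejected/ignored counts add up to 2, that would explain 86 annotations becoming 84 evaluation examples.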
Unfortunately, I still think the drop from 84 to 44 is due to duplicates. Are you using overlapping annotations? You could try `"exclude_by": "input"` (see below), but I don't think that would change anything.
It may also be related to this:
> For example, `data-to-spacy` will group all annotations with the same input hash together, so you'll get one example annotated with all categories, entities, POS tags or whatever else you labelled.
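To illustrate what that grouping means, here's a toy sketch (not `data-to-spacy`'s actual implementation), assuming each annotation carries Prodigy's `_input_hash` key: annotations over the same input are merged into a single example.

```python
from collections import defaultdict

# Two annotations over the same input (same _input_hash), plus one
# over a different input: three records collapse into two examples.
annotations = [
    {"_input_hash": 1, "spans": [{"label": "PERSON"}]},
    {"_input_hash": 1, "spans": [{"label": "ORG"}]},
    {"_input_hash": 2, "spans": [{"label": "GPE"}]},
]

grouped = defaultdict(list)
for eg in annotations:
    grouped[eg["_input_hash"]].extend(eg["spans"])

print(len(grouped))  # 2 examples from 3 annotations
```

So if your 84 evaluation annotations covered overlapping inputs, a drop to 44 examples after grouping would be expected rather than a bug.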
I don't see any big problems, at least none relating to your dedupes.
This is how deduplication is done. With the default setting (`"exclude_by": "task"`), deduplication happens per task (individual input + unique task run). Alternatively, `"exclude_by": "input"` means deduplication strictly by input. In other words, deduplication is keyed either on the `task_hash` (i.e. `"exclude_by": "task"`) or on the `input_hash` (i.e. `"exclude_by": "input"`).
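Here's a hedged sketch of the difference, assuming Prodigy's `_input_hash`/`_task_hash` keys (this is not Prodigy's actual dedupe code): keying on the task hash keeps distinct questions about the same input, while keying on the input hash keeps only one example per input.

```python
# Same input text three times: two distinct tasks plus one exact repeat.
stream = [
    {"text": "Ann", "_input_hash": 10, "_task_hash": 100},
    {"text": "Ann", "_input_hash": 10, "_task_hash": 101},  # same input, new task
    {"text": "Ann", "_input_hash": 10, "_task_hash": 100},  # exact repeat
]

def dedupe(examples, exclude_by="task"):
    # Pick which hash to dedupe on, mirroring the "exclude_by" setting.
    key = "_task_hash" if exclude_by == "task" else "_input_hash"
    seen, out = set(), []
    for eg in examples:
        if eg[key] not in seen:
            seen.add(eg[key])
            out.append(eg)
    return out

print(len(dedupe(stream, "task")))   # 2: both task variants kept
print(len(dedupe(stream, "input")))  # 1: one example per input
```

That's why `"exclude_by": "input"` can only make the evaluation set smaller, never larger, which is why I don't expect it to fix your 84 -> 44 drop.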
Can you read through this thread, namely this post: