How to evaluate the model accuracy with test data (not part of training)

Dear Prodigy community,
Greetings! Beginner here... I am doing NER on text data. I have done the annotations in Prodigy and am able to train in both Prodigy and spaCy. I used the following commands.
First convert to spaCy format, then run spaCy train.

prodigy data-to-spacy ./spacy_model --ner ner_final --eval-split 0.2
python3 -m spacy train  spacy_model/config.cfg --output spacy_model/training/ --paths.train spacy_model/train.spacy --paths.dev spacy_model/dev.spacy

Fine. Model created. Then I checked model accuracy on the test data (the 20% split held out from the training data):

python3 -m spacy benchmark accuracy ./spacy_model/training/model-best spacy_model/dev.spacy --output metrics.json
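As a side note, the metrics.json written by spacy benchmark accuracy contains per-component scores such as ents_p, ents_r, and ents_f. A minimal sketch of pulling the NER scores out of it (the JSON below is an illustrative sample with made-up numbers, not real output):

```python
import json

# Illustrative sample of the structure spacy benchmark accuracy writes
# to metrics.json; the numbers here are made up.
sample = (
    '{"ents_p": 0.91, "ents_r": 0.88, "ents_f": 0.894, '
    '"ents_per_type": {"PHONE": {"p": 0.95, "r": 0.90, "f": 0.924}}}'
)

metrics = json.loads(sample)
print(f"Overall NER F-score: {metrics['ents_f']:.3f}")
for label, scores in metrics["ents_per_type"].items():
    print(f"{label}: f={scores['f']:.3f}")
```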
  1. Now, I have created another annotated dataset evaluation_dataset_gold in the database. I want to evaluate the spaCy model against this dataset, without any split. How do I do this?

Can the convert command be used? Any thoughts would be highly useful. :innocent: Thanks a lot.

Cheers!
Chandra /101sg

hi @e101sg!

Thanks for your message.

If I understand correctly, you're just looking to use your evaluation_dataset_gold as your evaluation dataset, right?

If so, you can use the eval: prefix for datasets in your data-to-spacy command instead of using --eval-split, like:

prodigy data-to-spacy ./spacy_model --ner ner_final,eval:evaluation_dataset_gold

When you do this, your spacy_model/dev.spacy will just be your evaluation_dataset_gold rather than a random 20% partition of ner_final. Then you can use the exact same spacy train or spacy benchmark commands as before, since dev.spacy is now your dedicated held-out dataset instead of a random partition of your training data.

This is mentioned a bit in the docs:

For each component, you can provide optional datasets for evaluation using the eval: prefix, e.g. --ner dataset,eval:eval_dataset. If no evaluation sets are specified, the --eval-split is used to determine the percentage held back for evaluation.

Does this answer your question?


Dear Ryan,

============================== Generating data ==============================
Components: ner
Merging training and evaluation data for 1 components
  - [ner] Training: 590 | Evaluation: 84 (from datasets)
Training: 327 | Evaluation: 44
Labels: ner (7)
βœ” Saved 327 training examples
spacy_model/train.spacy
βœ” Saved 44 evaluation examples
spacy_model/dev.spacy

If I understand correctly, you're just looking to use your evaluation_dataset_gold as your evaluation dataset, right?
yes, correct

As shown in the screenshot, data-to-spacy still creates a ./train and ./dev split, though it states Components: ner Merging training and evaluation data for 1 components.

Not sure what the issue is. :thinking: Do I need to change anything in the config settings?
Any thoughts...
Thanks!
Chandra

Thanks for the update!

So I'm not sure what the problem is.

I know you mentioned you received this warning:

Components: ner Merging training and evaluation data for 1 components

But if your data only has ner annotations, this is what I'd expect.

Did you intend for the data in your datasets to have other annotations for a different component?

Sorry, but can you clarify what the problem is?

I noticed your previous thread, which I glanced through but didn't have a chance to read extensively. So any additional background would be greatly appreciated.


Dear Prodigy community and Ryan,

My earlier post was a bit unclear and confusing. When I looked at it carefully again, ./dev.spacy does have the evaluation data without a split. Here is the output:

============================== Generating data ==============================
Components: ner
Merging training and evaluation data for 1 components
  - [ner] Training: 590 | Evaluation: 84 (from datasets)
Training: 327 | Evaluation: 44
Labels: ner (7)
βœ” Saved 327 training examples
spacy_model/train.spacy
βœ” Saved 44 evaluation examples
spacy_model/dev.spacy

============================= Generating config =============================
β„Ή Auto-generating config with spaCy

Here are my observations and some doubts.

  1. My evaluation dataset evaluation_dataset_gold has 86 text chunks in the JSONL file, but here 84 are taken (maybe duplicates). Fine. Why is it further reduced to 44 instead of 84?

  2. I am able to run the spacy benchmark accuracy command with --displacy-limit 86; the output (NER detection) rendered as an HTML file has only 44 lines, as expected. :blush:
    Any thoughts?!

With thanks & regards,
101sg

I suspect duplicates. Prodigy deduplicates data by default. Can you examine the annotations (e.g., run db-out to export the dataset to .jsonl) and see if you find any duplicates? They would be deduped (by default) by the input text, so you would likely have examples with duplicated _input_hash values.
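To check this yourself, here is a minimal sketch of scanning a db-out export for repeated _input_hash values (the lines and hash values below are made up for illustration):

```python
import json
from collections import Counter

# A few illustrative lines as they might appear in a db-out .jsonl
# export; the hashes are made up. Two lines share an _input_hash.
lines = [
    '{"text": "my bank account: 344-33165-7", "_input_hash": 111, "_task_hash": 201}',
    '{"text": "mail me at a@b.com", "_input_hash": 222, "_task_hash": 202}',
    '{"text": "my bank account: 344-33165-7", "_input_hash": 111, "_task_hash": 203}',
]

# Count how often each input hash appears; counts > 1 mean the same
# input text was annotated more than once.
hashes = Counter(json.loads(line)["_input_hash"] for line in lines)
duplicates = {h: n for h, n in hashes.items() if n > 1}
print("Duplicated _input_hash values:", duplicates)
```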

Dear Ryan and Prodigy community,

  1. I am sure there are no duplicates among the 86 entries. Also, I have examined the annotations by exporting the dataset to a .jsonl file with db-out. I can see the same number (86) of _input_hash and _task_hash values. Is that normal?

  2. I am creating the test dataset manually with the 7 target labels (Personally Identifiable Information such as Phone, Email, Passport Number, Bank Account, National ID Number, Car Plate Number, etc.) plus some text chunks without labels. For example:

{"text":"Can you transfer the funds to my bank account? Here are my bank account details: 344-33165-7"}
{"text":" my bank account details: 344-33111-9 If you need further clarification mail me"}

The text chunks are usually short, but there are long sentences as well. Is that okay? Any suggestions?
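Incidentally, hand-written JSONL files are easy to break (a stray backtick or quote makes a line unparseable). A small sketch for sanity-checking each line before importing the file, using made-up sample lines (the second is deliberately malformed):

```python
import json

# Two illustrative lines of a hand-written JSONL test set; the second
# is deliberately malformed (trailing backtick after the closing brace).
raw_lines = [
    '{"text": "Here are my bank account details: 344-33165-7"}',
    '{"text": "mail me for clarification"}`',
]

bad = []
for lineno, line in enumerate(raw_lines, start=1):
    try:
        record = json.loads(line)
        assert "text" in record  # every annotation task needs a "text" field
    except (ValueError, AssertionError):
        bad.append(lineno)

print("Malformed lines:", bad)
```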

  3. Also, the documentation mentions exclude_by. I did not understand what it means, and I am not sure whether it is useful for my ner.manual task.

As of v1.9, recipes can return an optional "exclude_by" setting in their "config" to specify whether
to exclude by "input" or "task" (default). Filtering and excluding by input hash is especially useful
for manual and semi-manual workflows like ner.manual and ner.correct.

Any thoughts welcome. Thanks a lot.

Cheers~!
101sg

hi @e101sg,

Where are you getting the 86 number from?

Your output had these numbers:

============================== Generating data ==============================
Components: ner
Merging training and evaluation data for 1 components
  - [ner] Training: 590 | Evaluation: 84 (from datasets)
Training: 327 | Evaluation: 44
Labels: ner (7)
βœ” Saved 327 training examples
spacy_model/train.spacy
βœ” Saved 44 evaluation examples
spacy_model/dev.spacy

This shows an input evaluation count of 84. Where is the 86 coming from?

I suspect they are in the dataset evaluation_dataset_gold. I'm wondering if you rejected 2 of the examples, hence 84 rather than 86.

Can you run:

python -m prodigy stats -l evaluation_dataset_gold

What do you get? It should show the number of annotations by ACCEPT, REJECT and IGNORE.
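If you'd rather tally the answers directly from a db-out export, a quick sketch of counting the "answer" field (the records below are illustrative; real exports carry many more fields):

```python
import json
from collections import Counter

# Illustrative db-out records; real exports include hashes, spans, etc.
lines = [
    '{"text": "a", "answer": "accept"}',
    '{"text": "b", "answer": "reject"}',
    '{"text": "c", "answer": "accept"}',
    '{"text": "d", "answer": "ignore"}',
]

# Tally annotations by their answer, mirroring what prodigy stats reports.
counts = Counter(json.loads(line)["answer"] for line in lines)
print(counts)
```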

Unfortunately, I still think the 84 -> 44 drop is due to duplicates. Are you using overlapping annotations? Maybe you could try "exclude_by": "input" (see below), but I don't think that would do anything.

It may also be related to this:

For example, data-to-spacy will group all annotations with the same input hash together, so you'll get one example annotated with all categories, entities, POS tags or whatever else you labelled.

I don't see any big problems, at least relating to your duplicates.

This is how deduplication is done. With the default ("exclude_by": "task"), deduplication occurs per task (individual input + unique task run). Alternatively, "exclude_by": "input" means deduplication is based strictly on the input. In other words, deduplication is done either by the _task_hash (i.e., "exclude_by": "task") or by the _input_hash (i.e., "exclude_by": "input").
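To make the difference concrete, here is a sketch of deduplicating the same records by _task_hash versus _input_hash (hash values are made up; the first two records are two annotation passes over the same input text):

```python
# Two annotations of the same input text (same _input_hash) from
# different task runs (different _task_hash), plus one distinct input.
# All hash values are made up for illustration.
records = [
    {"text": "call me on 9123 4567", "_input_hash": 1, "_task_hash": 10},
    {"text": "call me on 9123 4567", "_input_hash": 1, "_task_hash": 11},
    {"text": "my email is a@b.com", "_input_hash": 2, "_task_hash": 12},
]

def dedupe(records, key):
    """Keep only the first record seen for each value of `key`."""
    seen, kept = set(), []
    for record in records:
        if record[key] not in seen:
            seen.add(record[key])
            kept.append(record)
    return kept

by_task = dedupe(records, "_task_hash")    # like "exclude_by": "task"
by_input = dedupe(records, "_input_hash")  # like "exclude_by": "input"
print(len(by_task), len(by_input))  # 3 2
```

Deduplicating by input hash is stricter: both annotations of the duplicated text collapse into one, which is why it is recommended for manual workflows like ner.manual.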

Can you read through this:

Namely this post:


Sure, the actual dataset name is test_dataset. Running prodigy stats test_dataset gives the dataset stats.

As seen in the Dataset Stats, the 44 comes from the "Accept" count in test_dataset. So, for the test dataset it is better to have NER entries with more "Accept" answers ... that is why it is known as the Gold standard! (to test the trained model against). :innocent:
Thanks, Ryan, for your reply and links. My doubt is cleared. Cheers! :handshake:

With kind regards,
e101sg