Dear Prodigy community,
Greetings! Beginner here... I am doing NER on text data. I've done the annotations in Prodigy and was able to train in both Prodigy and spaCy. I used the following commands: first convert to spaCy format, then run spacy train.
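(Roughly along these lines; this is a sketch, with the dataset name ner_final and the paths assumed from later in this thread.)

```
# Convert Prodigy annotations to spaCy's binary format
# (writes train.spacy, dev.spacy and a config into ./spacy_model)
prodigy data-to-spacy ./spacy_model --ner ner_final

# Train with the generated config
python -m spacy train spacy_model/config.cfg \
    --paths.train spacy_model/train.spacy \
    --paths.dev spacy_model/dev.spacy \
    --output ./output
```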
Now I have created another annotated dataset, evaluation_dataset_gold, in the database. I want to evaluate the spaCy model against this dataset without any split. How do I do this?
Can the convert command be used? Any thoughts would be highly useful. Thanks a lot.
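You can point data-to-spacy at a dedicated evaluation dataset using the eval: prefix. With the dataset names from this thread, that would look something like:

```
prodigy data-to-spacy ./spacy_model --ner ner_final,eval:evaluation_dataset_gold
```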
When you do this, your spacy_model/dev.spacy will just be your evaluation_dataset_gold as opposed to a random 20% partition of your ner_final. Then you can use the exact same spacy train or spacy benchmark commands as before, since dev.spacy is now your dedicated held-out dataset rather than a random partition of your training data.
> For each component, you can provide optional datasets for evaluation using the eval: prefix, e.g. `--ner dataset,eval:eval_dataset`. If no evaluation sets are specified, the `--eval-split` is used to determine the percentage held back for evaluation.
```
============================== Generating data ==============================
Components: ner
Merging training and evaluation data for 1 components
- [ner] Training: 590 | Evaluation: 84 (from datasets)
Training: 327 | Evaluation: 44
Labels: ner (7)
✔ Saved 327 training examples
spacy_model/train.spacy
✔ Saved 44 evaluation examples
spacy_model/dev.spacy
```
If I understand correctly, you're just looking to use your evaluation_dataset_gold as your evaluation dataset, right?
Yes, correct.
As shown in the screenshot, data-to-spacy still creates a ./train and ./dev split, though it states "Components: ner / Merging training and evaluation data for 1 components".
Not sure what the issue is. Do I need to change anything in the config settings?
Any thoughts...
Thanks!
Chandra
```
Components: ner
Merging training and evaluation data for 1 components
```
But if your data only has ner annotations, this is what I'd expect.
Did you intend for the data in your datasets to have other annotations for a different component?
Sorry, but can you clarify what the problem is?
I noticed your previous thread, which I glanced through but didn't have a chance to go through extensively. So any additional background would be greatly appreciated.
My earlier post was a bit unclear and confusing. When I re-look carefully, **./dev.spacy** has the evaluation data without a split. Here is the output:
```
============================== Generating data ==============================
Components: ner
Merging training and evaluation data for 1 components
- [ner] Training: 590 | Evaluation: 84 (from datasets)
Training: 327 | Evaluation: 44
Labels: ner (7)
✔ Saved 327 training examples
spacy_model/train.spacy
✔ Saved 44 evaluation examples
spacy_model/dev.spacy
============================= Generating config =============================
ℹ Auto-generating config with spaCy
```
Here are my observations and some doubts.
My evaluation dataset evaluation_dataset_gold has 86 text chunks in the JSONL file, but only 84 are taken here (maybe duplicates). Fine. But why is it further reduced to 44 instead of 84?
I was able to run the spacy benchmark accuracy command with --displacy-limit 86; the output (NER detections) rendered to the HTML file has only 44 lines, as expected.
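(The command was roughly like this; the model path here is a placeholder:)

```
python -m spacy benchmark accuracy ./output/model-best spacy_model/dev.spacy \
    --displacy-path ./displacy --displacy-limit 86
```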
Any thoughts?!
I suspect duplicates. Prodigy does data deduplication by default. Can you examine the annotations (e.g., run db-out and export the dataset to .jsonl) and see if you find any duplicates? They would be deduped (by default) by the input text, so you would likely have examples with duplicated values for _input_hash.
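Something like this should surface them (the grep pattern is just a quick sketch over the exported JSONL):

```
# Export the dataset to JSONL
prodigy db-out evaluation_dataset_gold > eval.jsonl
# Print any _input_hash values that occur more than once
grep -oE '"_input_hash": ?-?[0-9]+' eval.jsonl | sort | uniq -d
```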
I am sure there are no duplicates among the 86 entries. Also, I have examined the annotations by exporting them to a .jsonl file with db-out. I can see the same number (86) of _input_hash and _task_hash values. Is that normal?
I am creating the test dataset manually with the 7 target labels (Personally Identifiable Information such as Phone, Email, Passport Number, Bank Account, National ID Number, Car Plate Number, etc.) plus test chunks with no labels. For example:
{"text":"Can you transfer the funds to my bank account? Here are my bank account details: 344-33165-7"}
{"text":" my bank account details: 344-33111-9 If you need further clarification mail me"}`
The text chunks are usually short, but there are long sentences as well. Is that okay? Any suggestions?
Also, the documentation mentions exclude_by. I did not understand what it means, and I'm not sure whether it is useful for my ner.manual task.
> As of v1.9, recipes can return an optional "exclude_by" setting in their "config" to specify whether to exclude by "input" or "task" (default). Filtering and excluding by input hash is especially useful for manual and semi-manual workflows like ner.manual and ner.correct.
What do you get? It should show the number of annotations by ACCEPT, REJECT and IGNORE.
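I mean the output of prodigy stats run against your dataset, e.g.:

```
prodigy stats evaluation_dataset_gold
```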
Unfortunately, I still think the 84 -> 44 drop is due to duplicates. Are you using overlapping annotations? Maybe you could try "exclude_by": "input" (see below), but I don't think that would do anything.
For example, data-to-spacy will group all annotations with the same input hash together, so you'll get one example annotated with all categories, entities, POS tags or whatever else you labelled.
I don't see any big problems, at least none relating to your dedupes.
This is how deduplication is done: if set to the default ("exclude_by": "task"), deduplication occurs per task (individual input + unique task run). Alternatively, "exclude_by": "input" means deduplication strictly based on the input. In other words, it controls whether deduplication is done by the task_hash (i.e., "exclude_by": "task") or by the input_hash (i.e., "exclude_by": "input").
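If you want to experiment with it, you can override the setting in your prodigy.json, e.g.:

```
{
  "exclude_by": "input"
}
```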
As seen in the dataset stats, the 44 comes from the "Accept" count in the test_dataset. So for the test dataset it is better to have NER entries with more "Accept" answers... that is why it is known as a gold standard! (to test the trained model against).
Thanks, Ryan, for your reply and the links. My doubt is cleared. Cheers!