I then started batch training:
python3 -m prodigy ner.batch-train resume_ner en_core_web_lg --output resume-model --label "SKILL,ROLE,EMPLOYER" --eval-split 0.2 --n-iter 6 --batch-size 8
which results in:
ValueError: [E024] Could not find an optimal move to supervise the parser. Usually, this means the GoldParse was not correct. For example, are all labels added to the model?
Hi! Did you use the same Prodigy version for annotation and training, or did you collect the annotations in a previous version?
It's likely that this is related to the update to spaCy v2.1, which is stricter about gold standard data and constraints for the parser and named entity recognizer. See my reply from this thread:
So you might want to double-check the data and see if you have any "illegal" spans in there. It's usually pretty rare and removing them should be no problem, because in most cases, they'd be rejected suggestions anyway.
Thanks for the prompt response – it's very helpful.
In my understanding, in the trained model I annotated ‘Artificial Intelligence’ and ‘Machine Learning’ as single entities under the label SKILL. Is this the reason for the error?
If so, is there any way to handle such multi-word entities / tokens?
Multi-word entities are no problem – in fact, this is one of the key features of NER.
But spaCy now explicitly raises errors for spans that start or end with whitespace characters, or consist of only whitespace. So "Artificial Intelligence" is totally fine – but an annotated entity for "\nArtificial Intelligence" or "\n" would be invalid.
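To make that concrete, here's a hypothetical record in Prodigy's JSONL format (the text and offsets are made up for illustration). The first span slices to "\nArtificial Intelligence" because the highlight accidentally includes the leading newline, so spaCy v2.1 would reject it; the second span is the valid version of the same entity:
{"text": "Skills:\nArtificial Intelligence", "spans": [{"start": 7, "end": 31, "label": "SKILL"}]}
{"text": "Skills:\nArtificial Intelligence", "spans": [{"start": 8, "end": 31, "label": "SKILL"}]}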
I completed training the model, but it gives the error below when I try to check the model's accuracy.
ValueError: [E024] Could not find an optimal move to supervise the parser. Usually, this means the GoldParse was not correct. For example, are all labels added to the model?
I found that the error is caused by whitespace.
Is there any way to remove the whitespace? Please send me a reference link or code.
You can export your dataset by running the db-out command and then check the JSONL file:
prodigy db-out resume_ner > resume_ner.jsonl
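To spot the problematic records in that export, a quick sketch like the following should work – it assumes the standard Prodigy record format with "text" and "spans", and the file name from the command above, and just prints any annotated entity that starts or ends with whitespace (or is whitespace only):
import json

with open("resume_ner.jsonl", encoding="utf8") as f:
    for i, line in enumerate(f):
        eg = json.loads(line)
        for span in eg.get("spans", []):
            entity = eg["text"][span["start"]:span["end"]]
            # flag entities with leading/trailing whitespace, or whitespace-only entities
            if entity != entity.strip():
                print(i, repr(entity), span)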
After you’ve removed the problematic spans or have corrected them, you can then reimport the data to a new dataset:
prodigy db-in resume_ner_fixed resume_ner.jsonl
You can probably also write a script to find the problematic entities automatically and then exclude them, and add the result to a new dataset. I haven’t tested this yet, but something like this should work:
from prodigy.components.db import connect

db = connect()
examples = db.get_dataset("resume_ner")
fixed_examples = []

def is_whitespace_entity(text):
    # True if the entity text starts or ends with whitespace,
    # or consists of a single whitespace character
    whitespace = (" ", "\n")  # etc.
    if text.startswith(whitespace) or text.endswith(whitespace):
        return True
    for char in whitespace:
        if text == char:
            return True
    return False

for eg in examples:
    new_spans = []
    for span in eg.get("spans", []):
        entity = eg["text"][span["start"]:span["end"]]
        if not is_whitespace_entity(entity):  # keep only valid spans
            new_spans.append(span)
    eg["spans"] = new_spans
    fixed_examples.append(eg)

db.add_dataset("resume_ner_fixed")
db.add_examples(fixed_examples, ["resume_ner_fixed"])
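Once the fixed dataset exists, you should be able to rerun the same batch-train command from before, just pointed at the new dataset name, e.g.:
python3 -m prodigy ner.batch-train resume_ner_fixed en_core_web_lg --output resume-model --label "SKILL,ROLE,EMPLOYER" --eval-split 0.2 --n-iter 6 --batch-size 8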
@pathapatisivayya There’s not really a good general-purpose answer to that. If we could know automatically what would improve accuracy, we’d implement that as a single script.
Some things you could try:
Try running ner.train-curve to see whether accuracy improves as more data is used. If so, you can try continuing to annotate (see the example commands after this list).
If your annotations are complete, you can try adding the --no-missing flag. If they’re not complete, you can try running ner.silver-to-gold to make sure there are no missing entities.
You can try starting from a blank model, or training word vectors on your data.
You can try using spacy pretrain to learn an initial vector representation.
You can try analysing your errors, and either building a rule-based dictionary, or refining your annotation scheme.
You can try training a text classifier to filter out irrelevant texts that might distract your model.
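For the first two points, the commands would look roughly like this, reusing the dataset and model names from earlier in the thread – the exact options may differ between versions, so check prodigy ner.train-curve --help to confirm the available flags:
python3 -m prodigy ner.train-curve resume_ner en_core_web_lg --eval-split 0.2 --n-iter 6
python3 -m prodigy ner.batch-train resume_ner en_core_web_lg --output resume-model --label "SKILL,ROLE,EMPLOYER" --eval-split 0.2 --n-iter 6 --batch-size 8 --no-missing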
I’m sorry but I don’t think I understand your question.
We really can’t provide much project-specific advice, as this crosses past questions of how to use Prodigy into much more general questions about how to solve specific problems with NLP or ML technologies.
If you need urgent help with your project, you might try posting a request to hire a freelancer in the consultants thread: spaCy/prodigy consultants?