I have a query related to data annotation in Prodigy.
I have three datasets named ner_resume_person, ner_resume_org, and ner_resume_course, and I want to merge them into one. I used the db-merge command to combine them into a single dataset and then trained a spaCy model on it. However, the output is not correct: the model does not capture the course data from the dataset.
the model does not capture the course data from the dataset.
Do you mean that the scores for this category on the evaluation dataset are lower than expected? Or have you tried your model on the training dataset and found that it did not recognize any COURSE entities?
Could you share your evaluation results?
Some common issues to look out for:
Imbalanced entity distribution (for example, significantly fewer COURSE entities than PERSON or ORG, or an evaluation dataset that contains few or no COURSE entities)
Inconsistent annotation patterns (e.g., "Introduction to Python" vs just "Python" for courses)
Token boundary mismatches, especially around special characters or whitespace: if a span does not align with token boundaries, spaCy cannot use it as a training example and will discard it
I recommend you export your data to spaCy using data-to-spacy and then run spaCy data debugging tools such as spacy debug data to see if there are any structural issues in the dataset.
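For a quick first look before running those tools, the distribution and boundary checks above can be sketched in plain Python. This is only a rough sketch and not a replacement for spacy debug data. It assumes the merged dataset has been exported to records in Prodigy's usual NER task shape, with "text", "tokens", "spans" and "answer" keys. The audit_examples helper and the sample record are hypothetical and only for illustration.

```python
from collections import Counter


def audit_examples(examples):
    """Count span labels in accepted examples and collect spans whose
    character offsets do not line up with token boundaries."""
    label_counts = Counter()
    misaligned = []
    for eg in examples:
        # Skip rejected or ignored annotations; treat a missing answer as accepted.
        if eg.get("answer", "accept") != "accept":
            continue
        tokens = eg.get("tokens", [])
        token_starts = {tok["start"] for tok in tokens}
        token_ends = {tok["end"] for tok in tokens}
        for span in eg.get("spans", []):
            label_counts[span["label"]] += 1
            # A span is aligned only if it starts and ends exactly on token boundaries.
            if tokens and (span["start"] not in token_starts
                           or span["end"] not in token_ends):
                span_text = eg["text"][span["start"]:span["end"]]
                misaligned.append((span_text, span["label"]))
    return label_counts, misaligned


# Hypothetical record: the COURSE span includes a trailing space.
examples = [
    {
        "text": "Jane Doe studied Python at Acme.",
        "tokens": [
            {"text": "Jane", "start": 0, "end": 4},
            {"text": "Doe", "start": 5, "end": 8},
            {"text": "studied", "start": 9, "end": 16},
            {"text": "Python", "start": 17, "end": 23},
            {"text": "at", "start": 24, "end": 26},
            {"text": "Acme", "start": 27, "end": 31},
            {"text": ".", "start": 31, "end": 32},
        ],
        "spans": [
            {"start": 0, "end": 8, "label": "PERSON"},
            {"start": 17, "end": 24, "label": "COURSE"},
            {"start": 27, "end": 31, "label": "ORG"},
        ],
        "answer": "accept",
    }
]

label_counts, misaligned = audit_examples(examples)
print(dict(label_counts))
print(misaligned)
```

If COURSE turns out to be much rarer than the other labels, or if most of the misaligned spans are COURSE spans, that would point to the first or third issue in the list above.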