db-out and data-to-spacy results differ

Hi guys,

After doing annotations with textcat.teach, I wanted to output the data so I could train my model.
When doing a db-out, I get 247 positive matches for my label, but when doing data-to-spacy I only get 60. It's the same dataset in both cases so I don't quite understand why the results differ here.

For another label there are 53 matches with db-out and 39 with data-to-spacy...

Example commands I use:
prodigy db-out nar_5_proration ./nar_5_pro.jsonl
prodigy data-to-spacy spacy_data_nar5 --lang "en" --textcat nar_5_proration

Am I missing something?

Thanks in advance for your help!

Hi! What exactly do you mean by "positive matches"?

The data-to-spacy command does a lot more under the hood than db-out, which just dumps the contents of your dataset. It also merges all annotations on the same examples, including annotations of different types. So if you've annotated the same text multiple times with different labels and different accept/reject decisions, or maybe even NER and text classification annotations, you'll end up with only one result in your corpus that contains all annotations.
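
To make the merging idea concrete, here is a toy sketch in plain Python. It is not Prodigy's actual implementation, only an illustration of grouping annotations by input hash. The records, label names and the `merge_by_input_hash` helper are made up for this example; the `_input_hash`, `label` and `answer` fields follow the shape of textcat.teach annotations.

```python
# Hypothetical annotations: three rows about text 1, one row about text 2.
annotations = [
    {"_input_hash": 1, "text": "Fee is prorated monthly.", "label": "PRORATION", "answer": "accept"},
    {"_input_hash": 1, "text": "Fee is prorated monthly.", "label": "TERMINATION", "answer": "reject"},
    {"_input_hash": 2, "text": "Either party may terminate.", "label": "TERMINATION", "answer": "accept"},
    {"_input_hash": 1, "text": "Fee is prorated monthly.", "label": "PRORATION", "answer": "accept"},
]


def merge_by_input_hash(records):
    """Collapse all rows that share an input hash into one example (toy version)."""
    merged = {}
    for rec in records:
        if rec["answer"] not in ("accept", "reject"):
            continue  # skip ignored answers in this sketch
        entry = merged.setdefault(rec["_input_hash"], {"text": rec["text"], "cats": {}})
        # accept -> label applies, reject -> label does not apply
        entry["cats"][rec["label"]] = rec["answer"] == "accept"
    return list(merged.values())


corpus = merge_by_input_hash(annotations)
print(len(annotations), "rows ->", len(corpus), "merged examples")
```

Under these assumptions, the two accepted PRORATION rows for the same text end up as a single example, which is the kind of difference you can see between a row count and a merged corpus.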


By positive matches I mean accepted sentences. We used the same data to annotate four labels with textcat.teach separately. When exporting one dataset (i.e. the annotations for one of the labels), I expected to see the same number of accepted/rejected sentences whichever way I generate my output.
The fact that data-to-spacy merges annotations for the same text across different labels is a great feature, and that's why I would like to use it. But compared to db-out it seems like I'm "losing" accepted sentences, even though I'm not yet combining datasets at this point...

You should definitely be seeing the same number of unique sentences (or, internally, unique input hashes). So if you're using textcat.teach with different labels and multiple annotation sessions, you may be asked about the same text multiple times, and in that case, you'd only end up with that text in your corpus once, with all labels combined. If there are texts that you've annotated that don't appear in the merged corpus, that'd certainly be unexpected.
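
One way to check this is to compare the number of accepted rows in the db-out export with the number of unique input hashes among them. The sketch below is plain Python and only assumes that each exported line is a JSON object with `_input_hash`, `label` and `answer` keys. The `count_accepted` helper, the sample lines and the `PRORATION` label are hypothetical.

```python
import json


def count_accepted(lines, label):
    """Return (accepted rows, unique accepted input hashes) for one label."""
    rows = 0
    hashes = set()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines in the export
        rec = json.loads(line)
        if rec.get("label") == label and rec.get("answer") == "accept":
            rows += 1
            hashes.add(rec["_input_hash"])
    return rows, len(hashes)


# Made-up stand-in for the lines of a db-out JSONL file.
sample = [
    '{"_input_hash": 1, "text": "A", "label": "PRORATION", "answer": "accept"}',
    '{"_input_hash": 1, "text": "A", "label": "PRORATION", "answer": "accept"}',
    '{"_input_hash": 2, "text": "B", "label": "PRORATION", "answer": "accept"}',
    '{"_input_hash": 3, "text": "C", "label": "PRORATION", "answer": "reject"}',
]

rows, unique = count_accepted(sample, "PRORATION")
print(f"{rows} accepted rows, {unique} unique accepted texts")
```

To run it on a real export, you could pass an open file instead of the sample list, e.g. `count_accepted(open("nar_5_pro.jsonl", encoding="utf8"), "YOUR_LABEL")`. If the unique count is much lower than the row count, the same texts were accepted more than once, which would explain a smaller merged corpus.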