db-out and data-to-spacy results differ

Hi guys,

After doing annotations with textcat.teach, I wanted to output the data so I could train my model.
When doing a db-out, I get 247 positive matches for my label, but with data-to-spacy I only get 60. It's the same dataset in both cases, so I don't quite understand why the results differ here.

For another label there are 53 matches with db-out and 39 with data-to-spacy...

Example commands I use are:
prodigy db-out nar_5_proration ./nar_5_pro.jsonl
prodigy data-to-spacy spacy_data_nar5 --lang "en" --textcat nar_5_proration

Am I missing something?

Thanks in advance for your help!

Hi! What exactly do you mean by "positive matches"?

The data-to-spacy command does a lot more under the hood than db-out, which just dumps the contents of your dataset. It also merges all annotations on the same examples, including annotations of different types. So if you've annotated the same text multiple times with different labels and different accept/reject decisions, or maybe even NER and text classification annotations, you'll end up with only one result in your corpus that contains all annotations.
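To illustrate the idea with a simplified sketch (this isn't the actual implementation, just the merging logic in plain Python, with made-up texts and labels):

```python
# Hypothetical annotations, shaped like textcat.teach output (labels made up)
annotations = [
    {"text": "The fee is prorated monthly.", "label": "NAR_5_PRORATION", "answer": "accept"},
    {"text": "The fee is prorated monthly.", "label": "NAR_5_OTHER", "answer": "reject"},
]

merged = {}
for eg in annotations:
    # Prodigy groups by input hash; keying on the text is the same idea here
    cats = merged.setdefault(eg["text"], {})
    cats[eg["label"]] = 1.0 if eg["answer"] == "accept" else 0.0

print(merged)
# {'The fee is prorated monthly.': {'NAR_5_PRORATION': 1.0, 'NAR_5_OTHER': 0.0}}
```

So two separate rows in the dataset can become a single training example in the corpus.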

Hey,

By positive matches I mean accepted sentences. We used the same data to annotate four labels with textcat.teach separately. When exporting one dataset (i.e. the annotations for one of the labels), I expected to see the same number of accepted/rejected sentences either way I generate my output.
The fact that data-to-spacy merges annotations for the same text when annotating for different labels is a great feature, and that's why I'd like to use it. But compared to db-out, it seems like I'm "losing" accepted sentences, even though I'm not combining datasets yet at this point...
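In case it matters, a quick way to double-check the counts straight from the db-out export would be something like this (file name is just an example):

```python
import json
from collections import Counter

counts = Counter()
with open("nar_5_pro.jsonl", encoding="utf8") as f:  # file written by db-out
    for line in f:
        eg = json.loads(line)
        counts[(eg.get("label"), eg["answer"])] += 1

print(counts)
# e.g. Counter({('NAR_5_PRORATION', 'accept'): 247, ('NAR_5_PRORATION', 'reject'): ...})
```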

You should definitely be seeing the same number of unique sentences (or, internally, unique input hashes). So if you're using textcat.teach with different labels and multiple annotation sessions, you may be asked about the same text multiple times, and in that case, you'd only end up with that text in your corpus once, with all labels combined. If there are texts that you've annotated that don't appear in the merged corpus, that'd certainly be unexpected.
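If you want to verify this, a quick sanity check on the exported JSONL is to count the unique input hashes and texts (sketch below, the file name is just an example). That number is what should line up with the examples data-to-spacy produces:

```python
import json

hashes, texts = set(), set()
with open("nar_5_pro.jsonl", encoding="utf8") as f:  # file written by db-out
    for line in f:
        eg = json.loads(line)
        hashes.add(eg.get("_input_hash"))
        texts.add(eg["text"])

print(len(hashes), "unique input hashes,", len(texts), "unique texts")
```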