db-in command imports everything as "accept"


I am using the db-in command to load some labelled examples. My data format looks like this:

{"text":"each real estate agency, broker, finder, consultant or","label":"RELEVANT","spans":[{"start":10,"end":16,"label":"RESIDENTIAL"},{"start":25,"end":31,"label":"ORG"}],"meta":{"source":"Offers","file_id":"bef00ea8-2c66-45f8-bdc1-a809a42ccbfd"},"answer":"reject"}

{"text":". All Rights Reserved. CBRE LIMITED | Real Estate Agency","label":"RELEVANT","spans":[{"start":2,"end":21,"label":"ORG"},{"start":23,"end":27,"label":"ORG"},{"start":43,"end":49,"label":"RESIDENTIAL"}],"meta":{"source":"Offers","file_id":"5819172a-c6c5-4e33-8224-30ef41a47bd1"},"answer":"accept"}

I load the data using the command:

prodigy db-in relevant_classifier relevant_classifier.jsonl

but then every answer is imported as "accept".

Am I doing something incorrectly or is this a bug in the db-in command?

What's the message you see after importing? And when you look at the imported examples in your dataset, do you see the same examples you've imported, but with a different answer?

The db-in command should only add the "answer" key if an example doesn't have it set, or if you're explicitly setting --overwrite. If you look at the source of the recipe in prodigy/__main__.py, you'll also see that, without --overwrite, the answer is only added if "answer" not in task. So if you want to double-check whether examples in your input data are being overwritten, you could add a print statement there and see where answers get added.
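To make the rule above concrete, here is a minimal sketch in plain Python. It is not the actual code from prodigy/__main__.py: the function name add_default_answers and its parameters are hypothetical, and it only mirrors the behaviour described in this reply (add the answer when the key is missing, or always when overwriting).

```python
import copy


def add_default_answers(tasks, answer="accept", overwrite=False):
    """Hypothetical helper mirroring the rule described above.

    Returns the updated tasks and the number of tasks that had the
    answer added (or replaced, when overwrite is True).
    """
    updated = []
    n_added = 0
    for task in tasks:
        task = copy.deepcopy(task)  # leave the caller's data untouched
        if overwrite or "answer" not in task:
            task["answer"] = answer
            n_added += 1
        updated.append(task)
    return updated, n_added


examples = [
    {"text": "first", "answer": "reject"},
    {"text": "second", "answer": "accept"},
    {"text": "third"},  # no answer set
]

kept, n_kept = add_default_answers(examples)
print(f"Added 'accept' answer to {n_kept} annotations")

forced, n_forced = add_default_answers(examples, overwrite=True)
print(f"Added 'accept' answer to {n_forced} annotations")
```

In this sketch, the first call only touches the example without an "answer" key, while the second call replaces every answer, including the existing "reject".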

If I use the --overwrite flag I get:
Added 'accept' answer to 296 annotations

If I don't include the flag I get:
Added 'accept' answer to 0 annotations

Thanks for checking! This makes sense then – if you're not overwriting the answer, the existing answer is used and "accept" (or whatever else you define on the command line) isn't added to any of the examples. If you're overwriting, it's added to all of them. If you check out the imported examples in the dataset, they should all have their original answer when you're not setting --overwrite.

(Btw, not sure if you've been using --dry to perform a dry run when testing the importing. But if not, you probably want to start with a fresh dataset again to make sure you don't have the same data imported several times with different answers.)
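If you want a quick way to check which answers a JSONL file contains (either your input file, or an export of the dataset created with prodigy db-out), something like the following sketch should work. It uses only the standard library; the function name count_answers and the "<missing>" placeholder are just illustrative choices, and the file name in the commented usage assumes the file from the original post.

```python
import json
from collections import Counter


def count_answers(lines):
    """Count the "answer" values in an iterable of JSONL lines."""
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines between records
        task = json.loads(line)
        counts[task.get("answer", "<missing>")] += 1
    return counts


# Small in-memory sample so the sketch runs without any files.
sample = [
    '{"text": "a", "answer": "reject"}',
    "",
    '{"text": "b", "answer": "accept"}',
    '{"text": "c"}',
]
print(count_answers(sample))

# Usage with a real file (path is an assumption):
# with open("relevant_classifier.jsonl", encoding="utf8") as f:
#     print(count_answers(f))
```

Comparing the counts for the input file with the counts for an export of the dataset shows whether any answers changed during the import.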

Ah OK, I get it now.

When it says Added 'accept' answer to 0 annotations, that doesn't mean every imported example was marked as "reject". It means that each example that already had an "answer" key was imported with that key's existing value.

So does that mean that if I want to re-import data whose "accept"/"reject" answers I have updated in some way, but with the same "text" values, I should create a new dataset instead, since db-in will append the data (even if it is the same) and the --overwrite flag will only replace the answers with what I specify on the command line?

Thanks very much for your help!

Yes, exactly – and now I also understand where the confusion came from, sorry! I think it would be clearer if the output also said something like "Kept existing answer for X annotations". The main purpose of adding the answer is to make sure each annotation has one – because that's one of the built-in assumptions in Prodigy.

Yes, datasets are append-only in general and Prodigy will never silently overwrite your data. If you want to edit existing annotations in a dataset, you should create a new one (and delete the old one if you like – but you might also keep it so you can always go back and reproduce previous results if you made a mistake).

The --overwrite flag will overwrite the "answer" values with the value of the --answer argument you set on the command line (by default "accept"). So even if one of your examples specifies "answer": "reject", it'll be imported with "answer": "accept".
