db-in command imports everything as "accept"

denyed · September 24, 2019, 7:34am

Hi,

I am using the db-in command to load some labelled examples. My data format looks like this:

{"text":"each real estate agency, broker, finder, consultant or","label":"RELEVANT","spans":[{"start":10,"end":16,"label":"RESIDENTIAL"},{"start":25,"end":31,"label":"ORG"}],"meta":{"source":"Offers","file_id":"bef00ea8-2c66-45f8-bdc1-a809a42ccbfd"},"answer":"reject"}

{"text":". All Rights Reserved. CBRE LIMITED | Real Estate Agency","label":"RELEVANT","spans":[{"start":2,"end":21,"label":"ORG"},{"start":23,"end":27,"label":"ORG"},{"start":43,"end":49,"label":"RESIDENTIAL"}],"meta":{"source":"Offers","file_id":"5819172a-c6c5-4e33-8224-30ef41a47bd1"},"answer":"accept"}

I load the data using the command:

prodigy db-in relevant_classifier relevant_classifier.jsonl

but then every answer is imported as "accept".

Am I doing something incorrectly or is this a bug in the db-in command?

ines · September 24, 2019, 7:55am

What's the message you see after importing? And when you look at the imported examples in your dataset, do you see the same examples you've imported, but with a different answer?

The db-in command should only add the "answer" key if an example doesn't have it set, or if you're explicitly setting --overwrite. If you look at the source of the recipe in prodigy/__main__.py, you'll also see that the answer is only added if "answer" not in task. So if you want to double-check whether examples in your input data are overwritten, you could add a print statement there and see where answers get added.
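To make the rule concrete, here is a minimal sketch of the behaviour described above. It is not the actual recipe source: the function name set_answers and its arguments are hypothetical, and only the condition itself (add the answer if it's missing, or if overwriting) comes from the description.

```python
def set_answers(tasks, answer="accept", overwrite=False):
    """Hypothetical helper: add `answer` only where it is missing,
    or everywhere if `overwrite` is set. Returns (tasks, number changed)."""
    added = 0
    result = []
    for task in tasks:
        task = dict(task)  # copy so the input examples are left untouched
        if overwrite or "answer" not in task:
            task["answer"] = answer
            added += 1
        result.append(task)
    return result, added


examples = [
    {"text": "first example", "answer": "reject"},
    {"text": "second example", "answer": "accept"},
    {"text": "third example"},  # no answer yet
]

kept, n_added = set_answers(examples)
forced, n_forced = set_answers(examples, overwrite=True)
print(n_added, n_forced)
```

Without overwriting, only the example that has no "answer" is changed and the existing "reject" is kept. With overwriting, all three end up as "accept".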

denyed · September 24, 2019, 8:17am

If I use the --overwrite flag I get:
Added 'accept' answer to 296 annotations

If I don't include the flag I get:
Added 'accept' answer to 0 annotations

ines · September 24, 2019, 8:37am

Thanks for checking! This makes sense then – if you're not overwriting the answer, the existing answer is used and "accept" (or whatever else you define on the command line) isn't added to any of the examples. If you're overwriting, it's added to all of them. If you check out the imported examples in the dataset, they should all have their original answer when you're not setting --overwrite.

(Btw, not sure if you've been using --dry to perform a dry run when testing the importing. But if not, you probably want to start with a fresh dataset again to make sure you don't have the same data imported several times with different answers.)
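If you want to verify the answers outside of Prodigy, one option is to export the dataset to JSONL (for example with db-out) and count the "answer" values. The snippet below is a rough sketch using only the standard library. count_answers is a hypothetical helper and the sample lines are made up for illustration.

```python
import json
from collections import Counter


def count_answers(lines):
    """Count the "answer" values in an iterable of JSONL lines."""
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:  # skip blank lines between records
            continue
        task = json.loads(line)
        counts[task.get("answer", "<missing>")] += 1
    return counts


# Made-up sample standing in for an exported file
sample = [
    '{"text": "first", "answer": "reject"}',
    '{"text": "second", "answer": "accept"}',
    "",
    '{"text": "third"}',
]

counts = count_answers(sample)
print(dict(counts))
```

For a real file, you could pass an open file handle instead of the list, since iterating over a file also yields one line at a time.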

denyed · September 24, 2019, 8:46am

Ah Ok, I get it now.

When it says Added 'accept' answer to 0 annotations, that doesn't mean that every imported example was marked as "reject". It means that each example that already has an "answer" key was imported with its existing value.

So does that mean that if I want to re-import data with updated "accept"/"reject" answers for the same "text" values, I should create a new dataset? As I understand it, db-in will append the data (even if it is the same), and the --overwrite flag will only replace the answers with what I specify on the command line.

Thanks very much for your help!

ines · September 24, 2019, 9:04am

Yes, exactly – and now I also understand where the confusion came from, sorry! I think it would be clearer if the output also said something like "Kept existing answer for X annotations". The main purpose of adding the answer is to make sure each annotation has one – because that's one of the built-in assumptions in Prodigy.

Yes, datasets are append-only in general and Prodigy will never silently overwrite your data. If you want to edit existing annotations in a dataset, you should create a new one (and delete the old one if you like – but you might also keep it so you can always go back and reproduce previous results if you made a mistake).
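If you'd like to prepare the updated file before importing it into a fresh dataset, something along these lines could work. It's only a sketch under a few assumptions: latest_by_text is a hypothetical helper, it treats "text" as the key that identifies an example, and it keeps the last version it sees for each text.

```python
import json


def latest_by_text(tasks):
    """Keep one example per "text", preferring the last one seen."""
    latest = {}
    for task in tasks:
        latest[task["text"]] = task  # later versions replace earlier ones
    return list(latest.values())


# Made-up data: the second list changes one answer
old = [
    {"text": "first", "answer": "reject"},
    {"text": "second", "answer": "accept"},
]
updated = [
    {"text": "first", "answer": "accept"},
]

merged = latest_by_text(old + updated)
jsonl = "\n".join(json.dumps(task) for task in merged)
print(jsonl)
```

The resulting lines can be written to a .jsonl file and imported into the new dataset without --overwrite, so the answers in the file are the ones that are stored.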

The --overwrite flag will overwrite the "answer" values with the value of the --answer argument you set on the command line (by default "accept"). So even if one of your examples specifies "answer": "reject", it'll be imported with "answer": "accept".
