Another thing; the db-in examples are being presented as tasks as well. For some reason they are getting different hashes during db-in compared to when I run my recipe.
Apart from the internal _ fields (_view_id, _input_hash, _task_hash and _session_id), the two versions are 100% equal. How do I eliminate the duplicate tasks?
   _input_hash                         _session_id  _task_hash  _view_id  amount  answer  body \
0    499936253  text-headlines-no-adjusted-default  -849168109      html  3600.0  accept  <body><p></p>\n<p></p>\n<p></p><p><span>Demand...
1    429317851                                 NaN  1960480751       NaN  3600.0  accept  <body><p></p>\n<p></p>\n<p></p><p><span>Demand...

   company  currency             label                        meta         metric  modifier  multiplier  period             published \
0  Getinge       SEK  correct-headline  {'id': 'e7c0aca51ac32042'}  PRETAX_PROFIT      None     MILLION  YEARLY  2013-Jan-11 07:33:16
1  Getinge       SEK  correct-headline  {'id': 'e7c0aca51ac32042'}  PRETAX_PROFIT      None     MILLION  YEARLY  2013-Jan-11 07:33:16

                                            title  year
0  Getinge announces preliminary results for 2012  2012
1  Getinge announces preliminary results for 2012  2012
In general I have issues with the same task appearing multiple times although I have
From the example you posted, it looks like you imported 96 examples, of which 94 had "answer": "accept" set and 2 had "answer": "reject"? If you load in examples without an "answer", Prodigy will automatically add one so you can use the imported data for training etc. But that doesn't seem to have been necessary in your case.
By default, set_hashes uses the following setting for the input keys (task properties used to generate the input hash) and task keys (task properties used to generate the task hash):
So one thing you probably want to do is customise those when you call set_hashes to include your custom fields (at least, the ones you want to take into account).
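For instance, here's a minimal sketch of how that could look, assuming Prodigy is installed and that your version of set_hashes accepts the input_keys, task_keys and overwrite arguments. The choice of fields is only a guess based on the columns in your example, and rehash is a hypothetical helper name, so adjust both to whatever should count as "the same task" for you.

```python
# Fields taken from the columns in the example above (an assumption:
# pick whichever fields should define "same input" and "same task").
INPUT_KEYS = ("title", "body")
TASK_KEYS = ("label", "company", "metric", "period", "year", "amount")


def rehash(task):
    """Recompute both hashes of a task from the custom fields only.

    Hypothetical helper: it just forwards to Prodigy's set_hashes with
    custom keys and overwrites any hashes the task already carries.
    """
    from prodigy import set_hashes  # imported lazily, requires Prodigy

    return set_hashes(
        task,
        input_keys=INPUT_KEYS,
        task_keys=TASK_KEYS,
        overwrite=True,
    )
```

You'd apply the same helper to the examples before importing them with db-in and to the tasks your recipe streams, so that both sides end up with hashes built from the same fields.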
Because in your example, you actually don't have any default input keys or default task keys present – and I think that might be the problem here. If no keys are found that can be used to generate the hash, Prodigy will simply dump the whole task as JSON, just to be 100% safe. So this can potentially lead to different hashes for tasks with only very minor differences (like the view ID etc.).
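To make that effect a bit more concrete, here's a small stand-in written in plain Python. It is not Prodigy's actual implementation: sketch_hash is a hypothetical function that only mimics the behaviour described above (hash the selected keys, otherwise fall back to dumping the whole task), and the two tasks are cut down to a couple of fields from your example.

```python
import hashlib
import json


def sketch_hash(task, keys):
    """Hash only the values of `keys`; fall back to the whole task if none are present."""
    found = {k: task[k] for k in keys if k in task}
    payload = found if found else task  # the fallback includes internal "_" fields
    data = json.dumps(payload, sort_keys=True)
    return hashlib.sha256(data.encode("utf8")).hexdigest()[:12]


# The same example, once as streamed by a recipe and once as imported via db-in.
from_recipe = {
    "title": "Getinge announces preliminary results for 2012",
    "company": "Getinge",
    "_view_id": "html",
    "_session_id": "text-headlines-no-adjusted-default",
}
from_db_in = {
    "title": "Getinge announces preliminary results for 2012",
    "company": "Getinge",
}

missing_keys = ("text",)            # not present in either task
custom_keys = ("title", "company")  # present in both tasks

# No key found: the whole task is hashed, so the internal fields change the result.
print(sketch_hash(from_recipe, missing_keys) == sketch_hash(from_db_in, missing_keys))  # False
# Custom keys found: only those values are hashed, so both versions match.
print(sketch_hash(from_recipe, custom_keys) == sketch_hash(from_db_in, custom_keys))  # True
```

With keys that actually exist in the task, the internal fields no longer take part in the hash, which is why two otherwise identical examples stop looking like different tasks.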