db-in showing wrong number of accepted answers

The output looks wrong, right?

(Annotations) ➜  Annotations git:(master) ✗ prodigy db-in text-headlines-no-adjusted new.jsonl 

  ✨  Imported 96 annotations for 'text-headlines-no-adjusted' to database
  PostgreSQL
  Added 'accept' answer to 0 annotations
  Session ID: 2019-07-14_09-54-46

(Annotations) ➜  Annotations git:(master) ✗ prodigy stats text-headlines-no-adjusted

  ✨  Prodigy stats

Version          1.8.3                         
Location         /Users/nixd-mac/Projects/PLX/venvs/Annotations/lib/python3.7/site-packages/prodigy
Prodigy Home     /Users/nixd-mac/.prodigy      
Platform         Darwin-18.6.0-x86_64-i386-64bit
Python Version   3.7.3                         
Database Name    PostgreSQL                    
Database Id      postgresql                    
Total Datasets   3                             
Total Sessions   745                           


  ✨  Dataset 'text-headlines-no-adjusted'

Dataset       text-headlines-no-adjusted
Created       2019-07-14 09:54:38       
Description   None                      
Author        None                      
Annotations   96                        
Accept        94                        
Reject        2                         
Ignore        0

Another thing: the examples imported via db-in are being served as tasks again. For some reason they get different hashes during db-in than when my recipe generates them.

Excluding the internal underscore fields (_view_id, _input_hash, _task_hash and _session_id), they are 100% identical. How do I eliminate the duplicate tasks?

field         row 0                                               row 1
_input_hash   499936253                                           429317851
_session_id   text-headlines-no-adjusted-default                  NaN
_task_hash    -849168109                                          1960480751
_view_id      html                                                NaN
amount        3600.0                                              3600.0
answer        accept                                              accept
body          <body><p></p>\n<p></p>\n<p></p><p><span>Demand...   <body><p></p>\n<p></p>\n<p></p><p><span>Demand...
company       Getinge                                             Getinge
currency      SEK                                                 SEK
label         correct-headline                                    correct-headline
meta          {'id': 'e7c0aca51ac32042'}                          {'id': 'e7c0aca51ac32042'}
metric        PRETAX_PROFIT                                       PRETAX_PROFIT
modifier      None                                                None
multiplier    MILLION                                             MILLION
period        YEARLY                                              YEARLY
published     2013-Jan-11 07:33:16                                2013-Jan-11 07:33:16
title         Getinge announces preliminary results for 2012      Getinge announces preliminary results for 2012
year          2012                                                2012

In general, I have issues with the same task appearing multiple times, even though I have

exclude=['my-dataset']

and

yield prodigy.set_hashes({
    'published': es.get('metadata.published').strftime('%Y-%b-%d %H:%M:%S'),
    'title': es.get('title'),
    'body': es.get('body'),
    'company': str(headline.company),
    'year': str(headline.year),
    'period': str(headline.period),
    'metric': str(headline.metric),
    'amount': str(headline.amount),
    'multiplier': str(headline.multiplier),
    'modifier': str(headline.modifier),
    'currency': str(headline.currency),
    'meta': {
        'id': es.id,
    }
})

on each example in my stream generator.

From the example you posted, it looks like you imported 96 examples, 94 of which already had "answer": "accept" set and 2 of which had "answer": "reject"? If you load in examples without an "answer", Prodigy will automatically add one so you can use the imported data for training etc. But that didn't seem necessary in your case.
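
If you want to double-check that, a quick way is to count the "answer" values that are already present in new.jsonl before importing (just a sketch):

import json
from collections import Counter

# Sketch: count the "answer" values already set in the file you imported
with open("new.jsonl", encoding="utf-8") as f:
    counts = Counter(json.loads(line).get("answer") for line in f if line.strip())

print(counts)  # expected something like Counter({'accept': 94, 'reject': 2})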

By default, set_hashes uses the following settings for the input keys (task properties used to generate the input hash) and task keys (task properties used to generate the task hash):

input_keys=("text", "image", "html", "input"),
task_keys=("spans", "label", "options"),

So one thing you probably want to do is customise those when you call set_hashes to include your custom fields (at least, the ones you want to take into account).

Because in your example, you actually don’t have any default input keys or default task keys present – and I think that might be the problem here. If no keys are found that can be used to generate the hash, Prodigy will simply dump the whole task as JSON, just to be 100% safe. So this can potentially lead to different hashes for tasks with only very minor differences (like the view ID etc.).
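
For example, something along these lines should produce stable hashes across db-in and your recipe (just a sketch; the key names below are the fields from your example, so pick whichever ones actually define a unique input and a unique question for you):

import prodigy

def add_hashes(stream):
    # Sketch: re-hash each task based on the custom fields shown above.
    # Adjust input_keys / task_keys to whatever makes an input / question unique for you.
    for eg in stream:
        yield prodigy.set_hashes(
            eg,
            input_keys=("title", "body", "published"),
            task_keys=("label", "metric", "period", "year"),
            overwrite=True,  # recompute hashes even if the task already has them
        )

If the same keys are used consistently (both in your recipe and before importing with db-in), identical examples should end up with identical hashes, which is what the exclude and dedup logic compares against.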