Dear Prodigy Team,
We are using Prodigy to classify each of a set of Chinese-language documents into one of n buckets. Our problem: we want our human labeller to label each document only once, but sometimes the UI presents the same document several times. We can see these repeated suggestions by inspecting the table Prodigy writes to after we have labelled a sequence of documents. We did some analysis on this table using pandas (see below for code and output). Based on this, we believe the problem may be related to the batch_size parameter: each document seems to be presented twice, often exactly but always roughly batch_size clicks apart. For example, if batch_size is 5, the 6th document seen by the labeller will be the same as the 11th document. We verified this for batch_size values of 5 and 10. Note that Prodigy itself is aware of the duplication, in the sense that it assigns the same _input_hash value to identical documents. We may simply be using the wrong settings, either in the config file or the recipe itself (we have reproduced these below). In addition to batch_size, so far we have tried adjusting the following settings:
feed_overlap
instant_submit
However, no combination of settings solved the problem. We also noticed that submitting answers very quickly in the UI seems to generate more repeats in a row. Below are the recipe and config files, along with the Python code we used to investigate the repeated suggestions, which revealed the “every 5th document” issue.
The code we used to discover duplicates is below:
from prodigy.components.db import connect
import pandas as pd
db = connect()
df = pd.DataFrame(db.get_dataset('debug'))
# For each _input_hash, count the rows in the dataframe that share it.
dupes_dict = {input_hash: len(df[df['_input_hash'] == input_hash])
              for input_hash in df['_input_hash'].unique()}
# Only keep hashes that occur more than once.
dupes_only_dict = {k: v for k, v in dupes_dict.items() if v > 1}
# Record the original row order so repeats can be compared by position.
df['idx'] = df.index
df[df['_input_hash'].isin(dupes_only_dict.keys())].sort_values(
    by=['_input_hash', 'idx'])[['text', '_input_hash', '_task_hash', 'accept', 'answer', '_timestamp']]
Result (with batch_size set to 5 and instant_submit set to false):
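The spacing between repeats can also be measured directly rather than read off the table. The following is a minimal sketch, not part of our original analysis: the helper name repeat_gaps is hypothetical, and the records are made-up stand-ins for the rows returned by db.get_dataset('debug'), assumed to be in annotation order.

```python
import pandas as pd


def repeat_gaps(records):
    """For each repeated _input_hash, return the row distance between
    its first and second occurrence (rows assumed to be in annotation order)."""
    df = pd.DataFrame(records)
    df["idx"] = range(len(df))
    # Keep only rows whose _input_hash occurs more than once.
    dupes = df[df.duplicated("_input_hash", keep=False)]
    gaps = {}
    for input_hash, group in dupes.groupby("_input_hash"):
        positions = group["idx"].tolist()
        gaps[input_hash] = positions[1] - positions[0]
    return gaps


# Hypothetical data: the 6th row (hash 6) is shown again as the 11th row.
records = [{"_input_hash": h} for h in [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 6, 11]]
gaps = repeat_gaps(records)
```

In this made-up example the only repeated hash is 6, which sits in the 6th and 11th rows, so its gap is 5. If the batch_size hypothesis holds, the gaps computed on the real dataset should cluster around the configured batch_size.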
This is the recipe used:
import prodigy
from prodigy.components.filters import filter_duplicates
from prodigy.components.loaders import JSONL


@prodigy.recipe('debug')
def debug_labeller():
    dataset = 'debug'
    source = 'data/debug_dataset.jsonl'

    choices = [
        dict(id='mneg', text='Negative'),
        dict(id='mneu', text='Neutral'),
        dict(id='mpos', text='Positive')
    ]

    def add_options(stream):
        for eg in stream:
            eg['options'] = choices
            yield eg

    stream = JSONL(source)
    stream = (prodigy.set_hashes(eg, input_keys=('title', 'text'))
              for eg in stream)
    stream = filter_duplicates(stream, by_input=True, by_task=False)
    stream = add_options(stream)
    stream = list(stream)

    config = {
        'blocks': [{
            'view_id': 'html',
            'html_template': '<h3>{{title}}</h3>'
        }, {
            'view_id': 'choice',
        }],
        'instructions': './docs/instructions/instructions.html',
        'choice_style': 'multiple',
        'choice_auto_accept': False,
        'feed_overlap': False,
        'port': 8023
    }

    return {
        'dataset': dataset,
        'exclude': [dataset],
        'stream': stream,
        'view_id': 'blocks',
        'config': config
    }
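To rule out the source file itself as the cause, one could check that no (title, text) pair occurs more than once in the JSONL. This is a minimal sketch under stated assumptions: find_duplicate_inputs is a hypothetical helper, it compares raw field values rather than Prodigy's own hashes, and the sample lines are invented rather than taken from data/debug_dataset.jsonl.

```python
import json
from collections import Counter


def find_duplicate_inputs(lines, keys=("title", "text")):
    """Return the combinations of `keys` that occur more than once,
    mapped to the number of times they occur."""
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        eg = json.loads(line)
        counts[tuple(eg.get(k) for k in keys)] += 1
    return {combo: n for combo, n in counts.items() if n > 1}


# Invented sample: the first and third lines share title and text.
sample_lines = [
    '{"title": "A", "text": "first"}',
    '{"title": "B", "text": "second"}',
    '{"title": "A", "text": "first"}',
]
duplicates = find_duplicate_inputs(sample_lines)
```

The same function accepts an open file handle, since iterating a text file yields its lines. An empty result on the real file would suggest the repeats are introduced after loading rather than by the data.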
This is our prodigy.json config:
{
  "theme": "basic",
  "custom_theme": {
    "cardMaxWidth": 1920
  },
  "batch_size": 5,
  "history_size": 10,
  "host": "0.0.0.0",
  "cors": true,
  "db": "postgresql",
  "db_settings": {
    "postgresql": {
      "details": "omitted"
    }
  },
  "keymap": {
    "accept": ["space"],
    "save": ["command+space"],
    "ignore": ["i"]
  },
  "validate": true,
  "auto_exclude_current": true,
  "instant_submit": true,
  "feed_overlap": false,
  "auto_count_stream": true,
  "ui_lang": "en",
  "project_info": [
    "dataset",
    "session",
    "lang",
    "recipe_name",
    "view_id",
    "label"
  ],
  "show_stats": true,
  "hide_meta": false,
  "show_flag": false,
  "swipe": true,
  "swipe_gestures": { "right": "accept", "left": "reject" },
  "split_sents_threshold": false,
  "global_css": null,
  "javascript": null,
  "writing_dir": "ltr",
  "show_whitespace": false,
  "exclude_by": "input"
}
Thanks,
David