@DanilKonon thanks for the extra information! I used the annotations and prodigy.json configuration you linked, but I still can't reproduce the duplicates.

I actually think I need a few lines from `./michaelkors.jsonl` instead of the db-out annotations to try reproducing your problem. As you pointed out, when the server receives duplicates it only stores one of them in the database, so exporting the annotations with db-out doesn't include the duplicate that caused the problem. Could you attach the first 100 lines of `./michaelkors.jsonl`, even if they only contain file paths instead of image data?
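If it helps, a quick way to produce a shareable sample is `head`. This sketch uses a throwaway `demo.jsonl` so it runs anywhere; substitute your real `./michaelkors.jsonl` path:

```shell
# Create a stand-in file with 250 JSONL lines (replace with your real data).
printf '{"image": "img-%s.jpg"}\n' $(seq 1 250) > demo.jsonl

# Keep only the first 100 lines for sharing.
head -n 100 demo.jsonl > sample.jsonl

wc -l < sample.jsonl  # prints 100
```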
Since it hasn't been easy to reproduce the problem, can I ask about the latency to your MySQL database? Latency can be hard to estimate by hand, so I wrote a small recipe that reports on it. If you could run it and let me know the results, that would be helpful:
`timing.py`

```python
import time

from prodigy.components.db import connect
from prodigy.core import recipe
from prodigy.util import color, get_timestamp_session_id, msg, set_hashes


@recipe("timing")
def timing():
    """Report on the observed latency of database operations."""

    def to_elapsed(time1, time2) -> str:
        delta = time2 - time1
        text = "{:.3f}s".format(delta)
        if delta < 1.0:
            colr = "green"
        elif delta < 3.0:
            colr = "yellow"
        else:
            colr = "red"
        return color(text, colr)

    DB = connect()
    num = 250
    annotations = [
        {"text": f"Text {i}", "answer": "accept", "label": "TEST_LABEL"}
        for i in range(num)
    ]
    annotations = [set_hashes(eg) for eg in annotations]
    set_id = f"health-recipe-{get_timestamp_session_id()}"
    if set_id in DB:
        DB.drop_dataset(set_id)

    # Time creating an empty dataset
    start = time.time()
    DB.add_dataset(set_id)
    stop = time.time()
    t_add_dataset = to_elapsed(start, stop)

    # Time inserting the examples
    start = time.time()
    DB.add_examples(annotations, datasets=[set_id])
    stop = time.time()
    t_add_examples = to_elapsed(start, stop)

    # Time dropping the dataset again
    start = time.time()
    DB.drop_dataset(set_id, batch_size=100)
    stop = time.time()
    t_drop_dataset = to_elapsed(start, stop)

    msg.divider("Database Timing")
    msg.table(
        [
            ["Create a Dataset", t_add_dataset],
            [f"Add {num} Examples", t_add_examples],
            ["Drop the Dataset", t_drop_dataset],
        ]
    )
```
You can run it like:

```
prodigy timing -F timing.py
```
It should print some results like:

```
============================== Database Timing ==============================

Create a Dataset    0.004s
Add 250 Examples    0.120s
Drop the Dataset    0.117s
```