Not all annotated samples received using Prodigy SDK

I am using the following code to retrieve annotated samples via the Prodigy SDK:

from prodigy.components.db import connect

db = connect("mysql", {
    "host": "my_host_name",
    "user": "my_user_name",
    "passwd": "my_password",
    "db": "prodigy"
    }
)

for sample in db.get_dataset("my_dataset_name"):
    print(sample)

but I do not get all the annotated samples. I came to this conclusion because when I connect directly to the "prodigy" database using MySQL and query the "example" table, filtering by the same dataset name that was used in the SDK ("my_dataset_name"), I get more samples. Additional info: sessions were used during the annotation process. Thank you!

Hi! What query did you run to filter the examples when you accessed the MySQL database directly?

Prodigy's default database model stores all annotations you collect in the example table and then uses the link table to link the example IDs to one or more datasets. This way, the same example can appear in multiple datasets but only has to be stored once. This is something you need to consider when working with the database directly. The source of the database handler is also shipped with Prodigy, so you can look at the get_dataset method in prodigy/components/db.py to see the query that's used (via the peewee ORM).
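For illustration, here's a minimal sketch of what that link-based lookup looks like with the peewee models directly (assuming Example, Link and Dataset can be imported from prodigy.components.db, where they're defined; the connection settings are the placeholder values from the snippet above):

from prodigy.components.db import connect, Example, Link, Dataset

# Sketch only: uses the peewee models defined in prodigy/components/db.py.
db = connect("mysql", {
    "host": "my_host_name",
    "user": "my_user_name",
    "passwd": "my_password",
    "db": "prodigy",
})
# Resolve the dataset row by name, then follow the link table to its examples.
ds = Dataset.get(Dataset.name == "my_dataset_name")
examples = Example.select().join(Link).join(Dataset).where(Dataset.id == ds.id)
for eg in examples:
    print(eg.content)  # raw task JSON, stored once per unique example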

Does it mean that Prodigy understands that it is the same sample, even though it was used in various datasets? Does Prodigy do content matching between a new sample and the samples already in the database?

First, I downloaded all the samples by executing the SQL command "select content from example".
Then I manually filtered by dataset name and session name in order to get the desired samples.
And the strange thing is that the missing samples do appear in the input dataset (when I ran the Prodigy server), and they do appear in the data downloaded by the SQL query (with the manual filtering) above, but they do not appear when I use the Prodigy SDK.

This is all decided based on the task_hash, which is also how Prodigy can detect duplicates internally. The task hash is assigned based on the raw input and annotations, and can also be customised in the recipes (e.g. if your recipe assigns custom properties and you want to implement your own logic to determine whether two tasks are the same).
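As a rough illustration, here's what that looks like with set_hashes from prodigy.util (a sketch assuming the default input and task keys, where the input hash covers the raw input such as "text" and the task hash additionally covers task properties such as "label"):

from prodigy.util import set_hashes

a = set_hashes({"text": "Example 1", "label": "PERSON"})
b = set_hashes({"text": "Example 1", "label": "ORG"})
print(a["_input_hash"] == b["_input_hash"])  # True: same raw input
print(a["_task_hash"] == b["_task_hash"])    # False: different label, so a different task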

So how did you filter by dataset? Did you match up the example ID with the link ID, and then look that up in the dataset table? Or did you just go by the _session_id in the JSONL data?

Here's the query peewee executes in get_dataset:

query = (
    Example.select().join(Link).join(Dataset).where(Dataset.id == dataset.id)
)

Do you have an example of a task that's in the database and input data but not returned when the DB is queried by peewee? Is there anything that stands out or looks different about it?

Yes, I see that a sample with a certain _session_id appears in the data when I download it via SQL with a direct connection to the prodigy database. The problem occurs in the first case, when the get_dataset function is called with the dataset name name=[dataset_name]. I expect to receive all the annotated samples from all the sessions (annotator users), but as I already mentioned, not all the samples are returned, and that sample is missing. But in the second case, calling get_dataset with name=[dataset_name + session_name], the desired sample does appear! And that is what I expect to see. Is this the correct Prodigy behavior?

Hi @Yuri,

I'm trying to reproduce the problem you're having, and I need some help. I tried to reproduce your export/filter test, but the SQL command you gave (select content from example;) doesn't have the information needed to filter examples to a dataset or session. Can you share the complete SQL command you used?

To further debug, I set up a Prodigy recipe file as a test case below. I hope you can help tweak it so that I get results similar to yours. The file contains two recipes (get_dataset_api and get_dataset_sql) that query the database to get examples from a generated test dataset. The dataset has answers from multiple sessions, and answers without a session.

recipe.py

import prodigy

dataset = "test-dataset"
num_sessions = 5
examples_per_session = 10


@prodigy.recipe("get_dataset_api")
def get_dataset_using_db_adapter():

    db: prodigy.components.db.Database = prodigy.components.db.connect()
    make_dataset(db)
    # Get all examples
    all_examples = db.get_dataset(dataset)
    print(f"Database: get_dataset({dataset}) = {len(list(all_examples))}")
    # for ex in all_examples:
    #     print(ex)

    # Show each session's dataset individually (session_0 ... session_4)
    for i in range(num_sessions):
        session_id = f"session_{i}"
        session_examples = db.get_dataset(session_id)
        print(f"Database: get_dataset({session_id}) == {len(list(session_examples))}")
        # for sample in session_examples:
        #     print(sample)


@prodigy.recipe("get_dataset_sql")
def get_dataset_using_raw_sql():
    def query_str(dataset_name: str) -> str:
        return f"""
    SELECT 
        eg.id, eg.input_hash, eg.task_hash, eg.content
    FROM
        example AS eg
            INNER JOIN
        link AS link ON (eg.id = link.example_id)
            INNER JOIN
        dataset AS dataset ON (link.dataset_id = dataset.id)
    WHERE
        (dataset.name = '{dataset_name}');
    """

    db: prodigy.components.db.Database = prodigy.components.db.connect()
    make_dataset(db)
    # Get all examples
    cursor = db.db.execute_sql(query_str(dataset))
    all_examples = cursor.fetchall()
    print(f"SQL: get_dataset({dataset}) = {len(list(all_examples))}")
    # for ex in all_examples:
    #     print(ex)

    # Show each session's dataset individually (session_0 ... session_4)
    for i in range(num_sessions):
        session_id = f"session_{i}"
        cursor = db.db.execute_sql(query_str(session_id))
        session_examples = cursor.fetchall()
        print(f"SQL: get_dataset({session_id}) == {len(list(session_examples))}")
        # for sample in session_examples:
        #     print(sample)


def make_example(session_id=None, i=5):
    obj = {
        "text": f"Example {i}",
        "label": "LABEL",
        "answer": "accept",
    }
    if session_id is not None:
        obj[prodigy.util.SESSION_ID_ATTR] = session_id
    return prodigy.util.set_hashes(obj)


def make_dataset(db: prodigy.components.db.Database):
    """Make a disposable test dataset in the prodigy database"""
    if dataset in db:
        db.drop_dataset(dataset)
    assert db.get_dataset(dataset) is None

    # Insert a batch of examples without a session ID
    non_session_examples = [make_example(None, i) for i in range(examples_per_session)]
    db.add_examples(non_session_examples, [dataset])

    # Insert (n) batches of examples, with different session IDs
    for i in range(num_sessions):
        session_id = f"session_{i}"
        session_examples = [
            make_example(session_id, i) for i in range(examples_per_session)
        ]
        db.add_examples(session_examples, [dataset, session_id])

The first recipe queries for the examples using the Prodigy Database class, and you can run it like this:

$ prodigy get_dataset_api -F recipe.py
Database: get_dataset(test-dataset) = 60
Database: get_dataset(session_0) == 10
Database: get_dataset(session_1) == 10
Database: get_dataset(session_2) == 10
Database: get_dataset(session_3) == 10
Database: get_dataset(session_4) == 10

The second recipe queries for the examples using raw SQL statements, and you can run it like this:

$ prodigy get_dataset_sql -F recipe.py
SQL: get_dataset(test-dataset) = 60
SQL: get_dataset(session_0) == 10
SQL: get_dataset(session_1) == 10
SQL: get_dataset(session_2) == 10
SQL: get_dataset(session_3) == 10
SQL: get_dataset(session_4) == 10

Hopefully you can confirm the same results I see here, and maybe it helps tell us what is different in your setup. I also wonder: what version of Prodigy are you using? Thanks!
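If you're not sure which version you're on, prodigy stats will print it on the command line. Equivalently, as a quick Python check (a sketch assuming the version string is exposed in prodigy.about, as in Explosion's other packages):

import prodigy.about

# Assumption: the installed package stores its version string in prodigy.about.
print(prodigy.about.__version__)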