Prodigy shows examples already in DB when feed_overlap=True and using a named session

geniki · July 1, 2020, 7:38am

1. Annotated some examples using a custom recipe in v1.9 using a named session
2. Upgraded to v1.10
3. Restarted Prodigy server and opened the same named session to continue annotating. The first example I see is one that is already in the database. Confirmed that hashes of what is in the DB and what's being generated by the recipe are exactly the same.
4. If I edit the recipe to set feed_overlap=False and restart Prodigy, examples continue from where they should, i.e. no duplication with 1.

It's as if Prodigy doesn't recognise that it's the same named session, but there might be a different explanation of course. Any thoughts?
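
The hash comparison mentioned above can be approximated offline. Below is a minimal sketch, assuming the dataset has been exported to JSONL (for example with `prodigy db-out`) and that each example carries the `_input_hash` and `_task_hash` keys Prodigy stores. The function names and sample data are hypothetical, and this is not Prodigy's own code.

```python
import json


def load_jsonl(text):
    """Parse JSONL text into a list of dicts, skipping blank lines."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


def find_already_annotated(stream, annotated, key="_input_hash"):
    """Return the stream examples whose hash already appears in the annotations."""
    seen = {eg[key] for eg in annotated if key in eg}
    return [eg for eg in stream if eg.get(key) in seen]


# Hypothetical export of the dataset (normally read from a file).
exported = "\n".join([
    json.dumps({"text": "first", "_input_hash": 111, "_task_hash": 901, "answer": "accept"}),
    json.dumps({"text": "second", "_input_hash": 222, "_task_hash": 902, "answer": "reject"}),
])

# Hypothetical examples as the recipe would emit them after hashing.
stream = [
    {"text": "first", "_input_hash": 111, "_task_hash": 901},
    {"text": "second", "_input_hash": 222, "_task_hash": 902},
    {"text": "third", "_input_hash": 333, "_task_hash": 903},
]

duplicates = find_already_annotated(stream, load_jsonl(exported))
print([eg["text"] for eg in duplicates])
```

If the list is non-empty for examples the UI is still showing, the hashes match and the problem lies in how the feed filters them, not in how they are computed.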

Thank you.

justindujardin · July 2, 2020, 5:09am

Hi @geniki,

Sorry that you're seeing unexpected behavior. I'll try to reproduce it tomorrow morning.

If you have time before then, can you try exporting the 1.9 dataset, re-importing it, and continuing from there to see if the items still show up as duplicates? Can you also confirm the type of annotations that you're doing (e.g. Textcat, NER, Image) so I can try to get as close to your custom recipe as possible?

As a first guess, can you try explicitly including a DatasetFilter with your custom recipe? This shouldn't be needed because the overlapping feed has the same functionality, but just in case:

from prodigy.components.feeds import DatasetFilter
from prodigy import recipe
from prodigy.components import db
from prodigy.components.loaders import JSONL


@recipe(
    "custom-recipe",
    dataset=("Dataset to save answers to", "positional", None, str),
    source=("The source data as a JSON file", "positional", None, str),
)
def custom_recipe(dataset: str, source: str):
    # Connect to the configured Prodigy database
    database = db.connect()
    return {
        "dataset": dataset,
        "view_id": "classification",
        "stream": list(JSONL(source)),
        "config": {
            "feed_overlap": True,
            # Explicitly filter out examples already saved in this dataset
            "feed_filters": [DatasetFilter(database, [dataset])],
        },
        "db": database,
    }

Thanks

justindujardin · July 2, 2020, 9:26pm

I think you're on to something here. Can you confirm that you're using exclude_by='input'? I was able to reproduce a problem where existing items weren't filtered using exclude_by='input' with feed_overlap=True.
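
To illustrate what the `exclude_by` setting changes, here is a minimal sketch, assuming the filter simply drops incoming examples whose hash is already in the dataset. `filter_stream` and the sample hashes are hypothetical and only model the idea; they are not Prodigy's implementation.

```python
def filter_stream(stream, seen_input, seen_task, exclude_by="task"):
    """Yield only examples whose chosen hash has not been seen before."""
    if exclude_by == "input":
        key, seen = "_input_hash", seen_input
    else:
        key, seen = "_task_hash", seen_task
    for eg in stream:
        if eg[key] not in seen:
            yield eg


# Hashes already stored in the dataset (hypothetical values).
seen_input = {1}
seen_task = {10}

stream = [
    # Same input as a stored example, but posed as a different task.
    {"text": "a", "_input_hash": 1, "_task_hash": 11},
    # Entirely new input.
    {"text": "b", "_input_hash": 2, "_task_hash": 20},
]

by_task = list(filter_stream(stream, seen_input, seen_task, exclude_by="task"))
by_input = list(filter_stream(stream, seen_input, seen_task, exclude_by="input"))

print([eg["text"] for eg in by_task])   # both examples pass
print([eg["text"] for eg in by_input])  # only the new input passes
```

With `exclude_by="input"`, any example whose input was annotated before should be dropped regardless of the task hash, which is the filtering that was not happening in the reproduced bug.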

I have a fix for the issue that I found. Would you be willing to try a beta version to verify it works for you? If so, either respond here or send me a PM and I'll give you the link.

geniki · July 2, 2020, 10:08pm

@justindujardin thanks very much for looking into this.

Yes, I'm using exclude_by="input". Another observation, if relevant, is that step 3 doesn't start from the beginning of the dataset, i.e. it's not showing me the same examples as when I create a brand new dataset.

I couldn't figure out how to send you a PM so if you could initiate it that would be great.

justindujardin · July 3, 2020, 3:24am

I sent you a message, but if you didn't receive it, you can email me at justin@explosion.ai
