Prodigy shows examples already in DB when feed_overlap=True and using a named session

  1. Annotated some examples with a custom recipe in v1.9, using a named session.
  2. Upgraded to v1.10.
  3. Restarted the Prodigy server and opened the same named session to continue annotating. The first example I see is one that is already in the database. I confirmed that the hashes of what is in the DB and what is being generated by the recipe are exactly the same.
  4. If I edit the recipe to set feed_overlap=False and restart Prodigy, examples continue from where they should, i.e. there is no duplication with the examples annotated in step 1.
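
For anyone wanting to repeat the hash check from step 3, here is a rough sketch in plain Python. It assumes the dataset has been exported to JSONL and that every example carries the `_input_hash` and `_task_hash` keys; the helper names `load_jsonl_lines` and `overlapping_hashes` and the sample data are made up for illustration and are not part of Prodigy.

```python
import json
from typing import Dict, Iterable, List, Set


def load_jsonl_lines(lines: Iterable[str]) -> List[Dict]:
    """Parse JSONL text lines into dicts, skipping blank lines."""
    return [json.loads(line) for line in lines if line.strip()]


def overlapping_hashes(
    saved: Iterable[Dict], stream: Iterable[Dict], key: str = "_task_hash"
) -> Set[int]:
    """Return the hashes (under `key`) that occur in both saved and stream."""
    saved_hashes = {eg[key] for eg in saved if key in eg}
    return {eg[key] for eg in stream if key in eg and eg[key] in saved_hashes}


# Hypothetical stand-ins for an exported dataset and the recipe's stream.
exported = load_jsonl_lines([
    '{"text": "a", "_input_hash": 1, "_task_hash": 11, "answer": "accept"}',
    '{"text": "b", "_input_hash": 2, "_task_hash": 22, "answer": "reject"}',
])
incoming = [
    {"text": "b", "_input_hash": 2, "_task_hash": 22},
    {"text": "c", "_input_hash": 3, "_task_hash": 33},
]

input_overlap = overlapping_hashes(exported, incoming, key="_input_hash")
task_overlap = overlapping_hashes(exported, incoming, key="_task_hash")
print("shared input hashes:", input_overlap)
print("shared task hashes:", task_overlap)
```

A non-empty overlap means the stream contains examples that are already saved, so a working filter should have skipped them.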

It's as if Prodigy doesn't recognise that it's the same named session, but there might be a different explanation of course. Any thoughts?

Thank you.

Hi @geniki,

Sorry that you're seeing unexpected behavior. I'll try reproducing it tomorrow morning.

If you have time before then, can you try exporting the 1.9 dataset, re-importing it, and then continuing from there to see if the items still show up as duplicates? Can you also confirm the type of annotations that you're doing (e.g. Textcat, NER, Image) so I can try to get as close to your custom recipe as possible?

As a first guess, can you try explicitly including a DatasetFilter with your custom recipe? This shouldn't be needed because the overlapping feed has the same functionality, but just in case:

from prodigy import recipe
from prodigy.components import db
from prodigy.components.feeds import DatasetFilter
from prodigy.components.loaders import JSONL


@recipe(
    "custom-recipe",
    dataset=("Dataset to save answers to", "positional", None, str),
    source=("The source data as a JSON file", "positional", None, str),
)
def custom_recipe(dataset: str, source: str):
    # Connect to the database so the filter can look up existing annotations
    database = db.connect()
    return {
        "dataset": dataset,
        "view_id": "classification",
        "stream": list(JSONL(source)),
        "config": {
            "feed_overlap": True,
            # Explicitly filter out examples that are already in the dataset
            "feed_filters": [DatasetFilter(database, [dataset])],
        },
        "db": database,
    }

Thanks

I think you're on to something here. Can you confirm that you're using exclude_by='input'? I was able to reproduce a problem where existing items weren't filtered using exclude_by='input' with feed_overlap=True.
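
To make the exclude_by distinction concrete, here is a rough sketch of what the filtering amounts to. This is plain Python and not Prodigy's actual implementation: `filter_seen` is a hypothetical helper, and the sketch only assumes that examples carry `_input_hash` and `_task_hash` keys.

```python
from typing import Dict, Iterable, Iterator, List


def filter_seen(
    stream: Iterable[Dict], saved: Iterable[Dict], exclude_by: str = "task"
) -> Iterator[Dict]:
    """Yield only the stream examples whose hash is not among the saved ones.

    exclude_by="input" compares `_input_hash`, exclude_by="task" compares
    `_task_hash`.
    """
    if exclude_by not in ("task", "input"):
        raise ValueError("exclude_by must be 'task' or 'input'")
    key = f"_{exclude_by}_hash"
    seen = {eg[key] for eg in saved}
    for eg in stream:
        if eg[key] not in seen:
            yield eg


# Hypothetical data: one saved annotation and two incoming examples. The
# first incoming example shares its input with the saved one but is a
# different task.
saved_examples: List[Dict] = [{"text": "a", "_input_hash": 1, "_task_hash": 11}]
incoming_examples: List[Dict] = [
    {"text": "a", "_input_hash": 1, "_task_hash": 12},
    {"text": "b", "_input_hash": 2, "_task_hash": 21},
]

by_input = list(filter_seen(incoming_examples, saved_examples, exclude_by="input"))
by_task = list(filter_seen(incoming_examples, saved_examples, exclude_by="task"))
print([eg["text"] for eg in by_input])
print([eg["text"] for eg in by_task])
```

With exclude_by='input', an example whose input was already annotated is dropped even if the task hash differs, which is the filtering that should keep saved items from reappearing.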

I have a fix for the issue that I found, would you be willing to try a beta version to verify it works for you? If so, either respond here or send me a PM and I'll give you the link. :bowing_man:

@justindujardin thanks very much for looking into this.

Yes, I'm using exclude_by="input". Another observation, if relevant: in step 3, the feed doesn't start from the beginning of the dataset, i.e. it's not showing me the same examples as when I create a brand new dataset.

I couldn't figure out how to send you a PM so if you could initiate it that would be great.

I sent you a message, but if you didn't receive it, you can email me at justin@explosion.ai