Hi, I'm just trying out the Prodigy nightly -- I have a dataset with over 1000 samples. However, when I run Prodigy, it runs out of data after 10 samples. If I refresh my browser, I get another 10, and then it runs out again, so I have to keep refreshing. Is this expected behavior?
Hi! Which nightly version are you using, and could you share your prodigy.json configuration? Are you annotating into an existing dataset (i.e. adding more annotations to a dataset you previously created) or a new one?
I have the same issue, and I'm running version 1.11.0a4. The only configuration in my prodigy.json is setting the SQLite database. I am annotating into a new dataset.
Same issue when using
Can you try setting the environment variable PRODIGY_LEGACY=1? This should work around whatever might be causing this.
Hello, I'm new here. I received the nightly release a few days ago and have separate virtual environments set up for the stable version, the nightly release I originally received (a4), and the one I got last night (a5).
I was running into this issue as well while trying to run ner.teach on multi-word spans identified with the help of sense2vec.teach. I tried the following command in each of the three builds (stable, a4, a5):
prodigy ner.teach ner_lbl_dataset en_core_web_lg /path/to/source.txt --label LBL --patterns lbl_patterns.jsonl
My source text is several thousand examples long. In nightly a4, I found it stalled out at 10 annotations, while the stable build and a5 were not limited. Hopefully this is useful for someone!
My prodigy.json file appears to be empty when I
Glad it worked, and yes, I just realised I forgot to update this thread! We added a workaround in the latest nightly that falls back to the previous feeds behaviour and should resolve this problem.
Hi, I was able to avoid this problem for a while by setting PRODIGY_LEGACY=1, but now it's coming back. I'm using the nightly prodigy-1.11.0a5 now, and I get the issue with or without the variable set. I'm using spaCy installed from the GitHub source. My annotation data is extremely long now, over 100k examples, so maybe that is why? I'm also pre-populating the annotations with a model before annotating.
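For reference, my pre-populating step looks roughly like this (a sketch -- the predict function and label are stand-ins for my actual model, and all names here are illustrative):

```python
# Sketch of pre-populating tasks with model predictions before annotation.
# `predict` is a dummy stand-in for the real model.

def predict(text):
    # Pretend every occurrence of "Prodigy" is a PRODUCT entity.
    spans = []
    start = text.find("Prodigy")
    if start != -1:
        spans.append({"start": start, "end": start + len("Prodigy"), "label": "PRODUCT"})
    return spans

def prepopulate(stream, predict):
    for eg in stream:
        eg["spans"] = predict(eg["text"])
        # Examples with no predicted spans are kept, so model mistakes
        # (missed entities) can still be corrected during annotation.
        yield eg

stream = [{"text": "Prodigy is a tool."}, {"text": "No entities here."}]
tasks = list(prepopulate(stream, predict))
```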
By extremely long, do you mean the input data or the individual examples? The number of incoming examples shouldn't matter because the stream is only consumed in batches.
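To illustrate the point about batches: the stream is a generator that's only advanced one batch at a time, so its total length never has to be materialised. A minimal sketch (this is an illustration, not Prodigy's actual feed code):

```python
from itertools import islice

def stream():
    # A very long incoming stream; nothing is computed until it's consumed.
    for i in range(100_000):
        yield {"text": f"example {i}"}

def get_batch(gen, batch_size=10):
    # Pull just one batch off the generator, roughly what happens per request.
    return list(islice(gen, batch_size))

s = stream()
first = get_batch(s)   # the first 10 examples
second = get_batch(s)  # the next 10; the remaining 99,980 are untouched
```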
Do you have a custom recipe that pre-populates the annotations? If so, is there any case where your stream might yield None values, or does the model take very long to process the texts? Also, is there any place where you're duplicating examples (e.g. to send out multiple suggestions)? And when you look at the hashes of the outgoing examples, do you see any duplicates there?
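If it helps, one minimal way to guard a stream against None values and exact duplicates looks like the sketch below. The text-only hash here is a simplified stand-in for Prodigy's own task hashing:

```python
import hashlib

def task_hash(eg):
    # Simplified hash on the text alone (a stand-in for Prodigy's hashing,
    # which also takes spans, labels etc. into account).
    return hashlib.md5(eg["text"].encode("utf-8")).hexdigest()

def clean_stream(stream):
    seen = set()
    for eg in stream:
        if eg is None:   # a None yielded by the recipe would otherwise break the feed
            continue
        h = task_hash(eg)
        if h in seen:    # drop exact duplicate tasks
            continue
        seen.add(h)
        yield eg

raw = [{"text": "a"}, None, {"text": "a"}, {"text": "b"}]
cleaned = list(clean_stream(raw))
```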
I see. I also turned the stream into a list and sorted it -- that's the only way I can get the examples to appear in order when there's more than one entity per annotation -- so start-up takes a little while, but after that it's fine.
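Concretely, the sorting step is along these lines (the sort key is illustrative -- sorting on the text groups all suggestions for the same sentence together, and Python's stable sort keeps their relative order):

```python
# Materialise and sort the stream so all suggestions for the same text
# come out consecutively instead of interleaved.

def order_stream(stream):
    return sorted(stream, key=lambda eg: eg["text"])

raw = [
    {"text": "second sentence", "spans": [{"start": 0, "end": 6}]},
    {"text": "first sentence", "spans": [{"start": 0, "end": 5}]},
    {"text": "second sentence", "spans": [{"start": 7, "end": 15}]},
]
ordered = order_stream(raw)
```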
Yeah, I do pre-populate, and I include examples that don't have entities -- I think this makes sense in case the pre-populating model makes a mistake. But I'm hitting this issue regularly, about once every 10 annotations, so I don't think that's the only cause. There are no duplicates; I have de-duplicated everything.