Hi, I'm just trying out the Prodigy nightly -- I have a dataset with over 1000 data samples. However, when I run Prodigy, it runs out of data after 10 samples. If I refresh my browser, I get another 10, then it runs out again and I have to refresh the browser again. Is this expected behavior?
Hi! Which nightly version are you using, and could you share your prodigy.json configuration? Are you annotating with an existing dataset (i.e. adding more annotations to a dataset you previously created) or a new dataset?
I have the same issue and I am running version 1.11.0a4. The only configuration in my prodigy.json is to set the SQLite database. I am annotating with a new dataset.
Same issue when using ner.teach
Can you try setting the environment variable PRODIGY_LEGACY=1? This should work around whatever might be causing this.
Hello, new on here. I received the nightly release a few days ago and have separate virtual environments set up for the stable version, the nightly release I originally received (a4) and the one I got last night (a5).
I was running into this issue as well while trying to run ner.teach on multi-word spans identified with the help of sense2vec.teach. I tried the following command in each of the three builds (stable, a4, a5):
prodigy ner.teach ner_lbl_dataset en_core_web_lg /path/to/source.txt --label LBL --patterns lbl_patterns.jsonl
My source text is several thousand examples long. In nightly a4, I found it stalled out at 10 annotations, while the stable build and a5 were not limited. Hopefully this is useful for someone!
My prodigy.json file appears to be empty when I cat it.
Glad it worked and yes, I just realised I forgot to update this thread! We added a workaround in the latest nightly that falls back to the previous feeds behaviour and should resolve this problem
Hi, I was able to avoid this problem for a while by using the PRODIGY_LEGACY=1 flag, but now it is coming back up again. I'm using the nightly prodigy-1.11.0a5 now, and with or without the flag, I am getting this issue. I'm using spaCy installed from the GitHub source. My annotation data is extremely long now, over 100k, so maybe that is why? I'm also prepopulating the annotation with a model before annotating.
By extremely long, do you mean the input data or the individual examples? The number of incoming examples shouldn't matter because the stream is only consumed in batches.
Do you have a custom recipe that pre-populates the annotations? If so, is there any case where your stream might yield None values, or does the model take very long to process the texts? Also, is there any place where you're duplicating examples (e.g. to send out multiple suggestions)? And when you look at the hashes of the outgoing examples, do you see any duplicates there?
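For anyone who wants to check the two things asked about above (None values and duplicate hashes), here is a minimal sketch. It assumes the outgoing examples are plain dicts that already carry a `_task_hash` value; the helper name `inspect_stream` and the toy examples are made up for illustration and are not part of Prodigy.

```python
from collections import Counter


def inspect_stream(stream, hash_key="_task_hash"):
    """Count None items, items without a hash, and hash values seen more than once.

    Hypothetical helper: `stream` is any iterable of example dicts (or None).
    """
    none_count = 0
    missing_hash = 0
    hashes = Counter()
    for eg in stream:
        if eg is None:
            none_count += 1
            continue
        value = eg.get(hash_key)
        if value is None:
            missing_hash += 1
        else:
            hashes[value] += 1
    # Keep only the hash values that occur more than once
    duplicates = {h: n for h, n in hashes.items() if n > 1}
    return {"none": none_count, "missing_hash": missing_hash, "duplicates": duplicates}


# Toy data standing in for the outgoing examples of a custom recipe
examples = [
    {"text": "first", "_task_hash": 1},
    None,
    {"text": "second", "_task_hash": 2},
    {"text": "first", "_task_hash": 1},
]
report = inspect_stream(examples)
print(report)
```

Note that iterating over a generator consumes it, so it is safer to run a check like this on a list copy of the stream (or on a separate debugging run) rather than on the stream that is handed to the server.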
I see. I also turned the stream into a list and sorted it -- that's the only way I can get the stream to appear in order when there's more than one entity per annotation -- so it takes a little while on startup, but then it is fine.
Yea I do prepopulate, and I include streams that don't have entities -- I think this makes sense in case the prepopulating model makes a mistake. But I'm having this issue quite regularly at once every 10 annotations, so I don't feel like this is the only issue. There are no duplicates, I have de-duplicated everything.
I'm also seeing this with prodigy-1.11.0a5. I'm running:
$ prodigy rel.manual ner_exp_restr_dep en_core_web_lg ./output.jsonl \
--label HAS_COSTS,IN_YEAR \
--span-label EXP_RESTR,MONEY,DATE \
--add-ents \
--wrap
My data file:
$ cat output.jsonl | wc -l
3517
$ head -3 output.jsonl
{"text": "Due to the high fixed cost structure associated with the Retail segment, a decline in sales or the closure or poor performance of individual or multiple stores could result in significant lease termination costs, write-offs of equipment and leasehold improvements, and severance costs.", "spans": [{"start": 269, "end": 284, "label": "EXP_RESTR"}]}
{"text": "The Company regularly reviews its investment portfolio to determine if any security is other-than-temporarily impaired, which would require the Company to record an impairment charge in the period any such determination is made.", "spans": [{"start": 165, "end": 182, "label": "EXP_RESTR"}]}
{"text": "Other income and expense also could vary materially from expectations depending on gains or losses realized on the sale or exchange of financial instruments; impairment charges resulting from revaluations of debt and equity securities and other investments; interest rates; cash balances; and changes in fair value of derivative instruments.", "spans": [{"start": 158, "end": 176, "label": "EXP_RESTR"}]}
When I go to localhost:8080 only 10 annotations are available.
@jamiehannaford Thanks for the details, I was able to reproduce this and I think I found the problem. Just building a new nightly that should fix the issue
Thanks @ines that seems to have fixed it
Great to hear it, thanks for reporting back!