There are a few options available in Prodigy for this situation.
You can choose to write a custom recipe that uses a custom mechanism to fetch examples. Since custom recipes are just Python code, nothing is stopping you from writing a generator that fetches data from a database table that receives updates. That way, you won't need to restart the server.
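For illustration, here's a minimal sketch of that idea, assuming the examples live in a SQLite table called examples with id and text columns (the recipe name, table, and schema here are all made up; adapt them to your setup):

import sqlite3
import time

import prodigy
from prodigy.util import set_hashes

@prodigy.recipe("textcat.from-db")
def textcat_from_db(dataset, db_path):
    def stream():
        seen = set()
        while True:
            # Poll the (hypothetical) examples table for rows we haven't served yet
            conn = sqlite3.connect(db_path)
            rows = conn.execute("SELECT id, text FROM examples").fetchall()
            conn.close()
            for row_id, text in rows:
                if row_id not in seen:
                    seen.add(row_id)
                    yield set_hashes({"text": text})
            # Sleep briefly before polling again, so new rows show up without a restart
            time.sleep(5)

    return {"dataset": dataset, "stream": stream(), "view_id": "text"}

You'd then run it like any other custom recipe, e.g. prodigy textcat.from-db my-annotations my.db -F recipe.py.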
Alternatively, you can also point to a dataset that's been loaded into the Prodigy database instead of pointing to an examples.jsonl file. Theoretically, that means you could do something like this:
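(The exact command isn't preserved here; a reconstruction, with the recipe, output dataset, and label as placeholders:)

python -m prodigy ner.manual annotations en_core_web_sm dataset:dataset-to-annotate --label PERSON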
Notice the dataset: prefix there? It points to a dataset in the database. That also means that while somebody is annotating, you can run db-in to append more examples to dataset-to-annotate.
Technically, this would also avoid a restart, but I imagine there is an edge case where an annotator may reach the end of their queue before new data is added.
Main Advice
Both of these techniques should work, but I haven't tried them extensively, so there may be some edge cases that I can't come up with right now.
If I had to advise: when you're dealing with a data source that you want to update in a custom way, a custom recipe is probably the best path forward. It requires a small investment, but writing a bespoke Python solution usually pays off.
Just to check: is there a reason why a restart via CRON during the evening won't work for you?
Thanks for the very prompt response, @koaning, as usual! The Prodigy support forum is one of the greatest things about this tool so far!
I was trying to make the dataset: idea work, but what I found through some tests is that something like:
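(the exact command isn't preserved; presumably a db-in call along these lines, with placeholder names)

python -m prodigy db-in dataset-to-annotate new_examples.jsonl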
does not work: it adds the examples with "answer": "accept" already set, so instead of serving the new texts to annotators, it records them as already annotated. Can you advise?
I think it's a slightly different use case. I want to be able to add datapoints in an ad-hoc way to an annotation task living on a server. I am able to run db-in from my local machine, but if this doesn't update the dataset, I'll have to SSH in, stop the server, and re-run, or something similar, which doesn't sound very practical.
If none of the above works, yes, I'd invest in a custom recipe, but I wanted to exhaust all the "classical" pathways before going into that.
Ah, good point. I hadn't foreseen that. This sounds to me like an issue that we should fix. It makes sense to me that a user might want to upload a dataset to the database for annotation, in which case there should be no answer flag. I'll add a ticket internally to discuss this.
As an alternative for now, I think you should still be able to use the Python API instead. Something like this might work?
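(A reconstructed sketch; the file and dataset names are placeholders, and the calls match the script in the next post. It assumes the target dataset already exists.)

import srsly
from prodigy.components.db import connect
from prodigy.util import set_hashes

# Hash the new examples and append them to an existing dataset
examples = [set_hashes(eg) for eg in srsly.read_jsonl("new_examples.jsonl")]
db = connect()
db.add_examples(examples, ["dataset-to-annotate"])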
That was promising, but it STILL doesn't work as intended. Full script here:
from prodigy.components.db import connect
from prodigy.util import set_hashes
import srsly

# Read the raw examples and give each one Prodigy's input/task hashes
examples = srsly.read_jsonl("test.jsonl")
hashed_examples = [set_hashes(eg) for eg in examples]

# Create the dataset and add the hashed examples to it
db = connect()
db.add_dataset("test")
db.add_examples(hashed_examples, ["test"])
print(f"Added {len(hashed_examples)} examples to 'test' dataset")
And then, when running:

prodigy rel.manual test en_core_web_sm dataset:test --label TEST

I seem to get "No tasks available". The examples seem to have been added as "already annotated".
I think essentially what I'm getting at is that there seems to be no obvious way to make the dataset: idea work, because we can't import examples and then have Prodigy stream them from the dataset. Is that intended, or should that behaviour change?
I worry that something else might be up. I'll show what I've done to try and replicate this issue.
I start out with this examples.jsonl file.
{"text": "hello my name is james"}
{"text": "hello my name is john"}
{"text": "hello my name is robert"}
{"text": "hello my name is michael"}
{"text": "hello my name is william"}
{"text": "hello my name is mary"}
{"text": "hello my name is david"}
{"text": "hello my name is richard"}
{"text": "hello my name is joseph"}
I use db-in to load it.
python -m prodigy db-in issue-6876 examples.jsonl
This dataset is now in the Prodigy database, and I can confirm via db-out that it also has hashes and an "answer": "accept" key.
python -m prodigy db-out issue-6876
This yields:
{"text":"hello my name is james","_input_hash":-1294982232,"_task_hash":465224705,"answer":"accept"}
{"text":"hello my name is john","_input_hash":-1282353592,"_task_hash":980664449,"answer":"accept"}
{"text":"hello my name is robert","_input_hash":803732484,"_task_hash":-658149367,"answer":"accept"}
{"text":"hello my name is michael","_input_hash":-817151764,"_task_hash":-831281946,"answer":"accept"}
{"text":"hello my name is william","_input_hash":-1774976813,"_task_hash":-349743654,"answer":"accept"}
{"text":"hello my name is mary","_input_hash":-1806564888,"_task_hash":910542438,"answer":"accept"}
{"text":"hello my name is david","_input_hash":-630231151,"_task_hash":-1900462764,"answer":"accept"}
{"text":"hello my name is richard","_input_hash":-277245364,"_task_hash":-1834276944,"answer":"accept"}
{"text":"hello my name is joseph","_input_hash":467632156,"_task_hash":-1782316920,"answer":"accept"}
But here's the thing: I seem to be able to annotate this data just fine.
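(The annotation command isn't shown here; it would be something along these lines, with a placeholder output dataset and label:)

python -m prodigy ner.manual issue-6876-annotations en_core_web_sm dataset:issue-6876 --label PERSON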
So, on my side at least, the accept key doesn't block the annotations. Can you confirm this on your end with this demo dataset as well? I'm curious if perhaps there's something else happening on your machine. Do you have extra settings in a prodigy.json file?
Is it possible that a dataset called "test" already existed before you ran your script?
Interesting. Yes, with your example, I can still annotate it again (which is somewhat disconcerting; probably not the intended behaviour for the accept key?). I can also confirm that ner.manual and rel.manual work! No idea what's going on.
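(The command in question isn't preserved; from the description below it was something like this, with the same dataset name used twice:)

prodigy rel.manual dataset-to-be-annotated en_core_web_sm dataset:dataset-to-be-annotated --label TEST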
Note how the dataset is repeated. If I replace the first one with current-dataset, then it all works.
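(In other words, with the same placeholder names as above:)

prodigy rel.manual current-dataset en_core_web_sm dataset:dataset-to-be-annotated --label TEST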
The quirk is that db-in STILL imports the examples as accepted inside dataset-to-be-annotated; however, they are available to annotate in current-dataset, which is quite interesting. I wonder if the --answer parameter in db-in should allow me to set no answer (--answer null or something) to avoid confusion.