Adding new data to be annotated without re-starting the server

Please let me know if my understanding is completely flawed. At the moment, you can add a JSONL file to be annotated for (say) NER by running:

prodigy ner.manual my_dataset en_core_web_sm my_file.jsonl --label my_labels

which will then run the server.

Whereas you can add annotated data to the dataset my_dataset using the db-in command. My question is the following:

Suppose that I now want to add new data to my dataset. Is the only way to do it to "bring the server down" and then reload it with

prodigy ner.manual my_dataset en_core_web_sm my_new_file.jsonl --label my_labels?

Is there no way to add text with no labels to be annotated through db-in or any other recipe?

There are a couple of options available in Prodigy for this situation.

  1. You can write a custom recipe that uses a custom mechanism to fetch examples (see the sketch under "Main Advice" below). Since custom recipes are just Python code, nothing is stopping you from writing a generator that fetches data from a database table that receives updates. This way, you won't need to restart the server.
  2. Alternatively, you may also point to a dataset that's been loaded in the Prodigy database instead of pointing to an examples.jsonl file. Theoretically that means you could also do something like this:
python -m prodigy db-in dataset-to-annotate examples.jsonl
python -m prodigy ner.manual dataset-with-annotations en_core_web_sm dataset:dataset-to-annotate --label mylabels

Notice the dataset: prefix there? It points to a dataset in the database. That also means that while somebody is annotating, you can run db-in to append more examples to dataset-to-annotate.

python -m prodigy db-in dataset-to-annotate moar-examples.jsonl

Technically, this would also avoid a restart, but I imagine there is an edge case where an annotator may reach the end of their queue before new data is added.

Main Advice

Both of these techniques should work, but I've never tried them extensively, so there may be some edge cases that I can't come up with right now.

If I had to advise: when you're dealing with a data source where you want to update the examples in a custom way, a custom recipe is probably the best path forward. It will require a small investment, but writing a bespoke Python solution usually pays off.
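To make that concrete, here's a minimal sketch of what such a recipe could look like. The recipe name ner.manual.live, the polling interval, and the idea of re-reading a JSONL file are all placeholders for illustration; in practice you'd swap the inner loop for whatever source you want to poll (a database table, an API, etc.).

import time

import prodigy
import spacy
import srsly
from prodigy.components.preprocess import add_tokens
from prodigy.util import set_hashes

@prodigy.recipe(
    "ner.manual.live",  # hypothetical recipe name
    dataset=("Dataset to save annotations to", "positional", None, str),
    spacy_model=("Loadable spaCy pipeline", "positional", None, str),
    source=("JSONL file to keep polling for new examples", "positional", None, str),
    label=("Comma-separated label(s)", "option", "l", str),
)
def ner_manual_live(dataset, spacy_model, source, label=""):
    nlp = spacy.load(spacy_model)

    def stream():
        seen = set()
        while True:
            # Re-read the source on every pass and only yield tasks
            # we haven't sent out before.
            for eg in srsly.read_jsonl(source):
                eg = set_hashes(eg)
                if eg["_task_hash"] not in seen:
                    seen.add(eg["_task_hash"])
                    yield eg
            time.sleep(10)  # wait a bit before checking for new data again

    return {
        "dataset": dataset,
        "view_id": "ner_manual",
        "stream": add_tokens(nlp, stream()),
        "config": {"labels": label.split(",") if label else []},
    }

You'd then start the server with something like python -m prodigy ner.manual.live my_dataset en_core_web_sm my_file.jsonl --label my_labels -F recipe.py, and new lines appended to the file would eventually show up in the queue without a restart.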

Just to check: is there a reason why a restart via CRON during the evening won't work for you?

Thanks for the very prompt response, @koaning, as usual! :slight_smile: The Prodigy support forum is one of the greatest things about this tool so far!

I was trying to make the dataset: idea work, but what I found through some tests is that the db-in approach above does not work as expected: it adds the examples with "answer": "accept" already set, so it doesn't distribute the new texts to annotators; it records them as already annotated. Can you advise?

I think it's a slightly different use case. I want to be able to add datapoints in an ad-hoc way to an annotation task living on a server. I am able to run db-in from my local machine, but if this doesn't update the dataset, I'll have to SSH in, stop the server and re-run it, or something similar, which doesn't sound very useful.

If none of the above works, yes, I'd invest in a custom recipe, but I wanted to exhaust all the "classical" pathways before going into that.

Ah, good point. I hadn't foreseen that. This sounds to me like an issue that we should fix. It makes sense to me that a user might want to upload a dataset to the database for annotation, and then there should be no answer flag. I will add a ticket internally to discuss this.

As an alternative for now, I think you should still be able to use the Python API instead. Something like this might work?

from prodigy.components.db import connect
import srsly

db = connect()  # connect to the Prodigy database
examples = list(srsly.read_jsonl("path/to/file.jsonl"))
db.add_dataset("uploaded-dataset-name")
db.add_examples(examples, ["uploaded-dataset-name"])

You could use this in a script that generates candidates that are interesting to annotate. Would that work?
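For example (the dataset and label names below are just placeholders), after running such a script you could point a recipe at the uploaded dataset via the dataset: prefix:

python -m prodigy ner.manual annotations en_core_web_sm dataset:uploaded-dataset-name --label my_labels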

I will have a try and report back!

That was promising, but it STILL doesn't work as intended. Full script here:

from prodigy.components.db import connect
from prodigy.util import set_hashes
import srsly

examples = srsly.read_jsonl("test.jsonl")

db = connect()
db.add_dataset("test")

hashed_examples = [set_hashes(eg) for eg in list(examples)]

db.add_examples(hashed_examples, ["test"])

print(f"Added {len(hashed_examples)} examples to 'test' dataset")

And then when running prodigy rel.manual test en_core_web_sm dataset:test --label TEST,
I seem to get "No tasks available". The examples seem to have been added as "already annotated".


I think essentially what I am getting at is that there seems to be no obvious way to make the dataset: idea work, because we can't import examples and then have Prodigy stream from that dataset. Is that intended, or should that behaviour change?

I worry that something else might be up. I'll show what I've done to try and replicate this issue.

I start out with this examples.jsonl file.

{"text": "hello my name is james"}
{"text": "hello my name is john"}
{"text": "hello my name is robert"}
{"text": "hello my name is michael"}
{"text": "hello my name is william"}
{"text": "hello my name is mary"}
{"text": "hello my name is david"}
{"text": "hello my name is richard"}
{"text": "hello my name is joseph"}

I use db-in to load it.

python -m prodigy db-in issue-6876 examples.jsonl

This dataset is now in the Prodigy database, and I can confirm via db-out that it also has hashes and an "answer": "accept" key.

python -m prodigy db-out issue-6876

This yields:

{"text":"hello my name is james","_input_hash":-1294982232,"_task_hash":465224705,"answer":"accept"}
{"text":"hello my name is john","_input_hash":-1282353592,"_task_hash":980664449,"answer":"accept"}
{"text":"hello my name is robert","_input_hash":803732484,"_task_hash":-658149367,"answer":"accept"}
{"text":"hello my name is michael","_input_hash":-817151764,"_task_hash":-831281946,"answer":"accept"}
{"text":"hello my name is william","_input_hash":-1774976813,"_task_hash":-349743654,"answer":"accept"}
{"text":"hello my name is mary","_input_hash":-1806564888,"_task_hash":910542438,"answer":"accept"}
{"text":"hello my name is david","_input_hash":-630231151,"_task_hash":-1900462764,"answer":"accept"}
{"text":"hello my name is richard","_input_hash":-277245364,"_task_hash":-1834276944,"answer":"accept"}
{"text":"hello my name is joseph","_input_hash":467632156,"_task_hash":-1782316920,"answer":"accept"}

But here's the thing, I seem to be able to annotate this data just fine.

python -m prodigy textcat.manual xxx dataset:issue-6876 --label foo,bar,baz

This starts a server with the annotation interface, and the examples show up as tasks just fine.

So, on my side at least, the accept key doesn't block the annotations. Can you confirm this on your end as well with this demo dataset? I'm curious if perhaps there's something else happening on your machine. Do you have extra settings in a prodigy.json file?

Is it possible that a dataset called "test" already existed before you ran your script?

I'll have a look and report back!

Interesting. Yes, with your example, I can still annotate it again (which is somewhat disconcerting; probably not the intended behaviour for the accept key?). I can also confirm that ner.manual and rel.manual also work! No idea what's going on.

Ok! I discovered what was happening in my example! When I was running the server I was doing:

prodigy db-in dataset-to-be-annotated examples.jsonl
prodigy rel.manual dataset-to-be-annotated en_core_web_sm dataset:dataset-to-be-annotated --label LABELS --span-label SPAN_LABELS

Note how the dataset name is repeated. If I replace the first one (the output dataset in rel.manual) with current-dataset, then it all works.
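So, spelled out with the same placeholder label names, the working setup looks something like:

prodigy db-in dataset-to-be-annotated examples.jsonl
prodigy rel.manual current-dataset en_core_web_sm dataset:dataset-to-be-annotated --label LABELS --span-label SPAN_LABELS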

The quirk is that db-in STILL imports the examples as accepted inside dataset-to-be-annotated; however, they are available to annotate into current-dataset, which is quite interesting. I wonder if the --answer parameter in db-in should allow me to set no answer (--answer null or something) to avoid confusion.
