Loading message prodigy UI

New to Prodigy, so please bear with me!

So our process flow is;
1 - User loads documents for annotation - python script
2 - Documents get converted into text strings - python script
3 - Text strings get loaded into prodigy jsonl layout - python script which creates file text_to_annotate.jsonl
{"text": "some text"}
{"text": "Other text"}
4 - User creates name of dataset and creates dataset on postgres database
prodigy dataset new_dataset "testing 1 of 12" --author Me
5 - jsonl gets loaded into postgres database
prodigy db-in new_dataset text_to_annotate.jsonl
6 - User opens prodigy
prodigy ner.manual new_dataset model-final

Then what we see is just the loading message on the prodigy ui.

Tried loading the jsonl directly;
prodigy ner.manual new_dataset model-final text_to_annotate.jsonl

And get the no tasks available message.

I'm sure we are doing something fundamentally wrong, but can't figure out what!

Thanks in advance!

Hi! Your workflow looks good so far :+1: I think the main problem is here:

Prodigy doesn't require you to pre-upload anything to annotate. The dataset in the database is only used to store the collected annotations (and you can import already annotated data if you want to). So to start annotating, you'd typically use a blank dataset and then load the JSON file directly from its path.

In your case, the following things happened, which caused the confusion you describe:

  1. Prodigy allows you to leave out the source argument (specifying the input data) and will then read from standard input. This is useful if you want to pipe data forward from a previous process – or do something like cat file.jsonl | prodigy .... So you kept seeing the "Loading...", because Prodigy was waiting for something to be streamed in from standard input.
  2. By default, Prodigy will automatically skip incoming examples that are already in the database, so you're not asked to annotate the same question twice. That's typically very useful – but since you already uploaded the raw data to the dataset, Prodigy thinks "Yay, all done and annotated!" because the examples you load in are already in the dataset.
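To make the second point concrete, here is a minimal plain-Python sketch of the "skip what is already in the dataset" idea. It is only an illustration: `text_hash` and `filter_seen` are hypothetical names, and hashing the `"text"` field with SHA-1 is an assumption made for the example, not a description of how Prodigy computes its own hashes.

```python
import hashlib


def text_hash(example):
    # Hypothetical stand-in for Prodigy's own hashing: hash only the "text" field.
    return hashlib.sha1(example["text"].encode("utf-8")).hexdigest()


def filter_seen(stream, already_in_dataset):
    # Yield only the incoming examples whose hash is not already in the dataset.
    seen = {text_hash(eg) for eg in already_in_dataset}
    for eg in stream:
        if text_hash(eg) not in seen:
            yield eg


raw = [{"text": "some text"}, {"text": "Other text"}]

# Raw data was imported into the dataset first: every incoming example is skipped.
remaining_after_db_in = list(filter_seen(raw, already_in_dataset=raw))

# Fresh, empty dataset: every example is still waiting to be annotated.
remaining_fresh = list(filter_seen(raw, already_in_dataset=[]))
```

With the raw data already in the dataset nothing is left to annotate, which matches the "no tasks available" message described above.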

So as a solution, create a new, empty dataset and then rerun the command that loads the data from your JSONL file:

prodigy dataset fresh_dataset
prodigy ner.manual fresh_dataset model-final text_to_annotate.jsonl

AMAZING THANKS! So I've over-engineered the problem!!


So problem solved. Thanks once again.

What I am now struggling to understand is how the tasks are managed.

So I create a JSONL file, kick off Prodigy, and do some annotations while someone overwrites the JSON. Do the first JSON file's records sit in memory, so that the next time we kick off Prodigy it will refer to the new JSON? I just did this by mistake and noticed that Prodigy didn't crash, so I'm curious as to why (it's not a bad thing, but I need to understand why!)


This mostly depends on how Python reads in the files – JSONL files are typically read line by line, so it doesn't necessarily read the entire file into memory, just the next batch of lines it needs. If you load a JSON file, it typically loads and parses the entire file, so you have it all in memory.
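As a rough illustration of that difference, here is a plain-Python sketch (not Prodigy's actual loader code). The helper names `read_jsonl` and `read_json` are made up for this example, and the demo writes its own temporary file.

```python
import json
import os
import tempfile


def read_jsonl(path):
    # Generator: each line is parsed only when the consumer asks for the next item.
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                yield json.loads(line)


def read_json(path):
    # A regular JSON file is parsed in one go, so the whole content ends up in memory.
    with open(path, encoding="utf-8") as f:
        return json.load(f)


# Small demo file in a temporary directory.
tmp_dir = tempfile.mkdtemp()
jsonl_path = os.path.join(tmp_dir, "text_to_annotate.jsonl")
with open(jsonl_path, "w", encoding="utf-8") as f:
    f.write('{"text": "some text"}\n{"text": "Other text"}\n')

stream = read_jsonl(jsonl_path)  # nothing has been parsed yet
first = next(stream)             # parses only the first line
second = next(stream)            # parses the second line on demand
```

Because the generator keeps the file open and reads on demand, overwriting the file halfway through can give you a mix of old and new records, which is one reason to avoid it.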

That said, you should avoid a setup where the files you're loading in are overwritten while you're working with them (independent of Prodigy, I'd say this is true for most data science tasks). If you do want the stream of incoming examples to be more dynamic (e.g. load new records from an API), you can do this with a custom recipe or custom loader script. Prodigy streams are generators, so you can write arbitrary code, respond to outside state etc.
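A minimal sketch of such a generator, assuming a hypothetical `fetch_batch` callable that returns the next list of texts (and an empty list when there is nothing left). `fake_fetch` below only stands in for a real API call, and how the generator is wired into a custom recipe is not shown here.

```python
def dynamic_stream(fetch_batch):
    # fetch_batch is a hypothetical callable: returns a list of texts, or [] when done.
    while True:
        batch = fetch_batch()
        if not batch:
            break
        for text in batch:
            # Emit tasks in the same {"text": ...} layout as the JSONL file.
            yield {"text": text}


# Stand-in for an external source such as an API; purely illustrative.
_pending = [["some text", "Other text"], ["third text"]]


def fake_fetch():
    return _pending.pop(0) if _pending else []


tasks = list(dynamic_stream(fake_fetch))
```

The generator only asks for the next batch once the previous one has been consumed, so new records can show up while annotation is already running.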

Fantastic, thanks. It was a mistake I made while testing a custom loader script, I got confused!

One last ask this week! Is there a CLI command to stop Prodigy? We kick off Prodigy via an R Shiny front-end button click, but can't figure out how to stop it!

This is the code for starting Prodigy via a Python script.
def annotate(name):
    # Build the command that runs the annotator for the current dataset
    program_str = 'prodigy'
    func_to_run = 'ner.manual'
    dataset = name
    model = 'model-final'
    file_to_run = 'text_to_annotate.jsonl'
    cmd_str3 = ' '.join([program_str, func_to_run, dataset, model, file_to_run])

Ideally I would like to be able to stop it via a button as well.

You typically want to send a SIGTERM signal to terminate the process running on the port that Prodigy is using. Not sure how this would work in the shiny app, but here are some links and examples:
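One way this could look from the Python side is to keep a handle to the process you start and terminate it later. This is only a sketch: `start_process` and `stop_process` are hypothetical helper names, the 10-second timeout is an arbitrary choice, and the demo uses a harmless sleeping Python process instead of Prodigy.

```python
import subprocess
import sys


def start_process(args):
    # Start the command without blocking and keep the handle so it can be stopped later.
    return subprocess.Popen(args)


def stop_process(proc, timeout=10):
    # terminate() sends SIGTERM on POSIX (TerminateProcess on Windows).
    proc.terminate()
    try:
        proc.wait(timeout=timeout)
    except subprocess.TimeoutExpired:
        # Fall back to a hard kill if the process ignores the request.
        proc.kill()
        proc.wait()
    return proc.returncode


# Demo with a stand-in process. For Prodigy the argument list would be e.g.
# ["prodigy", "ner.manual", name, "model-final", "text_to_annotate.jsonl"].
demo = start_process([sys.executable, "-c", "import time; time.sleep(60)"])
exit_code = stop_process(demo)
```

Passing the command as a list avoids shell quoting issues, and the "stop" button would then only need to call `stop_process` on the stored handle.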

Doh, I'm always overcomplicating things! Done and very happy here... thanks!