I'm using the mark recipe to add NER annotations to a dataset of 1000 documents. After just 10 documents, I get "No tasks available." and I cannot do anything other than close the application. When I restart, all my annotations are gone (even though I clicked the save button), and I'm shown the same 10 documents.
I would like to be able to view and annotate the entire dataset; how can I do that?
Hi! Could you share the contents of your prodigy.json? If you're on v1.10.8 and are not using named multi-user sessions, make sure that you don't have feed_overlap set in there.
Exporting doesn't work (I'm using python -m prodigy db-out ); it just outputs an empty file. I can't build a new dataset now because it's very time-consuming.
I created a new DB with the same dataset and then ran the mark recipe. Prodigy shows the correct number of documents in the dataset (1000), but it shows "No tasks available" as soon as I open it.
Where do you see the number of documents in the dataset? Normally, Prodigy would read it in as a stream so it wouldn't necessarily have that number upfront.
And just to confirm, you didn't upload your raw data to the dataset, right? Because you shouldn't have to do that, and it'd cause examples that are already in the dataset to be skipped (because they're already saved), so you'd see nothing left to annotate anymore.
I don't understand what you mean by this sentence "you didn't upload your raw data to the dataset".
The data I'm trying to open in Prodigy is a jsonl file that's already tokenized and annotated, but I want to fix the annotations and add more. The tokenization and annotation happened in Python, using a custom model trained in spaCy, and the data was then saved to the jsonl file.
I think Prodigy shouldn't try to make any decisions on what to show me; if I load a dataset with 1000 documents, it should just show 1000 documents. If it has some "saved" examples, it should show them to me as well. Hiding everything from the user is not a good design imho.
Is there a way to find a solution faster? Like a call or something? This is taking a bit too long.
Sorry if this was confusing: What I meant was that Prodigy doesn't expect you to pre-upload anything – you can just start annotating with a source file. The datasets are only intended to contain annotated examples. So if you upload your unannotated data to a dataset and then use that to annotate, Prodigy will skip them by default because they're already present in the dataset.
If you're annotating a given source, you'd typically want to skip all examples that you've already annotated so when you restart the server with the same source, you don't get asked the same question twice or end up with duplicates. You can disable this by setting "auto_exclude_current": false in your config, but that's usually not what you want because then you end up with duplicates that you have to reconcile later.
So I think the simple answer here is: don't upload anything beforehand and just start annotating. Or, if you want to annotate the same data twice, use separate datasets so you can track what you've already worked on on a per-dataset basis.
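To illustrate why a pre-filled dataset leads to "No tasks available", here is a simplified sketch of the skipping behaviour described above. It is not Prodigy's actual implementation: the task_hash and filter_stream helpers are hypothetical, and hashing the text plus spans is only an assumption made for the illustration.

```python
import hashlib
import json


def task_hash(task):
    # Hypothetical stand-in for a task hash: identical text + spans
    # produce an identical hash.
    payload = json.dumps(
        {"text": task["text"], "spans": task.get("spans", [])},
        sort_keys=True,
    )
    return hashlib.md5(payload.encode("utf8")).hexdigest()


def filter_stream(stream, dataset, auto_exclude_current=True):
    # Skip incoming examples whose hash is already present in the dataset,
    # unless the exclusion is switched off.
    seen = {task_hash(eg) for eg in dataset} if auto_exclude_current else set()
    for eg in stream:
        if task_hash(eg) not in seen:
            yield eg


source = [{"text": f"Document {i}"} for i in range(5)]

# Case 1: the raw source was imported into the dataset beforehand,
# so every incoming example counts as already annotated.
preloaded = list(source)
remaining = list(filter_stream(source, preloaded))

# Case 2: the dataset starts out empty, so everything is queued.
fresh = list(filter_stream(source, []))

# Case 3: exclusion disabled, so everything is queued again (duplicates).
disabled = list(filter_stream(source, preloaded, auto_exclude_current=False))

print(len(remaining), len(fresh), len(disabled))
```

With the source pre-imported, nothing is left in the queue, which matches the symptom reported earlier in this thread.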
There's an approach that might help if you want to fix the annotations and add more. Using the jsonl file that is already tokenized and annotated, train a model first. Then create a new dataset and use the ner.correct recipe with that model on the documents. This way you can review the annotations you made before and fix whatever you like.
Thank you very much for the suggestion! Unfortunately I can't do this, because the version of Prodigy I'm using cannot load the model I have, because the model has a transformer component. I do appreciate the suggestion though, thanks!
Sadly I cannot use your suggested solution. I would like to bootstrap the annotations, meaning I want to use the current version of the model to annotate more documents, then train a new iteration of the model with the additional documents, so I cannot start with an unannotated dataset (if that's what you mean by "don't upload anything beforehand").
What I'm trying to do is really simple and basic: load a jsonl file (tokenized and annotated) and go through the documents one by one to fix the annotations and add more annotations. Then export the annotated dataset so I can use it to train a new version of the model.
I do appreciate your help and I thank you very much, but I think I'll give up on Prodigy and I'll try to find a different way to do the annotations.
Hi! I'm not really sure where the confusion happened – sorry about that! What you describe is a super classic Prodigy workflow that'll just work out-of-the-box. The only thing to keep in mind is that you won't have to pre-upload anything – you can just start annotating with the pre-labelled data.
So in your case, you'd load the JSONL file as the input file and start with a blank dataset. For example:
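A minimal sketch of this setup, under stated assumptions: the dataset name ner_corrected, the file name preannotated.jsonl, the example sentence and the labels ORG and GPE are all hypothetical placeholders. The script writes one pre-annotated example in the "text" plus "spans" JSONL shape and prints the ner.manual command you would then run in a terminal.

```python
import json
import tempfile
from pathlib import Path

# Hypothetical pre-annotated example: each span holds character offsets
# into "text" plus a label.
examples = [
    {
        "text": "Apple opened an office in Berlin.",
        "spans": [
            {"start": 0, "end": 5, "label": "ORG"},
            {"start": 26, "end": 32, "label": "GPE"},
        ],
    }
]

# Write the source file, one JSON object per line.
source_path = Path(tempfile.mkdtemp()) / "preannotated.jsonl"
with source_path.open("w", encoding="utf8") as f:
    for eg in examples:
        f.write(json.dumps(eg) + "\n")

# Command to run in a terminal: recipe, new (empty) dataset, base model
# for tokenization, source file, labels. Nothing is imported beforehand.
command = [
    "python", "-m", "prodigy", "ner.manual",
    "ner_corrected",      # hypothetical dataset name
    "blank:en",
    str(source_path),
    "--label", "ORG,GPE",  # hypothetical label set
]
print(" ".join(command))
```

The dataset name only needs to be one that doesn't already contain the raw source; the pre-existing spans come from the input file itself.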
As you annotate, the corrected annotations will be saved to the dataset. If you need to restart the server, you'll be able to start where you left off. When you're done annotating, you can export the dataset and use it for training.
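As a rough sketch of the export step, the snippet below reads lines shaped like a JSONL export and keeps only accepted examples. The two sample records and the to_training_data helper are hypothetical, and the simple (text, entities) output is just one possible intermediate form; converting it to whatever your training setup expects is left out.

```python
import json

# Hypothetical lines as they might appear in an exported JSONL file;
# the "answer" key records the decision made in the annotation UI.
exported_lines = [
    json.dumps({
        "text": "Apple opened an office in Berlin.",
        "spans": [
            {"start": 0, "end": 5, "label": "ORG"},
            {"start": 26, "end": 32, "label": "GPE"},
        ],
        "answer": "accept",
    }),
    json.dumps({"text": "Nothing to see here.", "spans": [], "answer": "reject"}),
]


def to_training_data(lines):
    # Keep accepted examples only and reduce each one to
    # (text, {"entities": [(start, end, label), ...]}).
    data = []
    for line in lines:
        eg = json.loads(line)
        if eg.get("answer") != "accept":
            continue
        entities = [(s["start"], s["end"], s["label"]) for s in eg.get("spans", [])]
        data.append((eg["text"], {"entities": entities}))
    return data


train_data = to_training_data(exported_lines)
print(train_data)
```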
Thank you very much for your latest suggestion; I managed to make it work. What I hadn't realized is that I can run a recipe with a blank:en model and still have the annotations saved in the source jsonl file.
I think the confusion was due to different ways of interpreting the terms; for example I assume by "upload" you mean "import to the sqlite db" (e.g. using prodigy db-in)? Also, I used the term "dataset" to simply refer to the jsonl file that contains my documents, while I believe you meant the contents of the DB?
It took me a bit of time to decipher the terms, but anyway I'm very glad it works. Thanks so much for your support; I truly appreciate it.
Glad it's working and sorry for the confusion! And yes, we typically use "dataset" to mean a "dataset of collected annotations in the database", and "source" or "input file" to mean the original input that you're looking to annotate.