I’m using several JSONL files to create a training corpus with ner.manual. I realised I needed another label to finish annotating the current file. I saved the annotations in my dataset, stopped the terminal and ran ner.manual again on the same source file. However, I can’t edit the annotations, as they were already saved. I tried to export the dataset, edit it manually and put it back in as the source, but I have trouble understanding the commands and the way to edit annotations.
There are start, end, token_start and token_end fields in the db-out output file. Where should I do the editing? If I add an annotation, should I also edit the positions of all the annotations that follow? How can I add annotations to an already saved file without re-annotating everything?
Another question: after using data-to-spacy to export the training corpus as a JSON file, how can we use that file to train an NER model in spaCy? Also, is there a way to reuse the POS tags from already existing models?
Hi! You can always load an existing dataset back in for annotation – just use the dataset: syntax instead of the source file (e.g. dataset:your_previous_dataset) and then save the results to a new dataset. You'll then see the existing annotations, and you'll be able to add new ones on top of that.
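For example, if you're scripting it from Python, a minimal sketch could look like this (the dataset names, labels and port are placeholders, and it assumes a Prodigy version whose prodigy.serve accepts the full command string – the same command also works directly on the CLI):

```python
import prodigy

# Re-annotate the examples stored in "your_previous_dataset" and save
# the edited results to a new dataset, "your_dataset_v2".
prodigy.serve(
    "ner.manual your_dataset_v2 blank:en dataset:your_previous_dataset "
    "--label PERSON,ORG,NEW_LABEL",
    port=8080,
)
```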
Alternatively, you can also create an entirely new dataset and only annotate the new labels. This can sometimes make sense if you're annotating one label at a time. Prodigy will merge all annotations on the same text when you train or export the data with data-to-spacy.
(In theory, you could also edit the JSON manually if you really want to – you can find an example of the format here: Annotation interfaces · Prodigy · An annotation tool for AI, Machine Learning & NLP. It's not necessarily something I'd recommend, but it does work – just add the spans with their start and end index and the given label.)
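For reference, here's roughly what one record of the db-out JSONL looks like, shown as a Python dict (the text and labels are made up). start/end are character offsets and token_start/token_end are token indices, so adding a new span doesn't require shifting the positions of the spans that follow – all offsets are absolute:

```python
# One db-out record: "start"/"end" are character offsets into "text",
# "token_start"/"token_end" are (inclusive) token indices.
example = {
    "text": "Apple hired John Smith.",
    "spans": [
        {"start": 0, "end": 5, "token_start": 0, "token_end": 0, "label": "ORG"},
        # To add an annotation, just append another span like this one:
        {"start": 12, "end": 22, "token_start": 2, "token_end": 3, "label": "PERSON"},
    ],
}
```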
If you mean the POS tags predicted by the model on your specific data: Yes, you can use spaCy to label the data for you automatically, using the existing model. You could also use an annotation recipe like pos.correct and manually review and correct the model's predictions on your data, and then train a POS tagger using the collected annotations.
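If you just want the automatic labels, a minimal sketch with an existing pretrained pipeline might look like this (the model name and text are just examples):

```python
import spacy

# Load an existing pretrained model and let its tagger label the data.
nlp = spacy.load("en_core_web_sm")
doc = nlp("Apple hired John Smith in London.")
for token in doc:
    print(token.text, token.pos_, token.tag_)
```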
Thanks for your help. I still have a few questions, though.
I followed your advice and used db-in to reuse my previous annotations in a new dataset.
If I try to re-annotate a previously annotated file, I still get the following message at http://localhost:8080: "No tasks available." It seems that we can't re-annotate the same file again, even after importing the annotations into a new dataset. Am I right?
Maybe I should specify that I'm working with very long texts and can't segment them into sentences, as sentence segmentation does not perform well on my data.
So, if I understood correctly, the idea is to create a new dataset for the file in question and only annotate the missing examples? All annotations will then be merged by data-to-spacy? There's no risk that the model will treat the unannotated parts of the text as "should not be annotated" when generalizing to new data?
In the general case, would you recommend using one dataset per file? I thought about annotating all the files in a single dataset, but in retrospect, it seems difficult to delete or edit annotations that way. If I ever want to correct or delete the annotations for a specific file, or stop considering it altogether, dealing with multiple datasets would help – I wouldn't risk losing any data. However, it means that in my case I'd have to deal with tens of datasets. Would that be appropriate?
You shouldn't have to re-import anything – you should be able to just point Prodigy to your existing dataset as the input source and save the results to a new dataset. If your current dataset already contains an annotation for a given example, Prodigy will skip it, so you won't be asked the same question twice – that's typically very useful, because it means you can restart annotation and resume where you left off. But if you're re-annotating your data, you usually want to save the results to a new dataset.
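Under the hood, this skipping works via hashes. A small sketch (the example text is made up):

```python
from prodigy import set_hashes

# Prodigy assigns every example an _input_hash (based on the input text)
# and a _task_hash (based on the text plus its annotations). On restart,
# tasks whose hash is already in the target dataset are skipped.
example = set_hashes({"text": "Apple is buying a U.K. startup."})
print(example["_input_hash"], example["_task_hash"])
```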
When Prodigy merges the datasets for training or export, it will use the _input_hash (generated from the input data, e.g. the raw text) to determine which annotations refer to the same text. It will then combine all "spans" into one training example. So if one dataset annotates all PERSON entities and another annotates all ORG entities in a text, you'll end up with one example per text containing annotations for both labels. All tokens that are then left unannotated are considered "outside an entity" by default (you can change that by setting --ner-missing and treating them as missing values instead, but you should only do that if you know you're working with sparse annotations).
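Conceptually, the merge works something like this – a simplified illustration, not the actual data-to-spacy implementation, with placeholder dataset names:

```python
from collections import defaultdict
from prodigy.components.db import connect

db = connect()

# Group accepted examples from both datasets by their _input_hash, so
# that annotations referring to the same text end up in one example.
merged = defaultdict(lambda: {"text": None, "spans": []})
for name in ("ner_person", "ner_org"):  # placeholder dataset names
    for eg in db.get_dataset(name):
        if eg.get("answer") != "accept":
            continue
        merged[eg["_input_hash"]]["text"] = eg["text"]
        merged[eg["_input_hash"]]["spans"].extend(eg.get("spans", []))

examples = list(merged.values())
```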
This depends on the project but yes, in general, we recommend using more fine-grained datasets if possible, because it gives you more flexibility. You can start over if you make a mistake, or try out different combinations of annotations, all without ever losing any data points. (That's also why Prodigy datasets are append-only by design: you should always be able to reconstruct every single data point, and an annotation process should never overwrite or lose any data, even if you create a newer version of a data point.)
Thank you for the detailed explanation, it's very clear.
Is there any way to reuse the built-in recipes (ner.manual, db-out, data-to-spacy, etc.) from Python instead of the command line? For automation purposes, it would be more convenient for me, especially if I create one dataset per file.
Yes, under the hood, they're all Python functions. If you want to interact with the database programmatically, there's also a user-facing database API that gives you more flexibility and lets you create datasets and fetch examples. See here for details: Database · Prodigy · An annotation tool for AI, Machine Learning & NLP.
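For example, a db-out-style export done from Python could look like this (the dataset name and output path are placeholders; srsly is the serialization library that ships with spaCy):

```python
import srsly
from prodigy.components.db import connect

db = connect()                    # uses the settings from your prodigy.json
print(db.datasets)                # names of all datasets in the database
examples = db.get_dataset("my_dataset")  # fetch the annotated examples

# Equivalent of db-out: write the examples to a JSONL file.
srsly.write_jsonl("./my_dataset_export.jsonl", examples)
```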