How do I use prodigy as a purely annotation tool with no underlying SpaCy model?


I am new to both Prodigy and spaCy. I want to use Prodigy purely to create annotations (train/test data) quickly and efficiently, and then export the data as JSON, without any underlying spaCy model or active learning.

Right now, I am running the ner.manual recipe to create and save the annotations, and then exporting the data in a format usable with spaCy. However, I can’t quite figure out how to view and correct already-annotated text. Is there an existing recipe for this purpose, and if not, can someone help me get started on one of my own? I would also be grateful for pointers on how to evaluate existing spaCy models on annotated NER data without using that data to train the model. I just want the performance scores of a spaCy model on test data stored in .jsonl files.

Sure, that's possible! Many of our built-in recipes and examples use spaCy and we hope that they make it easy to get started. But the tool itself is definitely designed to function independently of spaCy. You can always stream in data in Prodigy's JSON format or write a custom recipe without any update callbacks that only presents you with the tasks, and stores your data in the database (which you can then export as JSONL). You can find more details and examples in the "Annotation task formats" section in your PRODIGY_README.html, available for download with Prodigy.
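Since the input format is just newline-delimited JSON, getting existing text into a loadable stream takes only a few lines of standard Python. As a minimal sketch (the function name and file path here are illustrative, not part of Prodigy's API):

```python
import json

def write_prodigy_jsonl(texts, path):
    """Write raw texts as Prodigy-style JSONL tasks: one JSON object per line."""
    with open(path, "w", encoding="utf-8") as f:
        for text in texts:
            task = {"text": text}  # a minimal task only needs a "text" key
            f.write(json.dumps(task) + "\n")

write_prodigy_jsonl(["Hello Apple", "Hello Google"], "raw_tasks.jsonl")
```

You can then pass the resulting file as the source argument of a recipe.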

One thing that's important to keep in mind about the manual NER interface: in order to make annotation more efficient and "snap" your selection to the token boundaries, the text is pre-tokenized. Out of the box, you can use a spaCy model for this – or plug in your own solution. The annotation tasks that are sent to the web app when you use ner.manual then look like this:

    "text": "Hello Apple",
    "tokens": [
        {"text": "Hello", "start": 0, "end": 5, "id": 0},
        {"text": "Apple", "start": 6, "end": 11, "id": 1}
    "spans": [
        {"start": 6, "end": 11, "label": "ORG", "token_start": 1, "token_end": 1}

If you have already annotated text, all you need to do is convert it to Prodigy's JSONL format. The ner.manual recipe will respect pre-defined "spans" and display them when you load in your text. It will even assign a "tokens" property to each annotation task and try to resolve the "spans" to the respective token indices. This usually works very well.
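To illustrate the arithmetic behind that resolution, here's a minimal sketch using a plain whitespace tokenizer (Prodigy itself uses a spaCy model or your own tokenizer; these helper names are made up for the example):

```python
def whitespace_tokens(text):
    """Produce Prodigy-style token dicts with character offsets and ids."""
    tokens, offset = [], 0
    for i, word in enumerate(text.split()):
        start = text.index(word, offset)
        end = start + len(word)
        tokens.append({"text": word, "start": start, "end": end, "id": i})
        offset = end
    return tokens

def resolve_span(span, tokens):
    """Attach token_start/token_end indices to a character-offset span.

    Raises KeyError if the span doesn't align with token boundaries.
    """
    starts = {t["start"]: t["id"] for t in tokens}
    ends = {t["end"]: t["id"] for t in tokens}
    resolved = dict(span)
    resolved["token_start"] = starts[span["start"]]
    resolved["token_end"] = ends[span["end"]]
    return resolved

tokens = whitespace_tokens("Hello Apple")
span = resolve_span({"start": 6, "end": 11, "label": "ORG"}, tokens)
# span now carries token_start=1 and token_end=1, matching the example above
```

If a pre-defined span doesn't line up with the token boundaries, it can't be resolved – which is usually a sign of a tokenization mismatch in your source data.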

You can then run the ner.manual recipe and stream in your pre-annotated data, or a mix of pre-annotated and raw data:

prodigy ner.manual your_dataset en_core_web_sm /your_converted_data.jsonl --label PERSON,ORG


Alternatively, if you want to do your evaluation outside of Prodigy (which is totally reasonable, too), you can always use the db-out command to export a dataset to a JSONL file and then convert that to any custom format.

prodigy db-out your_dataset > your_data_file.jsonl

This is also a nice solution if you're opinionated about how you want to format and read in your training, development and test data.