Sure, that's possible! Many of our built-in recipes and examples use spaCy and we hope they make it easy to get started. But the tool itself is definitely designed to function independently of spaCy. You can always stream in data in Prodigy's JSON format, or write a custom recipe without an update callback that only presents you with the tasks and stores your annotations in the database (which you can then export as JSONL). You can find more details and examples in the "Annotation task formats" section of your PRODIGY_README.html, available for download with Prodigy.
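For example, a recipe along these lines only needs a dataset and a JSONL file and doesn't touch spaCy at all. This is just a minimal sketch, the recipe name, arguments and file are placeholders:

import prodigy
from prodigy.components.loaders import JSONL

@prodigy.recipe("ner.manual-no-model")
def ner_manual_no_model(dataset, source, label):
    # Stream in tasks from a JSONL file: each line is a task in Prodigy's
    # JSON format and should already include a "tokens" property (see below),
    # since no spaCy model is available to tokenize the text.
    stream = JSONL(source)
    return {
        "dataset": dataset,                      # dataset to save annotations to
        "stream": stream,                        # iterable of task dicts
        "view_id": "ner_manual",                 # manual NER interface
        "config": {"labels": label.split(",")},  # labels shown in the UI
    }

Assuming you saved this as recipe.py, you'd run it with something like prodigy ner.manual-no-model your_dataset your_data.jsonl PERSON,ORG -F recipe.py.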
One thing that's important to keep in mind about the manual NER interface: in order to make annotation more efficient and "snap" your selection to the word boundaries, the text is pre-tokenized. Out of the box, you can use a spaCy model for this, or plug in your own solution. The annotation tasks that are sent to the web app when you use ner.manual then look like this:
{
    "text": "Hello Apple",
    "tokens": [
        {"text": "Hello", "start": 0, "end": 5, "id": 0},
        {"text": "Apple", "start": 6, "end": 11, "id": 1}
    ],
    "spans": [
        {"start": 6, "end": 11, "label": "ORG", "token_start": 1, "token_end": 1}
    ]
}
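If you don't want to use a spaCy model for the tokenization, you can generate the "tokens" yourself. Here's a rough sketch using a plain whitespace split, just to illustrate the expected offsets; in practice you'd plug in whatever tokenizer matches your data:

def add_whitespace_tokens(task):
    # Split on single spaces and record character offsets, so the selection
    # in the UI can snap to these token boundaries.
    tokens = []
    offset = 0
    for i, word in enumerate(task["text"].split(" ")):
        tokens.append({"text": word, "start": offset, "end": offset + len(word), "id": i})
        offset += len(word) + 1  # +1 for the space we split on
    task["tokens"] = tokens
    return task

print(add_whitespace_tokens({"text": "Hello Apple"}))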
If you have already annotated text, all you need to do is convert it to Prodigy's JSONL format. The ner.manual recipe will respect pre-defined "spans" and display them when you load in your text. It will even assign a "tokens" property to each annotation task and try to resolve the "spans" to the respective token indices. This usually works very well.
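The conversion itself is usually only a few lines of Python. Here's a hedged example that assumes your existing annotations are stored as text plus character-offset entities; adjust the input structure to whatever format you actually have:

import json

existing_annotations = [
    {"text": "Hello Apple", "entities": [(6, 11, "ORG")]},
]

with open("your_converted_data.jsonl", "w", encoding="utf8") as f:
    for record in existing_annotations:
        task = {
            "text": record["text"],
            # ner.manual will resolve these character offsets to token indices
            "spans": [
                {"start": start, "end": end, "label": label}
                for start, end, label in record["entities"]
            ],
        }
        f.write(json.dumps(task) + "\n")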
You can then run the ner.manual recipe and stream in your pre-annotated data, or a mix of pre-annotated and raw data:
prodigy ner.manual your_dataset en_core_web_sm /your_converted_data.jsonl --label PERSON,ORG
If you want to review or evaluate the pre-annotated examples inside Prodigy, a small custom recipe can be useful here, too.
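Here's a minimal sketch of what such a recipe could look like, assuming you just want to accept or reject each example with its pre-highlighted spans (the recipe name and details are placeholders, not a built-in recipe):

import prodigy
from prodigy.components.loaders import JSONL

@prodigy.recipe("ner.review-spans")
def review_spans(dataset, source):
    # Stream in tasks that already contain "text" and pre-defined "spans".
    stream = JSONL(source)
    return {
        "dataset": dataset,   # accepted/rejected examples are stored here
        "stream": stream,
        "view_id": "ner",     # binary interface showing the highlighted spans
    }

The built-in mark recipe with --view-id ner should give you something very similar without writing any code.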
Alternatively, if you want to do your evaluation outside of Prodigy (which is totally reasonable, too), you can always use the db-out command to export a dataset to a JSONL file and then convert that to any custom format.
prodigy db-out your_dataset > your_data_file.jsonl
This is also a nice solution if you're opinionated about how you want to format and read in your training, development and test data.
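As a sketch of what that conversion could look like, here's an example that reads the exported file and turns the accepted examples into simple (text, entities) tuples with character offsets; the file name and the exact output format are just assumptions about your setup:

import json

examples = []
with open("your_data_file.jsonl", encoding="utf8") as f:
    for line in f:
        task = json.loads(line)
        if task.get("answer") != "accept":
            continue  # skip rejected and ignored examples
        entities = [(span["start"], span["end"], span["label"]) for span in task.get("spans", [])]
        examples.append((task["text"], entities))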