Custom recipe w/o model

Hello,

I’ve been trying to create custom recipes for collecting single- and multi-label annotations for NER without a spaCy model (i.e. without a model in the loop), but I’ve been running into some issues. I haven’t been able to find examples of annotations being gathered without a model, which is making this difficult. It seems like it should be possible, but I’m not sure what the config arguments should be in this case (especially concerning labels) or how the data stream should be formatted.

For example, given the code for ner.manual (from which I’ve removed the arguments pertaining to the spaCy model), how would I format config and stream? I have config as {'labels': label} and stream as [{'text': 'text'}…], but I’m getting an “Oops something went wrong” page when I try to run it.

Thanks for your help.

Hi! I think the solution might actually be easier than you think.

One thing that’s important to note here is that ner.manual doesn’t actually update a model in the loop – it only uses the spaCy model for tokenization. Pre-tokenizing the text allows you to annotate faster because the highlighted selection can “snap” to the token boundaries. So if you run the ner.manual recipe out-of-the-box, it will stream in the text so you can annotate it manually and save the annotations to your dataset (which you can then use however you like).

If you don’t want to use spaCy for tokenization, you can also implement your own logic. The manual NER interface expects each task to have an additional "tokens" property, where every token has its text, its start and end character offsets, and an id. You can find more details on this in the “Annotation task formats” section of your PRODIGY_README.html. Here’s a simple example:

{
    "text": "Hello Apple",
    "tokens": [
        {"text": "Hello", "start": 0, "end": 5, "id": 0},
        {"text": "Apple", "start": 6, "end": 11, "id": 1}
    ]
}
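
In case it helps, here is a rough sketch of what such custom tokenization logic could look like. It assumes that splitting on whitespace is good enough for your data, which often isn't true (punctuation stays attached to words, for instance). The names whitespace_tokenize and add_tokens are made up for this illustration and are not part of Prodigy.

```python
import re


def whitespace_tokenize(text):
    """Split text on whitespace and record character offsets per token.

    Hypothetical helper, not a Prodigy function. Each token dict has the
    keys shown in the example above: "text", "start", "end" and "id".
    """
    tokens = []
    for i, match in enumerate(re.finditer(r"\S+", text)):
        tokens.append({
            "text": match.group(),
            "start": match.start(),  # offset of the first character
            "end": match.end(),      # offset just past the last character
            "id": i,
        })
    return tokens


def add_tokens(stream):
    """Yield a copy of each incoming task with a "tokens" list added."""
    for eg in stream:
        task = dict(eg)  # shallow copy so the input dict is left untouched
        task["tokens"] = whitespace_tokenize(task["text"])
        yield task


examples = [{"text": "Hello Apple"}]
tasks = list(add_tokens(examples))
```

Since add_tokens is a generator, it could wrap whatever iterable of {"text": ...} dicts you already have before you hand it over as the stream.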

I ended up implementing my own logic for tokenization, and it worked. Thanks for the quick and thorough response!
