ner.train on data not annotated by spaCy?

Hi,
I have a large set of data that I’ve annotated using a regular Python script. The annotations are about 90% correct. I was hoping to use Prodigy to make corrections, then use the corrected data to train a spaCy model. Basically, I want ner.manual with the annotations pre-populated, so I can either accept them or change them. Is there a way to do this?

Yes, that makes sense and should hopefully be pretty easy to do. The ner.manual recipe respects pre-defined entity spans, so if you convert your data into Prodigy’s JSONL format and load it in, you’ll be able to review the labelled entities and correct them. Here’s the expected format of an individual NER annotation task:

{
    "text": "Apple updates its analytics service with new metrics",
    "spans": [
        {"start": 0, "end": 5, "label": "ORG"}
    ]
}

Each entity is a dictionary that defines the start and end of the span (as character offsets into the text), as well as a label. You can find more details in the “Annotation task formats” section in your PRODIGY_README.html.
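For example, if the output of your existing script is a list of texts with character offsets and labels, converting it to this format could look roughly like the following. This is just a sketch: the examples list and its (text, entities) layout are hypothetical stand-ins for whatever your script actually produces.

import json

# hypothetical output of your annotation script: (text, [(start, end, label), ...]) pairs
examples = [
    ("Apple updates its analytics service with new metrics", [(0, 5, "ORG")]),
]

with open("data.jsonl", "w", encoding="utf8") as f:
    for text, entities in examples:
        spans = [{"start": start, "end": end, "label": label} for start, end, label in entities]
        f.write(json.dumps({"text": text, "spans": spans}) + "\n")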

The only bit that’s special here is that Prodigy pre-tokenizes the text for the manual interface to allow token-based selection. This lets you annotate more quickly, because the selection can “snap” to the token boundaries. The ner.manual recipe will usually take care of this for you, and it will also try to align all existing spans to the tokens. In most cases this should work fine, unless your texts are difficult to tokenize (e.g. lots of unusual punctuation, missing spaces etc.). So if you do come across examples where Prodigy fails to align your existing spans with the tokens, you can always provide a “tokens” property manually:

{
    "text": "Hello Apple",
    "tokens": [
        {"text": "Hello", "start": 0, "end": 5, "id": 0},
        {"text": "Apple", "start": 6, "end": 11, "id": 1}
    ],
    "spans": [
        {"start": 6, "end": 11, "label": "ORG", "token_start": 1, "token_end": 1}
    ]
}
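If you ever need to create that “tokens” property yourself, you can generate it with the same spaCy model you pass to the recipe. Here’s a rough sketch based on the example above; token.idx is the character offset of the token and token.i is its index in the document:

import spacy

nlp = spacy.load("en_core_web_sm")
eg = {"text": "Hello Apple", "spans": [{"start": 6, "end": 11, "label": "ORG"}]}

# make_doc only tokenizes the text – no model predictions needed
doc = nlp.make_doc(eg["text"])
eg["tokens"] = [
    {"text": token.text, "start": token.idx, "end": token.idx + len(token.text), "id": token.i}
    for token in doc
]

The “token_start” and “token_end” values on the spans can be filled in the same way, by matching each span’s character offsets against this token list.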

When you run the recipe, you can pass in the name of the dataset to save the annotations to, a spaCy model (used for tokenization only), the path to your converted data and your label set (a comma-separated list, or the path to a text file with one label per line):

prodigy ner.manual your_dataset en_core_web_sm /path/to/data.jsonl --label PERSON,ORG

Hi, thank you for your answer. I got it working, but now the tokenization issue is a big problem… I can do a few annotations, but it fails when it encounters consecutive punctuation (like "Inc., ").
Is there any way to disable the pre-tokenizer feature, or to put a try/except statement around that part of the code, so that it can just skip the ones that don’t work?
By the way, I made an annotation tool a couple of years ago and made it ‘snap’ by putting each word in a jQuery Selectable element, but I know there’s a lot more going on here…

Yes, you can set skip=True on the add_tokens preprocessor in the recipe – like this:

from prodigy.components.preprocess import add_tokens
stream = add_tokens(nlp, stream, skip=True)

Ah yes, that's actually very similar to how it works in the web app, except that Prodigy uses the browser's native selection API. But the underlying problem is still determining what counts as a word, and that's where the tokenization comes in. Since Prodigy already uses spaCy, we can tokenize the text using the rules of the specific language, which is usually much more accurate than just splitting on whitespace.

If your goal is to train a model from your annotations, making sure the tokenization matches your entity spans is pretty important. Even if the model learns that the tokens ['Something', 'Inc.'] are a company, it won't be able to apply that knowledge if it never comes across those tokens and only ever sees ['Something', 'Inc.,']. You can always work around this by adjusting the model's tokenization or by preprocessing your text accordingly. But knowing that this is a potential risk, and which examples in your data are problematic, is often very helpful.
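One way to find those problematic examples up front is to check whether each span maps cleanly onto spaCy's tokenization: Doc.char_span returns None for character offsets that don't line up with token boundaries. Just a sketch, assuming your data is in the JSONL format shown above:

import json
import spacy

nlp = spacy.load("en_core_web_sm")

with open("data.jsonl", encoding="utf8") as f:
    for line in f:
        eg = json.loads(line)
        doc = nlp.make_doc(eg["text"])
        for span in eg.get("spans", []):
            # char_span returns None if the offsets don't match token boundaries
            if doc.char_span(span["start"], span["end"]) is None:
                print("Misaligned span:", repr(eg["text"][span["start"]:span["end"]]))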

Btw, spaCy's tokenization rules, including custom ones, are saved out with the model when you serialize it. So if you do want to adjust the tokenizer and you've found a configuration that works well on your data, you can use to_disk to save the model to a directory:

import spacy

nlp = spacy.load('en_core_web_sm')
# modify the tokenizer here...
nlp.to_disk('/path/to/model')
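For instance, if it really is the "Inc.," pattern that trips up the tokenizer, one possible adjustment would be a special-case rule. This is purely a sketch, and the exact rules will depend on what you find in your data:

import spacy
from spacy.symbols import ORTH

nlp = spacy.load('en_core_web_sm')
# hypothetical rule: always split the string "Inc.," into the tokens "Inc." and ","
nlp.tokenizer.add_special_case("Inc.,", [{ORTH: "Inc."}, {ORTH: ","}])
nlp.to_disk('/path/to/model')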

Prodigy accepts both model names and directories, so you can simply pass in the directory path when you call the recipe:

prodigy ner.manual your_dataset /path/to/model your_data.jsonl --label SOME_LABEL