Load pre-tagged entities ner.manual

Hello,
I am trying to create a method with which I can load words that I have already tagged with one of the labels passed to ner.manual, so that they appear already tagged, but I also want the ability to retag them if I need to correct them. For example, in my case I am trying to tag whether a word is a first name, a last name or another label. So what I want to do is, say, load the word Jack and have it appear with a first name tag from the get-go, but also have the ability to change its tag to last name. Is that possible?
Thanks in advance.

Hi,

It should be easy to do this with a custom recipe. Have a look at the ner.make-gold recipe, which should be in your site-packages (run python -c "import prodigy.recipes.ner; print(prodigy.recipes.ner.__file__)" if you’re not sure where to find it). You can read more about writing custom recipes here: https://prodi.gy/docs/workflow-custom-recipes

Basically, all you need to do is have your function return a dictionary with the entries "view_id": "ner_manual" and "stream": tagged_data. The tagged_data value should be a generator with your annotated examples. Each entry in the generator should be a dict with the text and the spans.
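As a rough sketch of what that could look like (the recipe name, labels and file path here are placeholders, not from Prodigy’s own recipes):

```python
import json

def load_tagged_data(file_path):
    """Yield pre-annotated examples, each a dict with "text" and "spans"."""
    with open(file_path, encoding="utf8") as f:
        for line in f:
            line = line.strip()
            if line:
                yield json.loads(line)

# With Prodigy installed, you'd register this with the @prodigy.recipe
# decorator; the components dict it returns is the part that matters here.
def ner_pretagged(dataset, source):
    return {
        "dataset": dataset,                  # dataset to save annotations to
        "view_id": "ner_manual",             # use the manual NER interface
        "stream": load_tagged_data(source),  # generator of pre-tagged examples
        "config": {"labels": ["FIRST_NAME", "LAST_NAME"]},  # placeholder labels
    }
```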

Hello and thanks for your time,
I’m not entirely sure how to implement that. Should I just use ner.make-gold and add an extra function that does what you mentioned?
Thanks again.

I think @honnibal might have misread your question – if I understand it correctly, the answer is much simpler. ner.manual respects pre-defined entities, so all you have to do is feed in data in Prodigy’s format. For example, your input data could look like this:

{"text": "Duckyyy is a name", "spans": [{"start": 0, "end": 7, "label": "NAME"}]}

You can then run ner.manual with your pre-tagged data and a list of labels you want to use:

prodigy ner.manual your_dataset en_core_web_sm your_data.jsonl --label NAME,PERSON,ORG
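If your pre-tagged words aren’t in that format yet, a small helper (hypothetical, just to illustrate how the character offsets are computed) could produce the JSONL lines:

```python
import json

def make_example(text, tagged):
    """Build a Prodigy-style example from (substring, label) pairs,
    using the first occurrence of each substring to compute offsets."""
    spans = []
    for substring, label in tagged:
        start = text.index(substring)
        spans.append({"start": start, "end": start + len(substring), "label": label})
    return {"text": text, "spans": spans}

example = make_example("Duckyyy is a name", [("Duckyyy", "NAME")])
line = json.dumps(example)  # one line of your_data.jsonl
```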

Thanks a ton Ines, that’s exactly what I was looking for.

Just one last question. Is there any way I can use ner.manual without a spaCy model? It seems like my tokenization rules collide with the ones the model has, and as a result I get an error when it tries to load a differently tokenized word. To be more precise, I treat the word I&K as a single token, and there seems to be a conflict with how the spaCy model tokenizes it.

Yes, you definitely can – you just need to provide your own tokenization. The manual interface pre-tokenizes the text so your selection can “snap” to the token boundaries. This makes it easier and faster to annotate (and also helps you spot potential tokenization issues).

If you know how the text should be tokenized, you can provide an additional "tokens" property. If it’s present on the data, Prodigy will just use your tokenization instead of trying to translate your spans back to spaCy’s tokenization. Here’s an example:

{
    "text": "Hello Apple",
    "tokens": [
        {"text": "Hello", "start": 0, "end": 5, "id": 0},
        {"text": "Apple", "start": 6, "end": 11, "id": 1}
    ],
    "spans": [
        {"start": 6, "end": 11, "label": "ORG", "token_start": 1, "token_end": 1}
    ]
}

Each token should have a unique ID (its index), and each span can then define its "token_start" and "token_end" indices.
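As an illustration (this is not Prodigy’s internal code, just a sketch with a plain whitespace tokenizer), you could build the "tokens" list and attach the token indices to a span like this:

```python
def whitespace_tokens(text):
    """Tokenize on single spaces and record character offsets and IDs."""
    tokens, offset = [], 0
    for i, word in enumerate(text.split(" ")):
        tokens.append({"text": word, "start": offset, "end": offset + len(word), "id": i})
        offset += len(word) + 1  # +1 for the space
    return tokens

def attach_token_indices(span, tokens):
    """Map a character-offset span onto token boundaries."""
    span = dict(span)
    span["token_start"] = next(t["id"] for t in tokens if t["start"] == span["start"])
    span["token_end"] = next(t["id"] for t in tokens if t["end"] == span["end"])
    return span

tokens = whitespace_tokens("Hello Apple")
span = attach_token_indices({"start": 6, "end": 11, "label": "ORG"}, tokens)
```

Note the sketch assumes every span lines up exactly with token boundaries; if it doesn’t, the next(...) calls will raise a StopIteration, which is roughly the kind of mismatch the error you saw is about.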

Alternatively, if there are only very small differences, you could also adjust spaCy’s tokenization to match yours. This depends on what you’re most comfortable with. If you add your own rules to spaCy and then save out the model via nlp.to_disk(), the tokenizer will be saved out, too. (This is also the reason Prodigy lets you load in a model – it makes it easier to customise the behaviour without having to edit the recipe code.)
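For example, assuming the problem token is something like I&K, adding a tokenizer special case could look like this (the save path is illustrative):

```python
import spacy

# A blank English pipeline for brevity; the same works on a loaded
# model like en_core_web_sm.
nlp = spacy.blank("en")
# Tell the tokenizer to always keep "I&K" together as a single token
nlp.tokenizer.add_special_case("I&K", [{"ORTH": "I&K"}])

doc = nlp("I&K is a company")
# Save the pipeline; the customised tokenizer is saved with it, so
# Prodigy can load it like any other model.
nlp.to_disk("./custom_tokenizer_model")  # path is illustrative
```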


Thanks a lot for your help until now. I’m sorry to bother you again, but I was wondering if it is possible for a single entity to get multiple labels. What I mean is something like this, for example:

{"text": "Duckyyy is a name", "spans": [{"start": 0, "end": 7, "label": "NAME", "label": "PERSON"}]}

Thanks again.

Not in one go – but you can always make several passes over the same input text and annotate different types of entities. (Your example wouldn’t work, because the same key can only exist once – so one "label" would overwrite the other. But you can always annotate the same example twice in different datasets.)
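For what it’s worth, you can check the duplicate-key behaviour directly in Python:

```python
import json

# Duplicate keys in a JSON object: the later value silently wins,
# so the "NAME" label is lost when the line is parsed.
span = json.loads('{"start": 0, "end": 7, "label": "NAME", "label": "PERSON"}')
```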

Whether this makes sense depends on what you’re planning to do with the annotations later on. If you’re training a “standard” NER model, entity labels are typically mutually exclusive. So something is either a NAME or a PERSON – but not both.

It’d probably be more efficient to start off by improving an existing category like PERSON on your data, or train a broader category first. You can then always label the fine-grained categories afterwards, and train a model to predict those separately. You might also want to ask yourself whether you really need those two labels. It’s often better to focus on one really good definition that has the biggest impact for your application, rather than trying to do too much at once.
