User-supplied NER labels

I need to annotate text with labels that are provided by the user. I've read other posts which indicate that this is not possible, since the philosophy of Prodigy is that adding labels on the fly is a bad thing. I understand the reasons in certain cases; however, I don't see any way around this in my case. I'm not trying to train a model, I'm trying to extract information and construct a dataset for research purposes. Specifically, I want to task annotators with identifying and expanding abbreviations, and I'll then use this dataset for research. In particular, I'm interested in knowledge distillation between models where the student model has additional labels compared to the teacher.

What I specifically need to do is present the annotator with phrases, for example, "24 Hr Time". I need them to indicate text that is abbreviated (for example Hr) and its expansion (Hour).

A previous topic suggested using a text box via the blocks view_id together with a label, i.e. ABBREVIATION is the label and the user enters Hour. I tried this, but the text entered by the user is added to a "user_input" field of the example, whereas I need it to be a field of the span.

This is how I configured my recipe (based on ner.manual):

return {
    "view_id": "blocks",
    "config": {
        "blocks": [
            {"view_id": "ner_manual"},
            {"view_id": "text_input"}
        ],
    },
    # ... remaining recipe components as in ner.manual
}

Is there any way I can create a version of ner.manual where the user can add custom labels to spans?

Hi David,

thank you for your question.

If the blocks interface combining ner_manual and text_input is sufficient for you, you could add a simple post-processing step that sets the span's label to the value of the "user_input" field. One way to do this is to implement a before_db callback, so that the span is modified before it is saved to the database (https://prodi.gy/docs/custom-recipes#before_db). Be careful here, though, since a small bug in this function can lead to data loss.
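
A minimal sketch of such a callback, in case it helps. It assumes each annotated example is a dict with a "spans" list and the "user_input" field described above, and it copies the text onto the span under an "expansion" key, which is a hypothetical name chosen for illustration rather than a Prodigy field (assigning to the span's "label" instead would work the same way). Examples with zero or several spans are left untouched, because a single text box cannot be matched to one of several spans.

```python
def before_db(examples):
    """Copy the free-text input onto the span it describes before saving."""
    for eg in examples:
        expansion = (eg.get("user_input") or "").strip()
        spans = eg.get("spans") or []
        # A single text box can only be matched to a span unambiguously
        # when exactly one span was highlighted; other cases are left as-is.
        if expansion and len(spans) == 1:
            # "expansion" is a hypothetical key name, not a Prodigy field.
            spans[0]["expansion"] = expansion
    # The callback must return the (possibly modified) list of examples.
    return examples
```

The function would then be returned from the recipe as the "before_db" component, next to "view_id" and "config". The original "user_input" field is kept on the example, so nothing is lost if the copy step turns out to be wrong.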

Otherwise, if you want to add the span label on the fly, you might find this answer by Ines interesting even though this might need a bit more effort.


Thanks @Jette16, the two-pass option mentioned by Ines might work. Do you know if it's possible to highlight the token(s) that need text input in the second pass? For example, for "24 Hr Time" I'd need to highlight the annotation that needs secondary input (Hr), since I might have to present the same example multiple times if it contains multiple abbreviations.

Yes @david-waterworth, this is indeed possible, for example by using an HTML block.
In your recipe's config, you can define an HTML template via the html_template option (see https://prodi.gy/docs/custom-interfaces#html and https://prodi.gy/docs/api-interfaces#html), which could look like this:

<div>{{#tokens}}<mark style="background:{{c}};">{{t}}</mark>{{/tokens}}</div>

For each token in your text (for example, using spaCy to iterate over the tokens), you can create a dictionary {"t": token.text_with_ws, "c": color}, where color is a hex color string (e.g. #ffe184) for highlighted words or "inherit" for words that should not be colored. The task you yield should then include a key "tokens" that maps to the list of these token dictionaries.
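
For the two-pass setup, here is a rough sketch of how such tasks could be generated, one per abbreviation. Instead of re-tokenizing with spaCy, it reuses the tokens saved with the first-pass annotation. It assumes those tokens carry "text", "id" and "ws" keys and the spans carry "token_start" and "token_end", as ner_manual annotations typically do. The function name and the extra "span" key on the task are hypothetical.

```python
def make_highlight_tasks(example, color="#ffe184"):
    """Yield one HTML task per annotated span, colouring only that span."""
    for span in example.get("spans", []):
        tokens = []
        for tok in example["tokens"]:
            highlighted = span["token_start"] <= tok["id"] <= span["token_end"]
            tokens.append({
                # Token text plus trailing space, like spaCy's text_with_ws.
                "t": tok["text"] + (" " if tok.get("ws", True) else ""),
                # "inherit" leaves the token uncoloured in the template.
                "c": color if highlighted else "inherit",
            })
        # "span" is a hypothetical key that keeps track of which
        # abbreviation this task is asking about.
        yield {"text": example["text"], "tokens": tokens, "span": dict(span)}
```

An example with several abbreviations produces several tasks, each with a different span highlighted, so the same text can be shown once per abbreviation.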

I hope this answers your question!
