Customizing Text Presentation in NER Annotation


(W.P. McNeill) #1

I’d like to customize two aspects of the way the text is presented for NER annotation.

  1. I’d like to have it left-justified instead of centered as it is in the ner.teach recipe.
  2. I want to control the amount of surrounding context, e.g. display n sentences around each candidate entity.

I presume I can do (1) with a custom recipe. I just haven’t figured out how yet.

I thought I could do (2) simply by controlling the size of the text in each training sample, but it appears that if the text in a sample is large, only some of it is displayed in the annotation web app. Is there a way to customize this?

(Ines Montani) #2

Unfortunately, this is a little trickier, because the text styling etc. is defined in the global styles of the annotation card. You could work around this with a custom recipe and an HTML template or HTML tasks, but this would also mean you’d have to create the entity spans manually from the data. It would be more work and wouldn’t look as nice, because you’d have to take care of the styling yourself.

I’ve been thinking about adding a style option to the annotation card (and possibly other parts of the app) that’d let you specify any custom CSS overrides. So you’d be able to do something like "card_style": "text-align: left; font-weight: bold" etc. This feature has been on the roadmap for a while, but I wanted to wait and see if people actually wanted/needed something like this before making the app more complex by implementing it. But considering this has come up now, I’m happy to add it for the next release :blush:

The best solution that gives you maximum control would probably be to remove this line from the recipes/, or to replace it with your own logic:

    # Split the stream into sentences
    stream = split_sentences(model.orig_nlp, stream)

The split_sentences pre-processor splits the incoming text tasks into individual sentences, using the model’s sentence boundary detector. If you remove this step, you can control the surrounding context by passing it in as the "text" of the annotation task. Prodigy would then render whatever comes in, without splitting or modifying it.
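
As a rough sketch of that idea (not part of Prodigy itself), the function below groups sentences that have already been split into overlapping windows and emits one task dict per window, so each "text" already carries the context you want. The names context_windows and n_context are hypothetical, and the sentence splitting is assumed to have happened beforehand with whatever logic suits your data.

```python
def context_windows(sentences, n_context=1):
    """Yield one task per sentence, padded with n_context sentences on each side.

    Hypothetical helper: Prodigy only ever sees the resulting "text" value.
    """
    for i in range(len(sentences)):
        start = max(0, i - n_context)
        end = min(len(sentences), i + n_context + 1)
        yield {"text": " ".join(sentences[start:end])}


sentences = [
    "First sentence.",
    "Second sentence.",
    "Third sentence.",
    "Fourth sentence.",
]
tasks = list(context_windows(sentences, n_context=1))
```

A stream built this way would only keep its context if the sentence-splitting step in the recipe is removed or turned off; otherwise the windows would be split apart again.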

(W.P. McNeill) #3

It makes sense not to add features unless they’re necessary, but since Prodigy emphasizes the importance of user experience in annotation you may be justified in putting extra effort into the details of how things look.

For example, my data consists of dense text whose meaningful context is at a minimum a few sentences long. The cognitive load of reading this in unfamiliar justification is quite high. And a reduction in cognitive load is one of the main values that Prodigy offers.

(BTW I’m very excited about what you’re doing here and happy to give feedback about my user experience as an annotator.)

(Ines Montani) #4

Fixed in v1.2.0! You can now specify a "card_css" setting in your prodigy.json or recipe config to overwrite any formatting of the annotation card content. For example:

    "card_css": "text-align: left"
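
For the recipe config variant, a minimal sketch might look like the following. It assumes a custom recipe returns a dictionary of components with a "config" entry; the helper name recipe_components and the dataset name are made up for illustration, and in a real recipe this dictionary would be returned from the decorated recipe function.

```python
def recipe_components(stream, dataset="ner_left_aligned"):
    """Build the components a custom recipe function might return.

    Hypothetical helper; the dataset name is a placeholder.
    """
    return {
        "dataset": dataset,
        "stream": stream,
        "view_id": "ner",
        # Override the annotation card formatting, as described above
        "config": {"card_css": "text-align: left"},
    }


components = recipe_components([{"text": "Hello Apple"}])
```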

(Bhanu Sharma) #5

Hey @ines, can you please give more details regarding this issue? I am unable to work out what “text” is in this context and where to pass it.

(Ines Montani) #6

Sorry if this was unclear – I was referring to the "text" key of the annotation task dictionary. Under the hood, each individual annotation task (or “question”) looks like this:

    {
        "text": "Hello Apple",
        "spans": [
            {"start": 6, "end": 11, "label": "ORG"}
        ]
    }

In the web app, Prodigy will then display the value of "text" and, if available, entities, labels etc. If you’re loading in JSON or JSONL, you can control the text by modifying the "text" key. In CSV files, it’s the text column.
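
If you generate the JSONL yourself, something along these lines could build each task and compute the character offsets for a span. This is only a sketch: make_task is a hypothetical helper, and it assumes the entity string occurs in the text and that its first occurrence is the one you want.

```python
import json


def make_task(text, entity, label):
    """Build a task dict with one span covering the first occurrence of entity."""
    start = text.index(entity)  # raises ValueError if entity is not in text
    end = start + len(entity)   # "end" is exclusive, like Python slicing
    return {"text": text, "spans": [{"start": start, "end": end, "label": label}]}


task = make_task("Hello Apple", "Apple", "ORG")
line = json.dumps(task)  # one JSON object per line makes up a JSONL file
```

Whatever ends up in "text" is what gets displayed, so this is also the place to decide how much surrounding context each task carries.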

Btw, just a quick update on the thread topic: As of Prodigy v1.4.0, recipes that use sentence segmentation now also include an --unsegmented option to turn it off. You can also set a split_sents_threshold in your Prodigy config, which is the minimum number of characters needed for Prodigy to split the sentence automatically. This lets you use your own segmentation logic, while still keeping a fallback option in case you end up with a stray, very long example (which could otherwise easily throw off the model and make the active learning-powered recipes slow.)