SpaCy Manual HTML Rendering Looks Wrong

I am using displacy.render to render manual NER annotations in HTML, and the output looks wrong. I am running the following code:

displacy.render(document_html, style="ent", page=True, manual=True, options={"colors": {"AUTHOR": "Salmon"}})

The contents of document_html are in the attached document-1.entities.jsonl file. (Actually that file’s contents are incorrect. See my reply below.) The original text is:

Pynchon and Nabokov

SECTION 1.0 Thomas Pynchon
Thomas Pynchon’s greatest novel is “Gravity’s Rainbow”.

It tells the story of an American army officer in occupied
Germany pursuing a mystical V-2 rocket.

SECTION 2.0 Vladimir Nabokov
Vladimir Nabokov’s greatest novel is “Lolita”.

“Lolita”'s main character, Humbert Humbert, is one of the
most famous unreliable narrators in all of literature.

I’ve verified that the “text” value in document_html is correct, as are all the entity character offsets. I expect all the “Thomas Pynchon” and “Vladimir Nabokov” spans to be highlighted in salmon with the label “AUTHOR”. Instead I see this.

(I see the same problem if I change the command option to page=False.)

Am I doing something wrong or is this a bug?

Here’s the HTML that was generated:

<!DOCTYPE html>
<html>
    <head>
        <title>displaCy</title>
    </head>

    <body style="font-size: 16px; font-family: -apple-system, BlinkMacSystemFont, 'Segoe UI', Helvetica, Arial, sans-serif, 'Apple Color Emoji', 'Segoe UI Emoji', 'Segoe UI Symbol'; padding: 4rem 2rem;">
<figure style="margin-bottom: 6rem">
<h2 style="margin: 0">document-1.txt</h2>

<div class="entities" style="line-height: 2.5">Pynchon and Nabokov</br></br>SECTION 1.0 Thomas Pynchon</br>Thomas Pynchon's greatest novel is "Gravity's Rainbow".</br></br>It tells the story of an American army officer in occupied</br>Germany pursuing a mystical V-2 rocket.</br></br></br>SECTION 2.0 
<mark class="entity" style="background: Salmon; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em; box-decoration-break: clone; -webkit-box-decoration-break: clone">
    Vladimir Nabokov
    <span style="font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; text-transform: uppercase; vertical-align: middle; margin-left: 0.5rem">AUTHOR</span>
</mark>
</br>
<mark class="entity" style="background: Salmon; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em; box-decoration-break: clone; -webkit-box-decoration-break: clone">
    Vladimir Nabokov
    <span style="font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; text-transform: uppercase; vertical-align: middle; margin-left: 0.5rem">AUTHOR</span>
</mark>

<mark class="entity" style="background: Salmon; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em; box-decoration-break: clone; -webkit-box-decoration-break: clone">
    Thomas Pynchon
    <span style="font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; text-transform: uppercase; vertical-align: middle; margin-left: 0.5rem">AUTHOR</span>
</mark>
</br>
<mark class="entity" style="background: Salmon; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em; box-decoration-break: clone; -webkit-box-decoration-break: clone">
    Thomas Pynchon
    <span style="font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; text-transform: uppercase; vertical-align: middle; margin-left: 0.5rem">AUTHOR</span>
</mark>
's greatest novel is "Gravity's Rainbow".

It tells the story of an American army officer in occupied
Germany pursuing a mystical V-2 rocket.


SECTION 2.0 Vladimir Nabokov
Vladimir Nabokov's greatest novel is "Lolita".

"Lolita"'s main character, Humbert Humbert, is one of the
most famous unreliable narrators in all of literature.
</div>
</figure>
</body>
</html>

Just to make sure I understand this correctly: The value of document_html is the contents of the JSONL file? I’m a little surprised it works at all in the first place, because your format is pretty different from the “manual” format displaCy expects. See this page for the format, which looks like this:

{
    'text': 'But Google is starting from behind.',
    'ents': [{'start': 4, 'end': 10, 'label': 'ORG'}],
    'title': None
}

The displaCy format existed before Prodigy and before we introduced the “simple training style” shortly before launching v2.0 stable. This is why the formats aren’t perfectly consistent at the moment.

Disregard the document-1.entities.jsonl linked above. That was a mistake.

Here is a JSON representation of the Python object I am passing as the document_html parameter to displacy.render. This should be the correct format for manual NER rendering in HTML.

[
  {
    "ents": [
      {
        "label": "AUTHOR",
        "end": 234,
        "start": 218
      },
      {
        "label": "AUTHOR",
        "end": 251,
        "start": 235
      },
      {
        "label": "AUTHOR",
        "end": 47,
        "start": 33
      },
      {
        "label": "AUTHOR",
        "end": 62,
        "start": 48
      }
    ],
    "text": "Pynchon and Nabokov\n\nSECTION 1.0 Thomas Pynchon\nThomas Pynchon's greatest novel is \"Gravity's Rainbow\".\n\nIt tells the story of an American army officer in occupied\nGermany pursuing a mystical V-2 rocket.\n\n\nSECTION 2.0 Vladimir Nabokov\nVladimir Nabokov's greatest novel is \"Lolita\".\n\n\"Lolita\"'s main character, Humbert Humbert, is one of the\nmost famous unreliable narrators in all of literature.\n",
    "title": "document-1.txt"
  }
]
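
For anyone who wants to double-check offsets like these, here is a minimal sketch of the kind of check involved: slice the text with each start/end pair and look at what comes back. The helper name spans_covered and the shortened sample text are assumptions made for illustration; they are not part of spaCy or of the original document.

```python
# Hypothetical helper (not part of spaCy): show which substring each
# manual-format entity span actually covers.
def spans_covered(doc):
    text = doc["text"]
    return [
        (ent["start"], ent["end"], ent["label"], text[ent["start"]:ent["end"]])
        for ent in doc["ents"]
    ]


# Shortened sample in the displaCy manual format; the entities are
# deliberately listed out of order, as in the object above.
doc = {
    "text": "SECTION 1.0 Thomas Pynchon\nThomas Pynchon's greatest novel",
    "ents": [
        {"start": 27, "end": 41, "label": "AUTHOR"},
        {"start": 12, "end": 26, "label": "AUTHOR"},
    ],
    "title": None,
}

for start, end, label, covered in spans_covered(doc):
    print(start, end, label, repr(covered))
```

A check like this only confirms that each span covers the intended characters. It says nothing about the order in which the spans are listed.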

No worries – thanks for updating! Could you try ordering the entities by "start" index? I think displaCy currently doesn’t do this in “manual” mode and expects them to arrive already sorted, so the rendering algorithm can step through them one by one.
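
The reordering could look something like the sketch below. It uses plain Python only; the helper name sort_ents_by_start is an assumption made for illustration, not a displaCy function, and the data is the object posted above.

```python
# Hypothetical helper (not part of spaCy): return copies of the
# manual-format docs with each "ents" list ordered by start offset.
def sort_ents_by_start(docs):
    return [
        {**doc, "ents": sorted(doc["ents"], key=lambda ent: (ent["start"], ent["end"]))}
        for doc in docs
    ]


text = (
    "Pynchon and Nabokov\n\nSECTION 1.0 Thomas Pynchon\n"
    "Thomas Pynchon's greatest novel is \"Gravity's Rainbow\".\n\n"
    "It tells the story of an American army officer in occupied\n"
    "Germany pursuing a mystical V-2 rocket.\n\n\n"
    "SECTION 2.0 Vladimir Nabokov\n"
    "Vladimir Nabokov's greatest novel is \"Lolita\".\n\n"
    "\"Lolita\"'s main character, Humbert Humbert, is one of the\n"
    "most famous unreliable narrators in all of literature.\n"
)

document_html = [
    {
        "ents": [
            {"label": "AUTHOR", "end": 234, "start": 218},
            {"label": "AUTHOR", "end": 251, "start": 235},
            {"label": "AUTHOR", "end": 47, "start": 33},
            {"label": "AUTHOR", "end": 62, "start": 48},
        ],
        "text": text,
        "title": "document-1.txt",
    }
]

ordered = sort_ents_by_start(document_html)
print([ent["start"] for ent in ordered[0]["ents"]])  # [33, 48, 218, 235]
```

The sorted list can then be passed to displacy.render in place of document_html, with the same style, manual and options arguments as in the original call.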

That fixes the issue. Thanks.

You might want to make note of this in the Rendering Data Manually section of the spaCy documentation.