Can the relations view_id render HTML instead of text tokens?

How can I render HTML in a custom relations recipe? I want to annotate relations in text that comes from HTML, and without seeing the rendered HTML it's not possible to tell what goes where.

This is what I came up with, but it shows spans instead of HTML, even when the example has an html attribute alongside the text and tokens attributes:


import prodigy

from typing import Generator, TypedDict, Optional
import srsly
import bs4
from bs4 import BeautifulSoup, Tag
import en_core_web_sm

nlp = en_core_web_sm.load()

# from notebooks/prodigy.ipynb

def html_to_text(html: str) -> str:
    soup = BeautifulSoup(html, features="html.parser")

    # kill all script and style elements
    for script in soup(["script", "style"]):
        script.extract()    # rip it out

    # get text
    text = soup.get_text()
    return text

def body_only(html: str) -> Tag | bs4.element.NavigableString | None:
    soup = BeautifulSoup(html, features="html.parser")
    # kill all script and style elements
    for script in soup(["script", "style"]):
        script.extract()    # rip it out
    body = soup.find('body')
    return body

class Example(TypedDict):
    html: str
    text: str

def g(data: Example) -> Optional[Example]:
    body = body_only(data['html'])
    if body is not None:
        # Keep only body because page HTML breaks prodigy CSS
        example: Example = {"html": str(body),  # XXX 123 does not typecheck though
                            "text": html_to_text(body.get_text())}
        return example
    return None

def text_to_tokens(text):
    doc = nlp(text)
    tokens = []
    for tok in doc:
        # Prodigy token format, e.g.
        # {"text": "My", "start": 0, "end": 2, "id": 0, "ws": true},
        tokens.append({
            "text": tok.text,
            "start": tok.idx,
            "end": tok.idx + len(tok.text),
            "id": tok.i,
            # "ws" means "followed by whitespace", not "is whitespace"
            "ws": bool(tok.whitespace_),
        })
    return tokens

def load_my_custom_stream(source: str = "b.jsonl") -> Generator:
    for data in srsly.read_jsonl(source):
        b = g(data)
        if b is not None:
            tokens = text_to_tokens(b['text'])
            yield {"html": b['html'], "text": b['text'], "tokens": tokens}

blocks = [
    {"view_id": "relations"}
]

@prodigy.recipe(
    "my-custom-recipe",  # recipe name assumed; it is missing from the pasted code
    dataset=("Dataset to save answers to", "positional", None, str),
    view_id=("Annotation interface", "option", "v", str),
    source=("Source JSONL file", "option", "s", str),
)
def my_custom_recipe(dataset, view_id="html", source="./notebooks/b.jsonl"):  # TODO remove view_id
    # Load your own streams from anywhere you want
    stream = load_my_custom_stream(source)

    def update(examples):
        # This function is triggered when Prodigy receives annotations
        print(f"Received {len(examples)} annotations!")

    return {
        "view_id": "blocks",
        "dataset": dataset,  # Name of dataset to save annotations
        "stream": stream,  # Incoming stream of examples
        "config": {"blocks": blocks,
                   "labels": ["Within the", "Near the"],
                   "relations_span_labels": ["PATHOLOGY", "DESCRIPTOR", "LOCATION"]}
    }

I think the tokeniser that you're using is generating a bunch of newline characters that you're not interested in, but the only way for me to confirm that is to run this locally. Do you have a bit of data that you can share so that I can replicate your situation on my machine? It would also help to know the versions of the Python packages that you're using.

Also, a small detail: you seem to have written your own text_to_tokens function. Prodigy also offers helpers for this, documented here:
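If stray newlines do turn out to be the cause, one quick way to test that is to collapse whitespace in the extracted text before tokenising it. This is only a minimal sketch: `collapse_whitespace` is a hypothetical helper, not part of Prodigy or spaCy, and the sample string is made up.

```python
import re


def collapse_whitespace(text: str) -> str:
    """Collapse every run of whitespace (spaces, tabs, newlines) into one space."""
    return re.sub(r"\s+", " ", text).strip()


# Made-up sample resembling what BeautifulSoup's get_text() can return
raw = "\n\n  Lesion \n\n\n within the\tleft lobe \n"
clean = collapse_whitespace(raw)
print(repr(clean))
```

In the recipe above this would be applied to the result of `html_to_text` before calling `text_to_tokens`, so that the token offsets refer to the cleaned text.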

@koaning thanks for helping with this! The problem is that I want to label relations on rendered HTML, not on the lines of text tokens that the relations interface provides for annotating.

If I could somehow hack relations so that it uses the existing "div"s as click targets for annotating relations, that would be perfect :slight_smile: @ines is something like that possible at all?

I've created a repro for all of this here

Oh, nice, thank you!

TBH, I spent about 4 hours trying to make Prodigy do what I want, then gave up and, in about an hour, coded a Chrome extension that lets you just click divs and annotate relations. It's a pity, because I would love to stitch Prodigy into more sophisticated scenarios like this :confused:

Ahhh, that makes your goal a lot clearer to me. Unfortunately, that's not something that Prodigy allows you to do with its base recipes right now. It might be interesting to toy around with a custom tokeniser for HTML, but you'll eventually hit issues with that as well because of how deeply nested HTML can be. If you're interested in selecting a component, you'd want to select the opening <div> tag as well as the closing </div> one.
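To make the tokeniser idea a bit more concrete, here is a rough sketch that treats every HTML tag as its own token and emits dicts with the same keys as the token format used earlier in this thread ("text", "start", "end", "id", "ws"). `html_to_tokens` and `TOKEN_RE` are hypothetical names, the regex is deliberately naive (a `>` inside an attribute value would break it), and the relations interface would still show the tags as plain text rather than rendering them.

```python
import re
from typing import Dict, List

# Either a whole tag (<div class="x">, </div>) or a run of characters
# that contains neither whitespace nor the start of a tag.
TOKEN_RE = re.compile(r"<[^>]+>|[^\s<]+")


def html_to_tokens(html: str) -> List[Dict]:
    """Split raw HTML into tag tokens and word tokens with character offsets."""
    tokens = []
    for i, match in enumerate(TOKEN_RE.finditer(html)):
        end = match.end()
        tokens.append({
            "text": match.group(),
            "start": match.start(),
            "end": end,
            "id": i,
            # True when the token is directly followed by whitespace
            "ws": end < len(html) and html[end].isspace(),
        })
    return tokens


html = "<div>Lesion <b>near</b> the lobe</div>"
tokens = html_to_tokens(html)
print([t["text"] for t in tokens])
```

With tokens like these, selecting a component means picking both the `<div>` token and its matching `</div>` token, which is exactly where nested markup starts to get painful.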

Could you elaborate a bit more on the task though? What kinds of html elements are you trying to link?

If this is open sourced I'd love to have a look at it. I can't make any promises, but after v1.12 we'll be looking at some new interfaces for Prodigy v2 ... and having one for HTML certainly sounds like something worth considering.