Can relations view_id use HTML render instead of text tokens?

ysz · May 31, 2023, 2:15pm

How can I render HTML in a custom relations recipe? I want to annotate relations in text that comes from HTML, and without seeing the rendered HTML it's not possible to tell what goes where.

This is what I came up with, but it shows spans instead of HTML, even when the example has an html attribute alongside the text and tokens attributes:

# https://prodi.gy/docs/custom-recipes

import prodigy

from typing import Generator, TypedDict, Optional
import srsly
import bs4
from bs4 import BeautifulSoup, Tag
import en_core_web_sm

nlp = en_core_web_sm.load()


# from notebooks/prodigy.ipynb


def html_to_text(html: str) -> str:
    soup = BeautifulSoup(html, features="html.parser")

    # kill all script and style elements
    for script in soup(["script", "style"]):
        script.extract()    # rip it out

    # get text
    text = soup.get_text()
    return text


def body_only(html: str) -> Tag | bs4.element.NavigableString | None:
    soup = BeautifulSoup(html, features="html.parser")
    # kill all script and style elements
    for script in soup(["script", "style"]):
        script.extract()    # rip it out
    body = soup.find('body')
    return body


# https://docs.python.org/3/library/typing.html#typing.TypedDict
class Example(TypedDict):
    html: str
    text: str


def g(data: Example) -> Optional[Example]:
    body = body_only(data['html'])
    if body is not None:
        # Keep only body because page HTML breaks prodigy CSS
        example: Example = {"html": str(body),  # XXX 123 does not typecheck though https://stackoverflow.com/q/76373303/7424605
                            # body already has script/style removed, so take its text directly
                            "text": body.get_text()}
        return example
    return None


def text_to_tokens(text):
    doc = nlp(text)
    # Token format: https://prodi.gy/docs/api-interfaces#relations-settings
    # e.g. {"text": "My", "start": 0, "end": 2, "id": 0, "ws": true}
    tokens = []
    for i, tok in enumerate(doc):
        tokens.append({
            "text": tok.text,
            "start": tok.idx,
            "end": tok.idx + len(tok.text),
            "id": i,
            # "ws" means "followed by whitespace", not "is a whitespace token"
            "ws": bool(tok.whitespace_),
        })
    return tokens


def load_my_custom_stream(source: str = "b.jsonl") -> Generator:
    for data in srsly.read_jsonl(source):
        b = g(data)
        if b is not None:
            tokens = text_to_tokens(b['text'])
            yield {"html": b['html'], "text": b['text'], "tokens": tokens}


blocks = [
    {"view_id": "relations"}
]


@prodigy.recipe(
    "my-custom-recipe",
    dataset=("Dataset to save answers to", "positional", None, str),
    view_id=("Annotation interface", "option", "v", str),
    source=("Source JSONL file", "option", "s", str)
)
def my_custom_recipe(dataset, view_id="html", source="./notebooks/b.jsonl"):  # TODO remove view_id
    # Load your own streams from anywhere you want
    stream = load_my_custom_stream(source)

    def update(examples):
        # This function is triggered when Prodigy receives annotations
        print(f"Received {len(examples)} annotations!")

    # https://support.prodi.gy/t/enabling-both-assign-relations-and-select-spans-in-custom-relations-recipe/3647/5?u=ysz
    return {
        "view_id": "blocks",
        "dataset": dataset,  # Name of dataset to save annotations
        "stream": stream,  # Incoming stream of examples
        "config": {"blocks": blocks,
                   "labels": ["Within the", "Near the"],
                   "relations_span_labels": ["PATHOLOGY", "DESCRIPTOR", "LOCATION"]}
    }

koaning · June 1, 2023, 8:27am

I think the tokeniser that you're using is generating a bunch of newline characters that you're not interested in, but the only way to confirm for me is to try and run this locally. Do you have a bit of data that you can share so that I may replicate your situation locally on my machine? It would also help to know the versions of the Python packages that you're using.
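
If stray newlines do turn out to be the cause, one option is to collapse whitespace in the extracted text before it is tokenised. The following is only a minimal sketch under that assumption; `normalize_whitespace` is a hypothetical helper, not part of Prodigy or spaCy, and the sample string is made up.

```python
import re


def normalize_whitespace(text: str) -> str:
    """Collapse every run of whitespace (spaces, tabs, newlines) into one space."""
    return re.sub(r"\s+", " ", text).strip()


# Made-up sample resembling text extracted from HTML with get_text()
raw = "\n\n  Lesion \n\n within the\n left lobe \n"
print(normalize_whitespace(raw))
```

Applying this to the text before tokenising would keep the token offsets consistent with the cleaned text, since both would be computed from the same string.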

Also, bit of a detail, you seem to have written your own text_to_tokens function. Prodigy also offers helpers for this, documented here:

ysz · June 2, 2023, 9:45am

@koaning thanks for helping with this! The problem is that I want to label relations on the rendered HTML, not on the lines of text tokens that the relations interface provides.

If I could somehow hack relations so that it uses the existing "div"s as click targets for annotating relations, that would be perfect. @ines, is something like that possible at all?

I've created a repro for all of this here

Oh, nice, thank you!

TBH, I spent about 4 hours trying to make Prodigy do what I want, then gave up and, in about an hour, coded a Chrome extension that lets you just click divs and annotate relations. It's a pity, because I would love to use Prodigy for more sophisticated scenarios like this.

koaning · June 2, 2023, 10:09am

Ahhh, that makes your goal a lot clearer to me. Unfortunately, that's not something that Prodigy allows you to do with its base recipes right now. It might be interesting to toy around with a custom tokeniser for HTML, but you'll eventually hit issues with that as well because of how nested HTML can be in general. If you're interested in selecting a component, you'd want to select the opening <div> tag as well as the closing </div> one.
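
To make the nesting issue concrete, here is a rough sketch that pairs each opening <div> with its closing tag using a stack, built on Python's standard-library `html.parser`. It is not a Prodigy feature and not a tokeniser for the relations interface; `DivCollector` and `collect_divs` are hypothetical names used only for illustration.

```python
from html.parser import HTMLParser


class DivCollector(HTMLParser):
    """Collect one record per <div>, tracking nesting with a stack."""

    def __init__(self):
        super().__init__()
        self.stack = []  # indices of the divs that are currently open
        self.divs = []   # one dict per div: id, parent id, own text

    def handle_starttag(self, tag, attrs):
        if tag == "div":
            parent = self.stack[-1] if self.stack else None
            self.divs.append({"id": len(self.divs), "parent": parent, "text": ""})
            self.stack.append(len(self.divs) - 1)

    def handle_endtag(self, tag):
        # A closing tag is paired with the most recently opened div
        if tag == "div" and self.stack:
            self.stack.pop()

    def handle_data(self, data):
        # Text belongs to the innermost open div only
        if self.stack:
            self.divs[self.stack[-1]]["text"] += data


def collect_divs(html: str) -> list:
    parser = DivCollector()
    parser.feed(html)
    parser.close()
    for div in parser.divs:
        div["text"] = " ".join(div["text"].split())  # tidy whitespace
    return parser.divs


for div in collect_divs("<div>Outer <div>inner</div> tail</div>"):
    print(div)
```

Even in this tiny example the outer div's text is split around the inner one, which hints at why flat token offsets are awkward for nested elements.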

Could you elaborate a bit more on the task though? What kinds of html elements are you trying to link?

If this is open sourced I'd love to have a look at it. I can't make any promises, but after v1.12 we'll be looking at some new interfaces for Prodigy v2 ... and having one for HTML certainly sounds like something worth considering.
