Loading non-Prodigy pre-annotated text


I want to load into Prodigy some pre-annotated texts for Relation Extraction from an open-source study. My idea is to review the annotated data and modify NER and RE labels based on different annotation guidelines. In this case, the authors provide a table with rows indicating relationships. Each row contains the sentence, with two columns indicating two entities in a relationship as well as their start and end character indexes in the sentence. An additional column contains the label for the relationship.

For example, the header of the table looks like:

Sentence | Ent_1 - start_char : end_char | Ent_2 - start_char : end_char | RE type

While I expect this will involve some data wrangling, are there best practices to format this type of data into the RE task format for Prodigy?

Specifically, regarding tokenization: I am able to use spaCy's blank tokenizer (`spacy.blank("en")`) in Python to tokenize each sentence, and use token attributes to map the entity labels in the table to spaCy's tokens. However, I am not sure how to reproduce the "ws" field from the RE task format. It seems that field indicates whether to show a whitespace in the Prodigy interface. How can I reliably reproduce Prodigy's behaviour for that field?

Finally, other annotators in my team will later annotate the texts from scratch. I want to make sure that my review annotations on top of the pre-annotated data are compatible with the from-scratch annotations from the other annotators. I hope not to introduce incompatibilities, since the data will be loaded differently in the two cases (pre-annotated data for review vs. raw texts for the other annotators). In the end I would like to run evaluation recipes on my review dataset and the annotators' data combined, hence the importance of compatibility.


Hi @ale,

You're right, translating between data formats will require some additional Python scripting outside Prodigy. The most important "good practice" recommendation here is to use the spaCy tokenizer and the spaCy `Doc` and `Span` data structures to make sure the start and end offsets are aligned with the tokenization used.
If the offsets can be translated into a spaCy span given the tokenization, you can safely use them to set entity and relation annotations. Otherwise, the script should raise a warning so that you can inspect the cases (and reasons) for misalignment. I'm attaching an example of such a script below.
It essentially processes the CSV examples one by one and translates each into a Prodigy task dictionary.
It starts by converting the text to a spaCy doc and trying to set the entity and relation attributes according to the given character offsets.
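To illustrate what "aligned to the tokenization" means: an entity's character offsets must land exactly on token boundaries, which is what `doc.char_span` checks for you (it returns `None` otherwise). Here's a rough stdlib-only sketch of the same idea using a naive whitespace tokenizer (`is_aligned` is a made-up helper for illustration, not a Prodigy or spaCy API):

```python
def is_aligned(text: str, start: int, end: int) -> bool:
    """Check that [start, end) falls exactly on token boundaries of a naive
    whitespace tokenization. spaCy's doc.char_span does this properly for
    its own tokenization, returning None on misalignment."""
    starts, ends = set(), set()
    pos = 0
    for tok in text.split():
        idx = text.index(tok, pos)  # char offset where this token begins
        starts.add(idx)
        ends.add(idx + len(tok))
        pos = idx + len(tok)
    return start in starts and end in ends

text = "Susan lives in New York"
assert is_aligned(text, 0, 5)         # "Susan" is a whole token
assert is_aligned(text, 15, 23)       # "New York" spans two whole tokens
assert not is_aligned(text, 0, 3)     # "Sus" cuts a token in half
```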

As for the "ws" attribute, which indicates whether the token is followed by a whitespace: we would just copy the `token.whitespace_` attribute from the spaCy token when translating from the spaCy representation to the Prodigy representation. In fact, we have an undocumented helper `get_token` that you can import from `prodigy.components.preprocess` (my example script below imports it) that does just that:

```python
def get_token(token: "Token", i: int) -> Dict[str, Any]:
    """Create a token dict for a Token object. Helper function used inside
    add_tokens preprocessor or standalone if recipes need more flexibility.

    token (spacy.tokens.Token): The token.
    i (int): The index of the token. Important: we're not using token.i here,
        because that might not actually reflect the correct token index in the
        example (e.g. when sentence segmentation is enabled).
    RETURNS (dict): The token representation.
    """
    return {
        "text": token.text,
        "start": token.idx,
        "end": token.idx + len(token.text),
        "id": i,
        "ws": bool(token.whitespace_),
    }
```
Regarding compatibility: as long as you use the same tokenization and the same set of labels (both spans and relations) for the revised and "from scratch" annotations, they should be compatible.
The translation from CSV to JSONL via the spaCy `Doc` object will make sure there are no oddities.
One thing to keep in mind (I included the relevant comment in the script): you need to check whether the end offsets used in your external annotations are inclusive or exclusive. Prodigy uses exclusive offsets, so while writing the Prodigy task dictionary you need to make sure the end offset is exclusive.
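A quick way to check which convention an external dataset uses is to slice the text with the raw offsets; with inclusive ends, the slice will come out one character short (plain Python, using the first sample row below):

```python
text = "Susan lives in New York"
# The source lists "Susan" as 0:4, but a direct Python slice drops the last char:
assert text[0:4] == "Susa"
# ...so these offsets are inclusive, and the end needs a +1 for Prodigy:
assert text[0:4 + 1] == "Susan"
assert text[15:22 + 1] == "New York"
```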
In my example here I assume the input csv is something like:

```
Susan lives in New York,0:4,15:22,LIVES_IN
The cat sat on the mat yesterday,4:6,19:21,SITS_ON
```

As you can see, the end offsets are inclusive, so in my script below I increment them by 1 to meet Prodigy's requirement for exclusive end offsets:

```python
import csv
from typing import Dict, Optional, Tuple

import spacy
import srsly
from prodigy.components.preprocess import get_token
from spacy import Language
from spacy.tokens import Doc, Span
from wasabi import msg


def convert_to_span(
    start_char: int, end_char: int, label: str, doc: Doc, idx: int
) -> Optional[Span]:
    span = doc.char_span(start_char, end_char, label=label)
    if span is None:
        msg.warn(
            f"Misaligned tokenization for entity: {start_char}, {end_char} at row: {idx}"
        )
    return span


def add_annotations(
    text: str,
    head: Tuple[int, int],
    child: Tuple[int, int],
    label: str,
    nlp: Language,
    idx: int,
) -> Dict:
    doc = nlp(text)
    # If the entity labels are given in the source, they should be
    # extracted here and used instead of UNK.
    # To make sure multi-token entities are displayed correctly, all entity
    # labels should be added to `rel.manual` via `--span-label`.
    entity_label = "UNK"
    # In Prodigy the end index is exclusive, so the source's inclusive
    # offsets need to be adjusted accordingly.
    head_span = convert_to_span(
        start_char=head[0], end_char=head[1] + 1, label=entity_label, doc=doc, idx=idx
    )
    child_span = convert_to_span(
        start_char=child[0], end_char=child[1] + 1, label=entity_label, doc=doc, idx=idx
    )

    if head_span is None or child_span is None:
        # if the spans are misaligned, return an example w/o any annotations
        return {"text": text, "tokens": [get_token(t, t.i) for t in doc]}

    return {
        "text": doc.text,
        # we are copying the token information directly from the spaCy tokenizer
        "tokens": [get_token(t, t.i) for t in doc],
        "spans": [
            {
                "start": entity.start_char,
                "end": entity.end_char,
                # Prodigy's token_end is inclusive; spaCy's Span.end is exclusive
                "token_start": entity.start,
                "token_end": entity.end - 1,
                "label": entity.label_,
            }
            for entity in [head_span, child_span]
        ],
        "relations": [
            {
                "head": head_span.start,
                "head_span": {
                    "start": head_span.start_char,
                    "end": head_span.end_char,
                    "token_start": head_span.start,
                    "token_end": head_span.end - 1,
                    "label": head_span.label_,
                },
                "child": child_span.start,
                "child_span": {
                    "start": child_span.start_char,
                    "end": child_span.end_char,
                    "token_start": child_span.start,
                    "token_end": child_span.end - 1,
                    "label": child_span.label_,
                },
                "label": label,
            }
        ],
    }


def main():
    nlp = spacy.blank("en")
    jsonl_examples = []

    with open("external_data.csv", mode="r") as file:
        csv_reader = csv.reader(file)
        for idx, row in enumerate(csv_reader):
            text, head, child, label = row
            head_start_char, head_end_char = map(int, head.split(":"))
            child_start_char, child_end_char = map(int, child.split(":"))

            example = add_annotations(
                text=text,
                head=(head_start_char, head_end_char),
                child=(child_start_char, child_end_char),
                label=label,
                nlp=nlp,
                idx=idx,
            )
            jsonl_examples.append(example)

    output_path = "annotated_data.jsonl"
    srsly.write_jsonl(output_path, jsonl_examples)
    msg.info(f"Saved annotations at {output_path}")


if __name__ == "__main__":
    main()
```

Try it with just a few examples and see if you can correctly visualize the resulting dataset with `rel.manual`.
Note that this script also assumes that the first entity is always the head, that the second entity is always the child, and that they don't have any entity labels assigned.
The script assigns a dummy UNK label that needs to be listed under `--span-label` for `rel.manual` to correctly group multi-token entities.
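Before loading the output into `rel.manual`, a quick structural sanity check on the produced JSONL can catch shape problems early. A minimal stdlib-only sketch (`check_task` is a made-up helper, not a Prodigy API; the expected keys follow the task format shown above):

```python
import json

SPAN_KEYS = {"start", "end", "token_start", "token_end", "label"}
REL_KEYS = {"head", "child", "head_span", "child_span", "label"}

def check_task(task: dict) -> bool:
    """Rough structural check of a Prodigy rel task dictionary."""
    if not {"text", "tokens"} <= task.keys():
        return False
    for span in task.get("spans", []):
        if not SPAN_KEYS <= span.keys():
            return False
    for rel in task.get("relations", []):
        if not REL_KEYS <= rel.keys():
            return False
    return True

# e.g. run over each line of annotated_data.jsonl:
line = '{"text": "Susan lives in New York", "tokens": []}'
assert check_task(json.loads(line))
```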