Hi @ale,
You're right. Translating between data formats will require additional Python scripting outside Prodigy. I'd say the most important "good practice" recommendation here is to use the spaCy tokenizer and the spaCy Doc and Span data structures to make sure the start and end offsets are aligned with the tokenization used.
If the offsets can be translated into a spaCy Span given the tokenization used, you can safely use them to set entity and relation annotations. Otherwise, the script should raise a warning so that you can inspect the cases (and reasons) for misalignment. I'm attaching an example of such a script below.
Essentially, it processes the CSV examples one by one and translates them into Prodigy task dictionaries.
It starts by converting the text to a spaCy Doc and then tries to set the entity and relation annotations according to the given character offsets.
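Under the hood that check boils down to spaCy's Doc.char_span, which returns None whenever the character offsets don't line up with token boundaries. A minimal sketch (using a blank English pipeline and the same sentence as in the sample CSV further down):

import spacy

nlp = spacy.blank("en")
doc = nlp("Susan lives in New York")

# Offsets that line up with token boundaries give you a Span object ...
print(doc.char_span(15, 23, label="PERSON"))  # New York

# ... while misaligned offsets return None, which the script below turns into a warning
print(doc.char_span(15, 22, label="PERSON"))  # None: "New Yor" cuts the last token in half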
As for the "ws" attribute, which indicates whether the token is followed by whitespace: we just copy the token.whitespace_ attribute from the spaCy token when translating from the spaCy representation to the Prodigy representation. In fact, we have an undocumented helper get_token that you can import from prodigy.components.preprocess (my example script below imports it) that does just that:
def get_token(token: "Token", i: int) -> Dict[str, Any]:
    """Create a token dict for a Token object. Helper function used inside
    add_tokens preprocessor or standalone if recipes need more flexibility.

    token (spacy.tokens.Token): The token.
    i (int): The index of the token. Important: we're not using token.i here,
        because that might not actually reflect the correct token index in the
        example (e.g. when sentence segmentation is enabled).
    RETURNS (dict): The token representation.
    """
    return {
        "text": token.text,
        "start": token.idx,
        "end": token.idx + len(token.text),
        "id": i,
        "ws": bool(token.whitespace_),
    }
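For instance, calling it on a token of a blank English Doc gives you exactly the dict that goes into the "tokens" list of a Prodigy task (a quick sketch, assuming Prodigy is installed):

import spacy
from prodigy.components.preprocess import get_token

nlp = spacy.blank("en")
doc = nlp("Susan lives in New York")
print(get_token(doc[3], 3))
# {'text': 'New', 'start': 15, 'end': 18, 'id': 3, 'ws': True}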
As for compatibility: as long as you use the same tokenization and the same set of labels (both span and relation labels) for the revised and the "from_scratch" annotations, they should be compatible.
The translation from csv to jsonl via the spaCy Doc object will make sure there are no oddities.
One thing to keep in mind (I included the relevant comment in the script): you need to check whether the end offsets used in your external annotations are inclusive or exclusive. Prodigy uses exclusive end offsets, so while writing the Prodigy task dictionary you need to make sure the end offset is exclusive.
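To make the inclusive/exclusive distinction concrete with plain string slicing (a quick check, not part of the script):

text = "Susan lives in New York"
# Inclusive end offset 22: the character at index 22 ("k") is part of the entity
print(text[15:22 + 1])  # New York
# Prodigy expects the exclusive equivalent, i.e. end offset 23
print(text[15:23])      # New York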
In my example here I assume the input csv is something like:
Sentence,Ent1,Ent2,REL
Susan lives in New York,0:4,15:22,LIVES_IN
The cat sat on the mat yesterday,4:6,19:21,SITS_ON
As you can see, the end offsets are inclusive, so in my script below I increment them by 1 to meet Prodigy's requirement for exclusive end offsets:
import csv
from typing import Dict, Optional, Tuple
import spacy
import srsly
from prodigy.components.preprocess import get_token
from spacy import Language
from spacy.tokens import Doc, Span
from wasabi import msg

def convert_to_span(
    start_char: int, end_char: int, label: str, doc: Doc, idx: int
) -> Optional[Span]:
    span = doc.char_span(start_char, end_char, label=label)
    if span is None:
        msg.warn(
            f"Misaligned tokenization for entity: {start_char}, {end_char} at row: {idx}"
        )
    return span

def add_annotations(
    text: str,
    head: Tuple[int, int],
    child: Tuple[int, int],
    label: str,
    nlp: Language,
    idx: int,
) -> Dict:
    doc = nlp(text)
    # If entity labels are given in the source, they should be extracted here
    # and used instead of UNK. To make sure multi-token entities are displayed
    # correctly, all entity labels should be added to `rel.manual` via
    # `--span-label`.
    entity_label = "UNK"
    # The end offsets in the source CSV are inclusive, while Prodigy (and
    # Doc.char_span) expect exclusive end offsets, so the params are adjusted
    # accordingly (+1).
    head_span = convert_to_span(
        start_char=head[0], end_char=head[1] + 1, label=entity_label, doc=doc, idx=idx
    )
    child_span = convert_to_span(
        start_char=child[0], end_char=child[1] + 1, label=entity_label, doc=doc, idx=idx
    )
    if head_span is None or child_span is None:
        # If the spans are misaligned, return an example w/o any annotations
        return {"text": text, "tokens": [get_token(t, t.i) for t in doc]}
    return {
        "text": doc.text,
        # We are copying the token information directly from the spaCy tokenizer
        "tokens": [get_token(t, t.i) for t in doc],
        # spaCy's Span.end is exclusive, while Prodigy's token_end is the index
        # of the last token in the span, hence the - 1 below
        "spans": [
            {
                "start": entity.start_char,
                "end": entity.end_char,
                "token_start": entity.start,
                "token_end": entity.end - 1,
                "label": entity.label_,
            }
            for entity in [head_span, child_span]
        ],
        "relations": [
            {
                "head": head_span.start,
                "head_span": {
                    "start": head_span.start_char,
                    "end": head_span.end_char,
                    "token_start": head_span.start,
                    "token_end": head_span.end - 1,
                    "label": head_span.label_,
                },
                "child": child_span.start,
                "child_span": {
                    "start": child_span.start_char,
                    "end": child_span.end_char,
                    "token_start": child_span.start,
                    "token_end": child_span.end - 1,
                    "label": child_span.label_,
                },
                "label": label,
            }
        ],
    }

def main():
    nlp = spacy.blank("en")
    jsonl_examples = []
    with open("external_data.csv", mode="r") as file:
        csv_reader = csv.reader(file)
        # Skip the header row
        next(csv_reader)
        for idx, row in enumerate(csv_reader):
            text, head, child, label = row
            head_start_char, head_end_char = map(int, head.split(":"))
            child_start_char, child_end_char = map(int, child.split(":"))
            example = add_annotations(
                text=text,
                head=(head_start_char, head_end_char),
                child=(child_start_char, child_end_char),
                label=label,
                nlp=nlp,
                idx=idx,
            )
            jsonl_examples.append(example)
    output_path = "annotated_data.jsonl"
    srsly.write_jsonl(output_path, jsonl_examples)
    msg.info(f"Saved annotations at {output_path}")


if __name__ == "__main__":
    main()
Try it with just a few examples and see if you can correctly visualize the resulting dataset with rel.manual.
Note that this script also assumes that the first entity is always the head, that the second entity is always the child, and that neither of them has an entity label assigned.
The script assigns a dummy UNK label that needs to be listed under --span-label for rel.manual to correctly group multi-token entities.
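For example, an invocation along these lines should do for a first check (the dataset name is just a placeholder, and it's worth double-checking prodigy rel.manual --help for the exact arguments of your Prodigy version):

prodigy rel.manual rel_review blank:en annotated_data.jsonl --label LIVES_IN,SITS_ON --span-label UNK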