Hi @evince360,
Thanks for providing the sample of the input data.
The structure of your input is indeed correct, and the span offsets and tokens are aligned within your example. The problem is that your tokenization does not include the whitespace between tokens, while Prodigy assumes the tokenization takes it into account, and adds the whitespace attribute (`"ws": true`) to each token by default if it's not specified in the incoming data.
This is why, after Prodigy's preprocessing, the examples become misaligned and Prodigy rejects the existing spans.
To illustrate:
Here are the first two tokens from the input file:

```
{'text': 'W', 'start': 0, 'end': 1, 'id': 0}
{'text': 'drodze', 'start': 1, 'end': 7, 'id': 1}
```
As you can see, there's no whitespace character between `W` and `drodze`. The Prodigy UI needs that information, so it adds it by default, with the effect that the resulting tokenization becomes:
```
{'text': 'W', 'start': 0, 'end': 1, 'id': 0}
{'text': 'drodze', 'start': 2, 'end': 8, 'id': 1}
```
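To see the effect in isolation, here's a minimal, self-contained sketch (plain Python, using your two sample tokens, no Prodigy required) of how assuming a whitespace after each token shifts every offset past the first token:

```python
# Your original tokens, with offsets computed over the unspaced text.
tokens = [
    {"text": "W", "start": 0, "end": 1, "id": 0},
    {"text": "drodze", "start": 1, "end": 7, "id": 1},
]

# What gets rendered once a whitespace is assumed after each token:
rendered = " ".join(t["text"] for t in tokens)
print(rendered)  # W drodze

# The original offsets point into the unspaced text, so slicing the
# rendered text with them grabs the wrong characters.
print(repr(rendered[tokens[1]["start"]:tokens[1]["end"]]))  # ' drodz'

# In the rendered text, 'drodze' actually starts at character 2.
print(rendered.index("drodze"))  # 2
```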
This way the text can be rendered correctly, and the spans you are annotating make more sense. Your current `EU` entity is `uchwałynr14/2018`, while what we should really be annotating, training on, and eventually extracting is `uchwały nr 14/2018`.
That explains why the misalignment is happening. Now, what to do about it:
Option 1: fix your current model's tokenization so that it takes the whitespace between tokens into account, and redo the preannotation.
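If your pipeline allows it, one way to get whitespace-aware tokenization for Option 1 is to retokenize with a spaCy pipeline (a sketch, assuming a blank Polish pipeline is close enough to your model's tokenization), since spaCy token offsets account for the whitespace between tokens:

```python
import spacy

nlp = spacy.blank("pl")
doc = nlp("W drodze uchwały nr 14/2018")

# spaCy tokens carry correct character offsets over the original text,
# including the whitespace between tokens, which is what Prodigy expects.
tokens = [
    {"text": t.text, "start": t.idx, "end": t.idx + len(t.text), "id": t.i}
    for t in doc
]
print(tokens[0])  # {'text': 'W', 'start': 0, 'end': 1, 'id': 0}
print(tokens[1])  # {'text': 'drodze', 'start': 2, 'end': 8, 'id': 1}
```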
Option 2: retokenize the current text by adding a whitespace by default after each token. This is not perfectly exact, as there might be tokens that have no whitespace after them, but in the case of Polish it shouldn't be much trouble as long as it is consistent.
Fixing the span offsets should then be easy, as the token indices are correct; all we need to fix are the character offsets.
Here's a script for the retokenization described in Option 2. You might want to test it out a bit more, but it should work:
```python
import copy
from typing import Dict, List

import spacy
import srsly
from spacy.language import Language
from spacy.tokens import Doc
from wasabi import msg


def update_tokens(tokens: List[Dict]) -> List[Dict]:
    """Recompute character offsets assuming one whitespace after each token."""
    modified_tokens = []
    prev_end = -1  # initialize to -1 to handle the first token correctly
    for token in tokens:
        token_copy = copy.deepcopy(token)
        if token["id"] > 0:
            token_copy["start"] = prev_end + 1
            token_copy["end"] = token_copy["start"] + len(token_copy["text"])
        token_copy["ws"] = True
        modified_tokens.append(token_copy)
        prev_end = token_copy["end"]
    return modified_tokens


def update_spans(spans: List[Dict], tokens: List[Dict]) -> List[Dict]:
    """Rewrite span character offsets from the (already correct) token indices."""
    return [
        {
            **span,
            "start": tokens[span["token_start"]]["start"],
            "end": tokens[span["token_end"]]["end"],
        }
        for span in spans
    ]


def is_aligned(eg: Dict, nlp: Language) -> bool:
    """Check that every span maps onto token boundaries of the rebuilt Doc."""
    words = [token["text"] for token in eg["tokens"]]
    spaces = [token.get("ws", True) for token in eg["tokens"]]
    doc = Doc(nlp.vocab, words=words, spaces=spaces)
    return all(
        doc.char_span(s["start"], s["end"], s["label"]) is not None
        for s in eg.get("spans", [])
    )


def process_example(eg: Dict, nlp: Language) -> Dict:
    eg_copy = eg.copy()
    eg_copy["tokens"] = update_tokens(eg["tokens"])
    eg_copy["spans"] = update_spans(eg.get("spans", []), eg_copy["tokens"])
    return eg_copy


def main():
    nlp = spacy.blank("pl")
    data = srsly.read_jsonl("input.jsonl")
    modified_data = []
    for eg in data:
        processed_eg = process_example(eg, nlp)
        if is_aligned(processed_eg, nlp):
            modified_data.append(processed_eg)
        else:
            msg.warn("Misaligned example")
            print("Tokens:", *processed_eg["tokens"], sep="\n")
            print("Spans:", *processed_eg["spans"], sep="\n")
            return
    srsly.write_jsonl("modified_data.jsonl", modified_data)


if __name__ == "__main__":
    main()
```
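As a quick sanity check (a minimal sketch, independent of the script), you can rebuild the corrected example as a spaCy `Doc` and confirm that `char_span` now resolves, which is exactly what the `is_aligned` helper verifies for every example:

```python
import spacy
from spacy.tokens import Doc

nlp = spacy.blank("pl")

# The corrected tokens: every token is followed by one whitespace (ws=True).
doc = Doc(nlp.vocab, words=["W", "drodze"], spaces=[True, True])
print(repr(doc.text))  # 'W drodze '

# With the shifted offsets, a span over "drodze" resolves cleanly...
print(doc.char_span(2, 8, "EU"))  # drodze

# ...while the old, unspaced offsets no longer land on token boundaries.
print(doc.char_span(1, 7, "EU"))  # None
```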
This should result in the Prodigy UI finding and rendering the spans correctly: