Hi @evince360,
Thanks for providing the sample of the input data.
The structure of your input is indeed correct, and the span offsets and tokens are aligned within your example. The problem is that your tokenization does not include the whitespace between tokens, while Prodigy assumes the tokenization takes it into account, and adds the whitespace attribute (`"ws": true`) to each token by default if it's not specified in the incoming data.
This is why, after Prodigy's preprocessing, the examples become misaligned and Prodigy rejects the existing spans.
To illustrate:
Here are the first two tokens from the input file:

```
{'text': 'W', 'start': 0, 'end': 1, 'id': 0}
{'text': 'drodze', 'start': 1, 'end': 7, 'id': 1}
```
As you can see, there's no whitespace character between `W` and `drodze`. The Prodigy UI needs that information, so it adds it by default, with the effect that the resulting tokenization becomes:
```
{'text': 'W', 'start': 0, 'end': 1, 'id': 0}
{'text': 'drodze', 'start': 2, 'end': 8, 'id': 1}
```
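To see the effect in isolation, here's a minimal, self-contained sketch (plain Python, using your two sample tokens, no Prodigy required) of how assuming a whitespace after each token shifts every offset past the first token:

```python
# Your original tokens, with offsets computed over the unspaced text.
tokens = [
    {"text": "W", "start": 0, "end": 1, "id": 0},
    {"text": "drodze", "start": 1, "end": 7, "id": 1},
]

# What gets rendered once a whitespace is assumed after each token:
rendered = " ".join(t["text"] for t in tokens)
print(rendered)  # W drodze

# The original offsets point into the unspaced text, so slicing the
# rendered text with them grabs the wrong characters.
print(repr(rendered[tokens[1]["start"]:tokens[1]["end"]]))  # ' drodz'

# In the rendered text, 'drodze' actually starts at character 2.
print(rendered.index("drodze"))  # 2
```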
This way the text can be rendered correctly, and the spans you are annotating make more sense. Your current `EU` entity is `uchwałynr14/2018`, while what we should really be annotating, training on, and eventually extracting is `uchwały nr 14/2018`.
That explains why the misalignment is happening. Now, what to do about it:
Option 1: fix your current model's tokenization so that it takes the whitespace between tokens into account, and redo the preannotation.
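If your pipeline allows it, one way to get whitespace-aware tokenization for Option 1 is to retokenize with a spaCy pipeline (a sketch, assuming a blank Polish pipeline is close enough to your model's tokenization), since spaCy token offsets account for the whitespace between tokens:

```python
import spacy

nlp = spacy.blank("pl")
doc = nlp("W drodze uchwały nr 14/2018")

# spaCy tokens carry correct character offsets over the original text,
# including the whitespace between tokens, which is what Prodigy expects.
tokens = [
    {"text": t.text, "start": t.idx, "end": t.idx + len(t.text), "id": t.i}
    for t in doc
]
print(tokens[0])  # {'text': 'W', 'start': 0, 'end': 1, 'id': 0}
print(tokens[1])  # {'text': 'drodze', 'start': 2, 'end': 8, 'id': 1}
```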
Option 2: retokenize the current text by adding a whitespace by default after each token. This is not perfectly exact, as there might be tokens that have no whitespace after them, but in the case of Polish it shouldn't be much trouble as long as it is consistent.
Fixing the span offsets should then be easy, as the token indices are correct; all we need to fix are the character offsets.
Here's a script for the retokenization described in Option 2. You might want to test it out a bit more, but it should work:
```python
import copy
from typing import Dict, List

import spacy
import srsly
from spacy.language import Language
from spacy.tokens import Doc
from wasabi import msg


def update_tokens(tokens: List[Dict]) -> List[Dict]:
    """Recompute character offsets assuming one whitespace after each token."""
    modified_tokens = []
    prev_end = -1  # initialize to -1 to handle the first token correctly
    for token in tokens:
        token_copy = copy.deepcopy(token)
        if token["id"] > 0:
            token_copy["start"] = prev_end + 1
            token_copy["end"] = token_copy["start"] + len(token_copy["text"])
        token_copy["ws"] = True
        modified_tokens.append(token_copy)
        prev_end = token_copy["end"]
    return modified_tokens


def update_spans(spans: List[Dict], tokens: List[Dict]) -> List[Dict]:
    """Rewrite span character offsets from the (already correct) token indices."""
    return [
        {
            **span,
            "start": tokens[span["token_start"]]["start"],
            "end": tokens[span["token_end"]]["end"],
        }
        for span in spans
    ]


def is_aligned(eg: Dict, nlp: Language) -> bool:
    """Check that every span maps onto token boundaries of the rebuilt Doc."""
    words = [token["text"] for token in eg["tokens"]]
    spaces = [token.get("ws", True) for token in eg["tokens"]]
    doc = Doc(nlp.vocab, words=words, spaces=spaces)
    return all(
        doc.char_span(s["start"], s["end"], s["label"]) is not None
        for s in eg.get("spans", [])
    )


def process_example(eg: Dict, nlp: Language) -> Dict:
    eg_copy = eg.copy()
    eg_copy["tokens"] = update_tokens(eg["tokens"])
    eg_copy["spans"] = update_spans(eg.get("spans", []), eg_copy["tokens"])
    return eg_copy


def main():
    nlp = spacy.blank("pl")
    data = srsly.read_jsonl("input.jsonl")
    modified_data = []
    for eg in data:
        processed_eg = process_example(eg, nlp)
        if is_aligned(processed_eg, nlp):
            modified_data.append(processed_eg)
        else:
            msg.warn("Misaligned example")
            print("Tokens:", *processed_eg["tokens"], sep="\n")
            print("Spans:", *processed_eg["spans"], sep="\n")
            return
    srsly.write_jsonl("modified_data.jsonl", modified_data)


if __name__ == "__main__":
    main()
```
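As a quick sanity check (a minimal sketch, independent of the script), you can rebuild the corrected example as a spaCy `Doc` and confirm that `char_span` now resolves, which is exactly what the `is_aligned` helper verifies for every example:

```python
import spacy
from spacy.tokens import Doc

nlp = spacy.blank("pl")

# The corrected tokens: every token is followed by one whitespace (ws=True).
doc = Doc(nlp.vocab, words=["W", "drodze"], spaces=[True, True])
print(repr(doc.text))  # 'W drodze '

# With the shifted offsets, a span over "drodze" resolves cleanly...
print(doc.char_span(2, 8, "EU"))  # drodze

# ...while the old, unspaced offsets no longer land on token boundaries.
print(doc.char_span(1, 7, "EU"))  # None
```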
This should result in the Prodigy UI finding and rendering the spans correctly: