Inquiry on Using Relation Extraction Model for Annotation in Prodigy

Hello,
I am currently training a Named Entity Recognition and Relation Extraction model. I have been using the rel.manual recipe to annotate the relations in my documents and have followed the approach outlined in the rel_component tutorial to train my model (projects/tutorials/rel_component at v3 · explosion/projects · GitHub).
I am reaching out to inquire if there is a way to use this trained model to annotate relations in the Prodigy format. Specifically, I am looking for functionality similar to the ner.model-annotate recipe, which I have successfully used to annotate NER entities in new documents within Prodigy.
Thank you

Welcome to the forum @othmane :wave:

I'm afraid there's no built-in recipe that would let you pre-annotate with relation annotations. It should be rather straightforward, though, to adapt the existing model-annotate recipes to your use case.
You can inspect the implementation of the built-in ner.model-annotate recipe at your-prodigy-installation-path/prodigy/recipes/ner.py (you can find out your Prodigy installation path by running prodigy stats).
Now, if you take the ner.model-annotate recipe as an example, you'll see that the main thing you'd need to substitute is the logic that adds relation annotations to the Prodigy task from the spaCy doc. For ner.model-annotate, this logic is implemented in the make_ner_suggestions function, which looks like this:

def make_ner_suggestions(
    stream: StreamType,
    nlp: Language,
    component: str,
    labels: Iterable[str],
    batch_size: int = DEFAULT_NLP_BATCH_SIZE,
    show_progress_bar: bool = False,
    progress_bar_total: Optional[int] = None,
) -> StreamType:
 """Add a 'spans' key to each example, with predicted entities."""
    validate_component(nlp=nlp, component=component)
    texts = ((eg["text"], eg) for eg in stream)
    for doc, eg in tqdm(
        nlp.pipe(texts, as_tuples=True, batch_size=batch_size),
        total=progress_bar_total,
        disable=not show_progress_bar,
    ):
        task = copy.deepcopy(eg)
        spans = []
        for ent in doc.ents:
            if labels and ent.label_ not in labels:
                continue
            spans.append(
                {
                    "token_start": ent.start,
                    "token_end": ent.end - 1,
                    "start": ent.start_char,
                    "end": ent.end_char,
                    "text": ent.text,
                    "label": ent.label_,
                }
            )
        task["spans"] = spans
        if is_llm_component(nlp, component):
            task["llm"] = doc.user_data["llm_io"][component]
        yield task

So in your case, you would need a very similar function that, apart from spans, also sets relations by accessing doc._.rel, or however your model stores the relation predictions. You can inspect the structure of the expected relations dictionary by looking at your Prodigy relation annotations, or at the JSON task format example here.
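
For illustration, here's a minimal sketch of what such a function could look like, adapted from make_ner_suggestions (reusing the imports already present in ner.py, with the progress-bar plumbing left out). The make_rel_suggestions name, the 0.5 threshold and the choice to anchor the arcs on the last token of each span are assumptions for this sketch; it also assumes doc._.rel maps (head_start, child_start) entity start offsets to {label: score} dicts, as in the rel_component tutorial:

def make_rel_suggestions(
    stream: StreamType,
    nlp: Language,
    labels: Iterable[str],
    batch_size: int = DEFAULT_NLP_BATCH_SIZE,
    threshold: float = 0.5,  # illustrative cut-off, tune as needed
) -> StreamType:
    """Add 'spans' and 'relations' keys to each example, with predictions."""

    def span_dict(ent) -> dict:
        # Prodigy's token_end is inclusive; spaCy's Span.end is exclusive
        return {
            "token_start": ent.start,
            "token_end": ent.end - 1,
            "start": ent.start_char,
            "end": ent.end_char,
            "text": ent.text,
            "label": ent.label_,
        }

    texts = ((eg["text"], eg) for eg in stream)
    for doc, eg in nlp.pipe(texts, as_tuples=True, batch_size=batch_size):
        task = copy.deepcopy(eg)
        task["spans"] = [span_dict(ent) for ent in doc.ents]
        ents_by_start = {ent.start: ent for ent in doc.ents}
        relations = []
        # Assumes doc._.rel maps (head_start, child_start) entity start
        # offsets to {label: score} dicts, as in the rel_component tutorial
        for (head_start, child_start), rel_dict in doc._.rel.items():
            for label, score in rel_dict.items():
                if score < threshold or (labels and label not in labels):
                    continue
                head = ents_by_start.get(head_start)
                child = ents_by_start.get(child_start)
                if head is None or child is None:
                    continue
                relations.append(
                    {
                        "head": head.end - 1,  # token the arc attaches to
                        "child": child.end - 1,
                        "head_span": span_dict(head),
                        "child_span": span_dict(child),
                        "label": label,
                    }
                )
        task["relations"] = relations
        yield task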
Once you have this logic in place, you just need to apply it in place of make_ner_suggestions in line 516, change the view_id to relations in line 530, and that should be it!
Let me know if you run into issues writing this function, and if you do, please share the name of the attribute and the data structure of your relation predictions as stored in the spaCy doc object.

Hello,
Thank you for the detailed guidance. I wanted to let you know that instead of modifying the model-annotate recipe, I opted to add a script that converts the model predictions into Prodigy's format. It seems to be working well so far. The code is inspired by projects/tutorials/rel_component/scripts/evaluate.py at v3 · explosion/projects · GitHub. However, I'd appreciate your feedback on my approach. Here is the code I'm using:

import json
from pathlib import Path

import spacy
from spacy.tokens import Doc, DocBin

def main(trained_pipeline: Path, test_data: Path, output_path: Path, print_details: bool):
    nlp = spacy.load(trained_pipeline)

    doc_bin = DocBin(store_user_data=True).from_disk(test_data)
    docs = doc_bin.get_docs(nlp.vocab)
    prodigy_format_data = []
    for gold in docs:
        pred = Doc(
            nlp.vocab,
            words=[t.text for t in gold],
            spaces=[t.whitespace_ for t in gold],
        )
        pred.ents = gold.ents
        for name, proc in nlp.pipeline:
            pred = proc(pred)


        # Transform predictions to Prodigy format
        relations = []
        for value, rel_dict in pred._.rel.items():
            for label, score in rel_dict.items():
                if score >= 0.5:  # Adjust threshold as needed
                    head_start = value[0]
                    child_start = value[1]
                    
                    head_ent = next((ent for ent in pred.ents if ent.end == head_start), None)
                    child_ent = next((ent for ent in pred.ents if ent.end == child_start), None)

                    if head_ent and child_ent:
                        head_span = {
                            "start": head_ent.start_char,
                            "end": head_ent.end_char,
                            "token_start": head_ent.start,
                            "token_end": head_ent.end,
                            "label": head_ent.label_,
                        }
                        child_span = {
                            "start": child_ent.start_char,
                            "end": child_ent.end_char,
                            "token_start": child_ent.start,
                            "token_end": child_ent.end,
                            "label": child_ent.label_,
                        }
                        relations.append({
                            "head": head_start,
                            "child": child_start,
                            "head_span": head_span,
                            "child_span": child_span,
                            "color": "#d9fbad" if label == "A pour status" else "#c2f2f6",
                            "label": label,
                        })
        prodigy_format_data.append({
            "text": gold.text,
            "tokens": [{"text": t.text, "start": t.idx, "end": t.idx + len(t)} for t in gold],
            "spans": [{"start": e.start_char, "end": e.end_char, "token_start": e.start, "token_end": e.end - 1, "label": e.label_} for e in gold.ents],
            "relations": relations,
        })


    # Write the Prodigy formatted data to a JSONL file
    with open(output_path, 'w', encoding='utf-8') as f:
        for example in prodigy_format_data:
            f.write(json.dumps(example, ensure_ascii=False) + '\n')

Hi @othmane!

Your approach is definitely a great alternative! Just one thing: in Prodigy the token_end indices are inclusive. You are subtracting 1 from spaCy's token.end when setting the spans, but not when setting the head_span and the child_span - for consistency, you should use the same inclusive indices in both places.
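
For example, with a blank French pipeline:

import spacy

nlp = spacy.blank("fr")
doc = nlp("Jean Dupont travaille à Paris")
ent = doc[0:2]  # pretend "Jean Dupont" is an entity span
print(ent.start, ent.end)      # 0 2 -> spaCy's Span.end is exclusive
print(ent.start, ent.end - 1)  # 0 1 -> Prodigy's token_end is inclusive
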
Another thing to consider would be to add an _annotator_id attribute so that you know the source of the annotation - but only if you care about that, of course. It's not required.
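
For example, just before writing out the JSONL (the rel_model string is an arbitrary identifier, not a required value):

for example in prodigy_format_data:
    # Record where these annotations came from; any string works
    example["_annotator_id"] = "rel_model"
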
Other than that - well done for sure! :clap:


Thank you for your response. Following your advice, I subtracted 1 from token.end for both head_span and child_span. I also noticed the spans were missing the associated text, so I added that as well. Here is the updated code:

# make the factory work
from rel_pipe import make_relation_extractor, score_relations

# make the config work
from rel_model import create_relation_model, create_classification_layer, create_instances, create_tensors
import json
from pathlib import Path

import spacy
import typer
from spacy.tokens import Doc, DocBin

def main(trained_pipeline: Path, test_data: Path, output_path: Path):
    nlp = spacy.load(trained_pipeline)

    doc_bin = DocBin(store_user_data=True).from_disk(test_data)
    docs = doc_bin.get_docs(nlp.vocab)
    prodigy_format_data = []
    for gold in docs:
        pred = Doc(
            nlp.vocab,
            words=[t.text for t in gold],
            spaces=[t.whitespace_ for t in gold],
        )
        pred.ents = gold.ents
        for name, proc in nlp.pipeline:
            pred = proc(pred)


        # Transform predictions to Prodigy format
        relations = []
        for value, rel_dict in pred._.rel.items():
            for label, score in rel_dict.items():
                if score >= 0.5:  # Adjust threshold as needed
                    head_start = value[0]
                    child_start = value[1]
                    
                    head_ent = next((ent for ent in pred.ents if ent.end == head_start), None)
                    child_ent = next((ent for ent in pred.ents if ent.end == child_start), None)

                    if head_ent and child_ent:
                        head_span = {
                            "start": head_ent.start_char,
                            "end": head_ent.end_char,
                            "token_start": head_ent.start,
                            "token_end": head_ent.end - 1,
                            "label": head_ent.label_,
                        }
                        child_span = {
                            "start": child_ent.start_char,
                            "end": child_ent.end_char,
                            "token_start": child_ent.start,
                            "token_end": child_ent.end - 1,
                            "label": child_ent.label_,
                        }
                        relations.append({
                            "head": head_start - 1,
                            "child": child_start -1,
                            "head_span": head_span,
                            "child_span": child_span,
                            "color": "#d9fbad" if label == "A pour status" else "#c2f2f6",
                            "label": label,
                        })
        prodigy_format_data.append({
            "text": gold.text,
            "tokens": [{"text": t.text, "start": t.idx, "end": t.idx + len(t)} for t in gold],
            "spans": [{"text": e.text,"start": e.start_char, "end": e.end_char, "token_start": e.start, "token_end": e.end - 1, "label": e.label_} for e in gold.ents],
            "relations": relations,
            "meta": gold.user_data.get("meta", {})
        })
    # Write the Prodigy formatted data to a JSONL file
    with open(output_path, 'w', encoding='utf-8') as f:
        for example in prodigy_format_data:
            f.write(json.dumps(example, ensure_ascii=False) + '\n')

if __name__ == "__main__":
    typer.run(main)

Here is an example of the output generated by this script:
RE_Model_Annotation.jsonl (65.2 KB)

However, when I ran this command to check the model annotations:

prodigy rel.manual dataset fr_core_news_sm data.jsonl -l "label1","label2"

I received the following warning: ⚠ Skipped 17 span(s) that were already present in the input data because the tokenization didn't match.
I also tried running the command:

prodigy ner.manual test fr_core_news_sm data.jsonl -l label_1 

and I received this error:
09:22:31: PREPROCESS: Tokenizing examples (running tokenizer only)
Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "/usr/local/lib/python3.11/site-packages/prodigy/__main__.py", line 50, in <module>
    main()
  File "/usr/local/lib/python3.11/site-packages/prodigy/__main__.py", line 44, in main
    controller = run_recipe(run_args)
                 ^^^^^^^^^^^^^^^^^^^^
  File "cython_src/prodigy/cli.pyx", line 129, in prodigy.cli.run_recipe
  File "cython_src/prodigy/core.pyx", line 155, in prodigy.core.Controller.from_components
  File "cython_src/prodigy/core.pyx", line 307, in prodigy.core.Controller.__init__
  File "cython_src/prodigy/components/stream.pyx", line 189, in prodigy.components.stream.Stream.is_empty
  File "cython_src/prodigy/components/stream.pyx", line 204, in prodigy.components.stream.Stream.peek
  File "cython_src/prodigy/components/stream.pyx", line 317, in prodigy.components.stream.Stream._get_from_iterator
  File "cython_src/prodigy/components/decorators.pyx", line 165, in inner
  File "cython_src/prodigy/components/preprocess.pyx", line 203, in add_tokens
  File "cython_src/prodigy/components/preprocess.pyx", line 275, in prodigy.components.preprocess._add_tokens
  File "cython_src/prodigy/components/preprocess.pyx", line 237, in prodigy.components.preprocess.sync_spans_to_tokens
KeyError: 'id'
Could you please help me identify what might be causing this issue?

Hello again,
I addressed the issue with rel.manual by adding the ws attribute to the tokens. Take a look at the snippet below:

            "text": gold.text,
            "tokens": [{"text": t.text, "start": t.idx, "end": t.idx + len(t), "ws": t.whitespace_} for t in gold],
            "spans": [{"text": e.text,"start": e.start_char, "end": e.end_char, "token_start": e.start, "token_end": e.end - 1, "label": e.label_} for e in gold.ents],
            "relations": relations,
            "meta": gold.user_data.get("meta", {})
        })

Here is an example of the output generated by the new script:
example-2.jsonl (78.5 KB)

However, I'm still encountering the same issue when executing the command:

prodigy ner.manual test fr_core_news_sm data.jsonl -l label_1 

Hi @othmane,

Apologies for the delay in replying! There are two more things to fix in the current version: ws should be a bool, and each token should have an id attribute, so:

"text": gold.text,
            "tokens": [{"text": t.text, "id": i, "start": t.idx, "end": t.idx + len(t), "ws": bool(t.whitespace_)} for i,t in enumerate(gold)],
            "spans": [{"text": e.text,"start": e.start_char, "end": e.end_char, "token_start": e.start, "token_end": e.end - 1, "label": e.label_} for e in gold.ents],
            "relations": relations,
            "meta": gold.user_data.get("meta", {})
        })

Now, with these edits, the latest input file you shared (example-2.jsonl) loads correctly into both ner.manual and rel.manual.
The error you reported with ner.manual says that the file you're inputting is empty - maybe double-check that that's not actually the case? Again, after fixing these two attributes I couldn't reproduce the empty-file error with example-2.jsonl.

One final comment: the relations UI has been designed to work with shorter snippets. In fact, you'll see a warning in the console that your snippets are really long. This is because relations are expected to occur within a relatively short context window, so to prevent an excessive amount of scrolling, it is recommended to split long texts into meaningful snippets before annotating - see the sketch below. It would probably be hard to learn relations spanning multiple sentences or paragraphs, anyway.
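
For instance, here's a minimal sketch of pre-splitting a JSONL input file into sentence-level snippets. The long_docs.jsonl and snippets.jsonl file names are illustrative, and it assumes a pipeline that sets sentence boundaries, such as fr_core_news_sm's parser:

import json

import spacy

nlp = spacy.load("fr_core_news_sm")  # parser provides sentence boundaries

with open("long_docs.jsonl", encoding="utf-8") as f_in, \
        open("snippets.jsonl", "w", encoding="utf-8") as f_out:
    for line in f_in:
        eg = json.loads(line)
        doc = nlp(eg["text"])
        for sent in doc.sents:
            # One annotation task per sentence, carrying over the metadata
            snippet = {"text": sent.text, "meta": eg.get("meta", {})}
            f_out.write(json.dumps(snippet, ensure_ascii=False) + "\n")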