Training a relation extraction component

Hi,

I'm using the rel.manual recipe to annotate named entities as well as relations in a training dataset. Both tasks are done at the same time: --label enables relation annotation while --span-label enables named entity annotation.

I managed to train a NER model quite easily with the train recipe, but I am still struggling to train a relation extraction component. While browsing the documentation and the support forum, I understood that there is no "easy" option to do so yet, but I still don't understand how to resolve this.

Would it be possible to provide a very clear and detailed explanation, step by step?

How can I exploit my annotations for relation extraction, and how can I extract relations with spaCy?

Thanks

hi @stella!

Thanks for your question and welcome to the Prodigy community :wave:

Have you watched @SofieVL's Relations Extraction video and tutorial:

You may also find this post to be helpful:

This post helps explain why there isn't a simple approach: spaCy doesn't have a built-in relation prediction component:

You may also find spaCy's GitHub Discussions forum to be more helpful, as training relations is more of a spaCy problem than a Prodigy one. The spaCy core team supports that forum and can help answer. There are already related posts like:

Hi,

Thanks for the answer. I already came across most of it while browsing the documentation.

I need to be sure about the process for solving my issue.

Let's say I've already annotated relations with the rel.manual recipe from Prodigy and used the train recipe for NER model training. Concerning relations, instead of using the train recipe (since there is no automatic way to use the annotated relations), may I use the data-to-spacy recipe?

Then, what should I do? May I exploit the training data with a deep learning library (Thinc / TensorFlow / PyTorch) in order to create a custom relation extraction component for spaCy? I still don't get how to use the relations you annotate with Prodigy. I'm no expert, but I didn't see how training data was used in Sofie's video. Didn't she seem to create her relation extraction component from predictions on documents on which only NER was applied? To me it looks like I annotated relations in Prodigy for nothing?

Thanks

Hi @stella!

Yes, this is exactly the setup Sofie does. She explicitly says from the beginning she's going to assume she already has a trained ner component.

Yes! Sofie used Thinc for training. You can see the training code here; she carefully explains the Thinc model script from 8:11 to 18:30. She then describes, around 22:55, an overview of the TrainablePipe API and how to implement the custom component. You may not need to know all of the details, and you can luckily leverage a lot of the project she developed.

I've tried my best to simplify the steps you'd need to do to train your relations component:

1. Clone the rel_component tutorial

python -m spacy project clone tutorials/rel_component

This step assumes you already have spaCy installed, ideally in a fresh virtual environment.

2. Replace annotations with your data

The simplest approach would be to db-out your relations annotations and replace assets/annotations.jsonl with your new file.
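For example, exporting from the Prodigy database looks like this (the dataset name here is a placeholder; use whatever you named your annotation dataset):

```shell
# "rel_annotations" is a placeholder for your Prodigy dataset name
python -m prodigy db-out rel_annotations > assets/annotations.jsonl
```

db-out writes the annotations as JSONL to stdout when no output directory is given, so redirecting it straight over assets/annotations.jsonl works.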

3. Modify parse_data.py based on your unique labels

This is likely the toughest step as you'll need to modify her code a bit.

Unfortunately, there isn't a data-to-spacy command for relations as described since there isn't a spaCy component for training relations:

However, Sofie created the script parse_data.py. You may need to simply modify the code, which is similar to this post:

Also, at the end of that post, here's a related post that has an example of someone who modified parse_data.py too:

This user used a different tool for annotation, but then modified the parse_data.py code (see here) to convert the data into a .spacy format. Hopefully that'll give you enough to modify the script.

If you're still having difficulty, please provide an example of your data and your attempted script. We can then help coach you.

4. Train model

I recommend starting with CPU. So if your data is in .spacy format in the data folder (which is what parse_data.py produces), you can then run spacy train.

Since this project is set up as a spaCy project, you can then run:

python -m spacy project run train_cpu

This is essentially just running spacy train (see project.yml):

python -m spacy train ${vars.tok2vec_config} --output training --paths.train ${vars.train_file} --paths.dev ${vars.dev_file} -c ./scripts/custom_functions.py

The ${xxx} are spaCy project variables that are specified in the project.yml, e.g., ${vars.tok2vec_config} points to this file:
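For context, those variables live in the vars section of project.yml. As a rough sketch (the actual paths are defined in the cloned project's project.yml, so check there):

```yaml
vars:
  tok2vec_config: "configs/rel_tok2vec.cfg"
  train_file: "data/train.spacy"
  dev_file: "data/dev.spacy"
```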

Recommendation: run the existing project before using your own data

Before beginning on your own data, I would recommend running the sample project first on Sofie's data to make sure you have everything set up correctly (e.g., the correct spaCy version). This would take just two commands:

# assume you have spacy in activated venv
python -m spacy project clone tutorials/rel_component
# this will run the default parse_data, train_cpu, and evaluate commands
python -m spacy project run all 

By running just these two commands, you should be able to rebuild Sofie's trained relation extraction component. She discusses this part in more detail from 32:10 - 34:00, including some background on spaCy projects.

More advanced: adding in transformer

Once you get the CPU version running, I would recommend following Sofie's instructions on training with a transformer. The big difference is that you'll need spacy-transformers installed and will need to run the transformer config. I highly recommend watching the video from 34:39 - 37:15, where Sofie discusses using the transformer for training in more detail.
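Assuming the cloned project keeps Sofie's command names (check project.yml; the command name below comes from her project and may differ in your copy), that would look roughly like:

```shell
pip install spacy-transformers
# verify the command name in the cloned project's project.yml before running
python -m spacy project run train_gpu
```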

Hope this helps!


Great! Thanks Ryan, this helps a lot, you're a lifesaver!

To fully understand: may I ask how the information of the annotated relations is stored in the NER model? I assumed this information would be lost in the process.

Is training the NER model using the train recipe (with only the --ner parameter) correct?

I don't understand the question. Are you asking about the format of the relations and NER annotated spans? It's important to know that in the relations interface, you can label both entities and relations.

If so, you can see the relations interface docs:

The relations interface lets you annotate directional labelled relationships between tokens and expressions by clicking on the “head” and selecting the “child”, and optionally also assign spans for joint entity and dependency annotation. Single expressions can be part of multiple overlapping relations and you can configure the UI to only show arcs on hover and switch between a single-line tree view and a display with tokens wrapping across lines. If you’re in span annotation mode, clicking on a span will select it so you can click or hit D to delete it. To add a span, click and drag across the tokens, or hold down SHIFT and click on the start and end token.

Here's an example of how the annotated entities ("spans") and "relations" look:

{
  "text": "My mother’s name is Sasha Smith. She likes dogs and pedigree cats.",
  "tokens": [
    {"text": "My", "start": 0, "end": 2, "id": 0, "ws": true},
    {"text": "mother", "start": 3, "end": 9, "id": 1, "ws": false},
    {"text": "’s", "start": 9, "end": 11, "id": 2, "ws": true},
    {"text": "name", "start": 12, "end": 16, "id": 3, "ws": true },
    {"text": "is", "start": 17, "end": 19, "id": 4, "ws": true },
    {"text": "Sasha", "start": 20, "end": 25, "id": 5, "ws": true},
    {"text": "Smith", "start": 26, "end": 31, "id": 6, "ws": true},
    {"text": ".", "start": 31, "end": 32, "id": 7, "ws": true, "disabled": true},
    {"text": "She", "start": 33, "end": 36, "id": 8, "ws": true},
    {"text": "likes", "start": 37, "end": 42, "id": 9, "ws": true},
    {"text": "dogs", "start": 43, "end": 47, "id": 10, "ws": true},
    {"text": "and", "start": 48, "end": 51, "id": 11, "ws": true, "disabled": true},
    {"text": "pedigree", "start": 52, "end": 60, "id": 12, "ws": true},
    {"text": "cats", "start": 61, "end": 65, "id": 13, "ws": true},
    {"text": ".", "start": 65, "end": 66, "id": 14, "ws": false, "disabled": true}
  ],
  "spans": [
    {"start": 20, "end": 31, "token_start": 5, "token_end": 6, "label": "PERSON"},
    {"start": 43, "end": 47, "token_start": 10, "token_end": 10, "label": "NP"},
    {"start": 52, "end": 65, "token_start": 12, "token_end": 13, "label": "NP"}
  ],
  "relations": [
    {
      "head": 0,
      "child": 1,
      "label": "POSS",
      "head_span": {"start": 0, "end": 2, "token_start": 0, "token_end": 0, "label": null},
      "child_span": {"start": 3, "end": 9, "token_start": 1, "token_end": 1, "label": null}
    },
    {
      "head": 1,
      "child": 8,
      "label": "COREF",
      "head_span": {"start": 3, "end": 9, "token_start": 1, "token_end": 1, "label": null},
      "child_span": {"start": 33, "end": 36, "token_start": 8, "token_end": 8, "label": null}
    },
    {
      "head": 9,
      "child": 13,
      "label": "OBJECT",
      "head_span": {"start": 37, "end": 42, "token_start": 9, "token_end": 9, "label": null},
      "child_span": {"start": 52, "end": 65, "token_start": 12, "token_end": 13, "label": "NP"}
    }
  ]
}

Relationships are defined as dictionaries with a "head" and a "child" token index, indicating the direction of the arrow, the corresponding "head_span" and "child_span" describing the tokens or spans the relation is attached to, as well as a relation "label". If "spans" are present in the data, they will be displayed as a merged entity. If relations are added to spans, they will always refer to the last token as the head and child, respectively. If spans with existing relations are merged or split, Prodigy will always try to resolve and reconcile the indices.

Also, be sure to check out the four FAQ questions for relations UI.

So yes, you can train an ner model with prodigy train --ner ner_dataset, where ner_dataset is the name of your Prodigy dataset.

It's important to know that prodigy train is just a wrapper for spacy train. It was developed so users can train as quickly as possible. However, once you create a good workflow, you may find that you can train both the ner and relations components at the same time if you add them to your config. Just make sure ner comes before relations in your pipeline if you need the entities for the relations prediction.
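For example, a sketch of the pipeline order in the training config (the relation_extractor component name comes from Sofie's project; yours may differ):

```ini
[nlp]
lang = "en"
pipeline = ["tok2vec","ner","relation_extractor"]
```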

Thanks, it's very clear now.

Sofie's code works really well.

I still have problems when editing the parse_data.py file. Unfortunately, I can't share my data for confidentiality reasons. I obtained the following error:

==================================== data ====================================
Running command: /usr/bin/python ./scripts/parse_data.py assets/annotations.jsonl data/train.spacy data/dev.spacy data/test.spacy
Traceback (most recent call last):

 File "parse_data.py", line 75, in main
   start = span_end_to_start[relation["head"]]

KeyError: 6


During handling of the above exception, another exception occurred:


Traceback (most recent call last):

 File "rel_component/./scripts/parse_data.py", line 144, in <module>
   typer.run(main)

 File "rel_component/./scripts/parse_data.py", line 119, in main
   msg.fail(f"Skipping doc because of key error: {e} in {example['meta']['source']}")

KeyError: 'meta'

What I did was edit MAP_LABELS with my own labels; for example, a line of MAP_LABELS would be:

"REQUIREMENT": "Requires",

Maybe I'd also need to delete this line: article_id = article_id.replace("BioNLP 2011 Genia Shared Task, ", ""), but I haven't done it yet.

I read the articles but didn't find the part where they edit parse_data.py. I only have an annotations.jsonl file, no .spacy files; is that correct?

This is great progress!

Ah yes - your data would have a problem with this part of the code, starting at line 91:

# only keeping documents with at least 1 positive case
if pos > 0:
    # use the original PMID/PMCID to decide on train/dev/test split
    article_id = example["meta"]["source"]
    article_id = article_id.replace("BioNLP 2011 Genia Shared Task, ", "")
    article_id = article_id.replace(".txt", "")
    article_id = article_id.split("-")[1]
    if article_id.endswith("4"):
        ids["dev"].add(article_id)
        docs["dev"].append(doc)
        count_pos["dev"] += pos
        count_all["dev"] += pos + neg
    elif article_id.endswith("3"):
        ids["test"].add(article_id)
        docs["test"].append(doc)
        count_pos["test"] += pos
        count_all["test"] += pos + neg
    else:
        ids["train"].add(article_id)
        docs["train"].append(doc)
        count_pos["train"] += pos
        count_all["train"] += pos + neg

This part partitions the data into train, dev, and test. For this project, the partitions were based on metadata (e.g., if the article_id ends with 4 it goes in dev; if it ends with 3, it goes in test, etc.).

You'd just want to randomly assign records into each partition.

I went ahead and modified the code below:

#parse_data.py
import json
import random

import typer
from pathlib import Path

from spacy.tokens import Span, DocBin, Doc
from spacy.vocab import Vocab
from wasabi import Printer

msg = Printer()

SYMM_LABELS = ["Binds"]
MAP_LABELS = {
    "Pos-Reg": "Regulates",
    "Neg-Reg": "Regulates",
    "Reg": "Regulates",
    "No-rel": "Regulates",
    "Binds": "Binds",
}


def main(json_loc: Path, train_file: Path, dev_file: Path, test_file: Path):
    """Creating the corpus from the Prodigy annotations."""
    random.seed(0)
    Doc.set_extension("rel", default={})
    vocab = Vocab()

    docs = {"train": [], "dev": [], "test": []}
    ids = {"train": set(), "dev": set(), "test": set()}
    count_all = {"train": 0, "dev": 0, "test": 0}
    count_pos = {"train": 0, "dev": 0, "test": 0}

    with json_loc.open("r", encoding="utf8") as jsonfile:
        for line in jsonfile:
            example = json.loads(line)
            span_starts = set()
            if example["answer"] == "accept":
                neg = 0
                pos = 0
                # Parse the tokens
                words = [t["text"] for t in example["tokens"]]
                spaces = [t["ws"] for t in example["tokens"]]
                doc = Doc(vocab, words=words, spaces=spaces)

                # Parse the GGP entities
                spans = example["spans"]
                entities = []
                span_end_to_start = {}
                for span in spans:
                    entity = doc.char_span(
                        span["start"], span["end"], label=span["label"]
                    )
                    span_end_to_start[span["token_end"]] = span["token_start"]
                    entities.append(entity)
                    span_starts.add(span["token_start"])
                doc.ents = entities

                # Parse the relations
                rels = {}
                for x1 in span_starts:
                    for x2 in span_starts:
                        rels[(x1, x2)] = {}
                relations = example["relations"]
                for relation in relations:
                    # the 'head' and 'child' annotations refer to the end token in the span
                    # but we want the first token
                    start = span_end_to_start[relation["head"]]
                    end = span_end_to_start[relation["child"]]
                    label = relation["label"]
                    label = MAP_LABELS[label]
                    if label not in rels[(start, end)]:
                        rels[(start, end)][label] = 1.0
                        pos += 1
                    if label in SYMM_LABELS:
                        if label not in rels[(end, start)]:
                            rels[(end, start)][label] = 1.0
                            pos += 1

                # The annotation is complete, so fill in zero's where the data is missing
                for x1 in span_starts:
                    for x2 in span_starts:
                        for label in MAP_LABELS.values():
                            if label not in rels[(x1, x2)]:
                                neg += 1
                                rels[(x1, x2)][label] = 0.0
                doc._.rel = rels

                # only keeping documents with at least 1 positive case
                if pos > 0:
                    # single draw so the splits are 20% test, 30% dev, 50% train
                    split = random.random()
                    if split < 0.2:
                        docs["test"].append(doc)
                        count_pos["test"] += pos
                        count_all["test"] += pos + neg
                    elif split < 0.5:
                        docs["dev"].append(doc)
                        count_pos["dev"] += pos
                        count_all["dev"] += pos + neg
                    else:
                        docs["train"].append(doc)
                        count_pos["train"] += pos
                        count_all["train"] += pos + neg

    docbin = DocBin(docs=docs["train"], store_user_data=True)
    docbin.to_disk(train_file)
    msg.info(
        f"{len(docs['train'])} training sentences, "
        f"{count_pos['train']}/{count_all['train']} pos instances."
    )

    docbin = DocBin(docs=docs["dev"], store_user_data=True)
    docbin.to_disk(dev_file)
    msg.info(
        f"{len(docs['dev'])} dev sentences, "
        f"{count_pos['dev']}/{count_all['dev']} pos instances."
    )

    docbin = DocBin(docs=docs["test"], store_user_data=True)
    docbin.to_disk(test_file)
    msg.info(
        f"{len(docs['test'])} test sentences, "
        f"{count_pos['test']}/{count_all['test']} pos instances."
    )


if __name__ == "__main__":
    typer.run(main)

Since I don't have your data, you'll need to modify the labels, but I got it to run on the example data. I set arbitrary splits (20% "test", 30% "dev", and 50% "train"); you'd likely be better off setting these as parameters than hardcoding them.
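If you do parameterize it, one way (a sketch, with hypothetical portion arguments) is to draw a single random number per record:

```python
import random


def assign_split(rng, test_portion=0.2, dev_portion=0.3):
    """Assign a record to a partition with a single random draw.

    Whatever isn't test or dev becomes train (here 50%).
    """
    r = rng.random()
    if r < test_portion:
        return "test"
    if r < test_portion + dev_portion:
        return "dev"
    return "train"


# usage inside the parsing loop would look like:
# partition = assign_split(random, test_portion=0.2, dev_portion=0.3)
# docs[partition].append(doc)
```

Drawing once keeps the proportions exactly as stated; drawing a fresh random number per branch would skew them.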

Hope this helps!


Hi Ryan,

Sorry for the delay, I was away for a few days.

I edited the labels but I still have errors:

  File "scripts/parse_data.py", line 137, in <module>
    typer.run(main)

  File "./scripts/parse_data.py", line 78, in main
    start = span_end_to_start[relation["head"]]

I carefully checked the differences between my own data and Sofie's and only found a few:
First, there is no tokens property/attribute in my annotations.jsonl file, only spans. Second, some of my relation labels are null. Finally, I have an "_is_binary" property; it seems like it's always false.

I used Prodigy to annotate the data and used a blank language model. Might that be the reason why there is no tokens property in my annotations.jsonl file? Could it lead to the error I'm getting?

Thank you !

hi @stella!

I don't think the tokens would do it. You can add in the tokens:

from prodigy.components.preprocess import add_tokens
import spacy

nlp = spacy.blank("en")
# pretend this is your file, you can load using srsly.read_jsonl
stream = [{"text": "Hello world"}, {"text": "Another text"}]
stream = add_tokens(nlp, stream, skip=True)

This is typically used within custom recipes.

Are you receiving a KeyError? If so, it seems likely due to what you were suspecting here.

I glanced back at Sofie's annotations, and it seems like they all (or at least most) have relation labels. But I removed the relations annotations from one of the records and the code still ran fine, so I don't think this is the problem.

First, perhaps try a little manual test to discover whether there's a specific problem record: run your code on, say, the first 5 records. If it runs, that at least proves there isn't a systematic problem.

Alternatively, a likely better way is to add a simple print(line) at the end of the for line in jsonfile: loop. That way, each record that was processed successfully gets printed out, and hopefully you can find which record is the culprit.

If you find it's all records (i.e., it won't process any of them), could you take one of your smallest examples, change the words in your text/label names, and provide that dummy example? Unfortunately, without a tangible example this is a bit hard to diagnose.
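To automate that hunt, here's a small sketch (find_bad_records is a hypothetical helper; it mirrors the head/child lookup that parse_data.py does) that reports which lines of the JSONL fail:

```python
import json


def find_bad_records(lines):
    """Return (line_number, error) pairs for records the parser would choke on."""
    failures = []
    for i, line in enumerate(lines, start=1):
        try:
            example = json.loads(line)
            if example.get("answer") != "accept":
                continue
            # mirror parse_data.py: map each span's end token to its start token
            span_end_to_start = {
                s["token_end"]: s["token_start"] for s in example["spans"]
            }
            for rel in example.get("relations", []):
                # these lookups raise KeyError if head/child isn't a span end
                span_end_to_start[rel["head"]]
                span_end_to_start[rel["child"]]
        except (json.JSONDecodeError, KeyError) as e:
            failures.append((i, repr(e)))
    return failures
```

Run it over your annotations file and it returns only the offending line numbers, which you can then inspect by hand.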

Thanks again for your work towards this. I'm pretty certain we've almost solved your problem.

Thanks for your help, Ryan. I thought I wrote an answer but it got lost, so I'll redo it. Sorry if multiple posts show up later.

It seems like there is token information after all. I was sure there wasn't and had carefully checked earlier; maybe I changed something in the meantime, I don't really know.

On fewer examples, there is another error:

AssertionError: [E923] It looks like there is no proper sample data to initialize the Model of component 'tok2vec'. To check your input data paths and annotation, run: python -m spacy debug data config.cfg and include the same config override values you would specify for the 'spacy train' command.

and the model is still not generated.

When including a specific line in my annotations.jsonl file, I obtain the previous error. And yes, it's a KeyError. The specific data:

{"text":"A b c, d e f g h i j k l m n o p q r s t",

"_input_hash":-2042879851,

"_task_hash":-2085742229,

"_is_binary":false,

"spans":[{"start":6,"end":14,"token_start":1,"token_end":1,"label":"ROLE"},{"start":28,"end":37,"token_start":5,"token_end":5,"label":"ROLE"},{"start":51,"end":60,"token_start":8,"token_end":8,"label":"OBJECT"},{"start":108,"end":117,"token_start":16,"token_end":16,"label":"STANDARD"},{"start":135,"end":141,"token_start":20,"token_end":20,"label":"OBJECT"}],

"tokens":[{"text":"A","start":0,"end":5,"id":0,"ws":true,"disabled":false},{"text":"b","start":6,"end":14,"id":1,"ws":true,"disabled":false},{"text":"c","start":15,"end":22,"id":2,"ws":false,"disabled":false},{"text":",","start":22,"end":23,"id":3,"ws":true,"disabled":false},{"text":"d","start":24,"end":27,"id":4,"ws":true,"disabled":false},{"text":"e","start":28,"end":37,"id":5,"ws":true,"disabled":false},{"text":"f","start":38,"end":42,"id":6,"ws":true,"disabled":false},{"text":"g","start":43,"end":50,"id":7,"ws":true,"disabled":false},{"text":"h","start":51,"end":60,"id":8,"ws":true,"disabled":false},{"text":"i","start":61,"end":71,"id":9,"ws":true,"disabled":false},{"text":"j","start":72,"end":74,"id":10,"ws":true,"disabled":false},{"text":"k","start":75,"end":79,"id":11,"ws":true,"disabled":false},{"text":"l","start":80,"end":87,"id":12,"ws":true,"disabled":false},{"text":"m","start":88,"end":94,"id":13,"ws":true,"disabled":false},{"text":"n","start":95,"end":98,"id":14,"ws":true,"disabled":false},{"text":"o","start":99,"end":107,"id":15,"ws":true,"disabled":false},{"text":"p","start":108,"end":117,"id":16,"ws":true,"disabled":false},{"text":"q","start":118,"end":127,"id":17,"ws":true,"disabled":false},{"text":"r","start":128,"end":130,"id":18,"ws":true,"disabled":false},{"text":"s","start":131,"end":134,"id":19,"ws":true,"disabled":false},{"text":"t","start":135,"end":141,"id":20,"ws":false,"disabled":false}],"_view_id":"relations","relations":[{"head":6,"child":7,"head_span":{"start":38,"end":42,"token_start":6,"token_end":6,"label":null},"child_span":{"start":43,"end":50,"token_start":7,"token_end":7,"label":null},"color":"#96e8ce","label":"REQUIREMENT"},{"head":5,"child":7,"head_span":{"start":28,"end":37,"token_start":5,"token_end":5,"label":"ROLE"},"child_span":{"start":43,"end":50,"token_start":7,"token_end":7,"label":null},"color":"#c5bdf4","label":"SUBJECT"},{"head":7,"child":8,"head_span":{"start":43,"end":50,"token_start":7,"token_end":7,"label":null
},"child_span":{"start":51,"end":60,"token_start":8,"token_end":8,"label":"OBJECT"},"color":"#b5c6c9","label":"DOCUMENTATION"}],"answer":"accept","_timestamp":1676580207}

In my MAP_LABELS, among others, are:

"SUBJECT": "Is",
"DOCUMENTATION": "Explains",
"REQUIREMENT": "Requires",

ROLE, OBJECT and STANDARD are not in MAP_LABELS, as they are not relation labels.

By the way, I still have :

SYMM_LABELS = ["Binds"]

and wonder how I could safely delete this.

Thank you for your help. I hope we can sort this out.

I have another question :

I wanted not to be stuck and be able to continue the workflow thanks to Sofie's model. I had the following error when trying to load her generated model :

ValueError: [E002] Can't find factory for 'relation_extractor' for language English (en). This usually happens when spaCy calls `nlp.create_pipe` with a custom component name that's not registered on the current language class. If you're using a Transformer, make sure to install 'spacy-transformers'. If you're using a custom component, make sure you've added the decorator `@Language.component` (for function components) or `@Language.factory` (for class components).

Available factories: attribute_ruler, tok2vec, merge_noun_chunks, merge_entities, merge_subtokens, token_splitter, doc_cleaner, parser, beam_parser, lemmatizer, trainable_lemmatizer, entity_linker, ner, beam_ner, entity_ruler, tagger, morphologizer, senter, sentencizer, textcat, spancat, spancat_singlelabel, future_entity_ruler, span_ruler, textcat_multilabel, en.lemmatizer

I've added spacy-transformers to my packages, but it still doesn't work.

hi @stella!

First off, thanks for your updates and your patience. I spent some time yesterday with your previous update and will work on a comprehensive response. Also, I've been working with @SofieVL and the spaCy team. They're working fast on a slightly modified, more generic parse_data.py script. I'll update this response soon to explain that code with an example.

In the meantime, a few quick points.

Thanks for trying to create this example! However, I think you may have changed the text but forgot to update the character indices.

{
     "text":"A b c, d e f g h i j k l m n o p q r s t",
     "_input_hash":-2042879851,
     "_task_hash":-2085742229,
     "_is_binary":false,
     "spans":[{"start":6,"end":14,"token_start":1,"token_end":1,"label":"ROLE"}...

For example, the first span starts/ends on token 1, but still has a character "start" and "end" of 6-14, which isn't consistent with the new, shorter text.

[E923] typically happens due to incorrectly formatted data. Because of this, I'll use a new example in my follow-up response.

Also, it's worth providing a little explanation of SYMM_LABELS: these are labels for undirected relations. This is optional and only needed when you have undirected relations. MAP_LABELS (which we'll rename to DIRECTED_LABELS in the new script) are for directed relation names. I suspect that you're interested in only directional relations, right?

Thanks for the update!

Just curious, can you first install:

pip install 'spacy[transformers]'
python -m spacy download en_core_web_trf

And then run:

import spacy

nlp = spacy.load("en_core_web_trf")
doc = nlp("Apple shares rose on the news. Apple pie is delicious.")

This is to figure out whether spacy-transformers was installed correctly or not.

But we can come back to transformers after we first get your relations data parsed. Just curious, were you able to run (train) the CPU commands of Sofie's project, i.e., without using transformers?

Wow, thanks for the update and for your efforts. Please also thank the team for their work, it is very important for me to be able to use the component and I greatly appreciate that the team is working on this issue.

I've updated the example. I hope it's okay that it's gibberish:

{"text":"Aaaaa bbbbbbbb ccccccc, ddd eeeeeeeee fffff ggggggg hhhhhhhhh iiiiiiiiii jj kkkk lllllll mmmmmm nnn oooooooo ppppppppp qqqqqqqqq rr sss tttttt","_input_hash":-2042879851,"_task_hash":-2085742229,"_is_binary":false,"spans":[{"start":6,"end":14,"token_start":1,"token_end":1,"label":"ROLE"},{"start":28,"end":37,"token_start":5,"token_end":5,"label":"ROLE"},{"start":51,"end":60,"token_start":8,"token_end":8,"label":"OBJECT"},{"start":108,"end":117,"token_start":16,"token_end":16,"label":"STANDARD"},{"start":135,"end":141,"token_start":20,"token_end":20,"label":"OBJECT"}],"tokens":[{"text":"Under","start":0,"end":5,"id":0,"ws":true,"disabled":false},{"text":"supplier","start":6,"end":14,"id":1,"ws":true,"disabled":false},{"text":"request","start":15,"end":22,"id":2,"ws":false,"disabled":false},{"text":",","start":22,"end":23,"id":3,"ws":true,"disabled":false},{"text":"the","start":24,"end":27,"id":4,"ws":true,"disabled":false},{"text":"Purchaser","start":28,"end":37,"id":5,"ws":true,"disabled":false},{"text":"will","start":38,"end":42,"id":6,"ws":true,"disabled":false},{"text":"provide","start":43,"end":50,"id":7,"ws":true,"disabled":false},{"text":"documents","start":51,"end":60,"id":8,"ws":true,"disabled":false},{"text":"identified","start":61,"end":71,"id":9,"ws":true,"disabled":false},{"text":"in","start":72,"end":74,"id":10,"ws":true,"disabled":false},{"text":"this","start":75,"end":79,"id":11,"ws":true,"disabled":false},{"text":"section","start":80,"end":87,"id":12,"ws":true,"disabled":false},{"text":"except","start":88,"end":94,"id":13,"ws":true,"disabled":false},{"text":"the","start":95,"end":98,"id":14,"ws":true,"disabled":false},{"text":"external","start":99,"end":107,"id":15,"ws":true,"disabled":false},{"text":"standards","start":108,"end":117,"id":16,"ws":true,"disabled":false},{"text":"available","start":118,"end":127,"id":17,"ws":true,"disabled":false},{"text":"on","start":128,"end":130,"id":18,"ws":true,"disabled":false},{"text":"the","start":131,
"end":134,"id":19,"ws":true,"disabled":false},{"text":"market","start":135,"end":141,"id":20,"ws":false,"disabled":false}],"_view_id":"relations","relations":[{"head":6,"child":7,"head_span":{"start":38,"end":42,"token_start":6,"token_end":6,"label":null},"child_span":{"start":43,"end":50,"token_start":7,"token_end":7,"label":null},"color":"#96e8ce","label":"REQUIREMENT"},{"head":5,"child":7,"head_span":{"start":28,"end":37,"token_start":5,"token_end":5,"label":"ROLE"},"child_span":{"start":43,"end":50,"token_start":7,"token_end":7,"label":null},"color":"#c5bdf4","label":"SUBJECT"},{"head":7,"child":8,"head_span":{"start":43,"end":50,"token_start":7,"token_end":7,"label":null},"child_span":{"start":51,"end":60,"token_start":8,"token_end":8,"label":"OBJECT"},"color":"#b5c6c9","label":"DOCUMENTATION"}],"answer":"accept","_timestamp":1676580207}

Thanks for the explanation of SYMM_LABELS! Is it okay if I leave the array empty? Yes, I'm only interested in directed relations.

I ran your example and it worked well, so spacy-transformers looks correctly installed.

It is actually hard to tell whether I was able to generate Sofie's model without transformers. I would say yes, as the package was not in my venv, but I am not totally sure.

hi @stella!

With the help of Sofie and team, we created a more generic parse_data.py that we hope should work.

Thanks for the updated example.

It seems like your biggest problem is that you annotated relations without entities, using non-entity tokens as your relationship head/child.

For example:

{
   "head":6,
   "child":7,
   "head_span":{
      "start":38,
      "end":42,
      "token_start":6,
      "token_end":6,
      "label":null
   },
   "child_span":{
      "start":43,
      "end":50,
      "token_start":7,
      "token_end":7,
      "label":null
   },
   "color":"#96e8ce",
   "label":"REQUIREMENT"
}

Were you aware of this? Did you set an annotation scheme (i.e., rules/strategy for annotating) that allows that?

Sofie's tutorial assumes entities are used in the relation's head/child, so any pair of entities is classified as being in a relation or not. To predict relations without entities, you'd need to consider a combinatorial explosion of all plausible pairs of tokens.

This is a very common assumption in NLP, e.g., here's a recent NLP (ACL workshop 2022) survey paper that outlined relation extraction models, highlighting that NER (aka Mention Detection) is the common earlier step to either relation identification (is there a relation between these two entities) or relation classification (what is the relationship between these two entities):

I know that you and your team did a lot of annotations, but unfortunately you likely will need to either drop or relabel those that didn't use an entity within the relationship.
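If you want to gauge how many annotations are affected before dropping or relabeling anything, a small stdlib-only sketch like this (field names follow the Prodigy relations format shown above; the file path is hypothetical) can scan each record:

```python
import json

def unlabeled_relations(example: dict) -> list:
    """Return the relations whose head or child span has no entity label."""
    return [
        rel for rel in example.get("relations", [])
        if rel["head_span"].get("label") is None
        or rel["child_span"].get("label") is None
    ]

# Usage over a JSONL export (path is hypothetical):
# with open("my_annotations.jsonl", encoding="utf8") as f:
#     for i, line in enumerate(f):
#         bad = unlabeled_relations(json.loads(line))
#         if bad:
#             print(f"record {i}: {len(bad)} relation(s) without entities")
```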

The good news is Sofie and team (big thank you!) have crafted an updated parse_data.py example that should work now for any examples that do include entities in the relations:

# This script was derived from parse_data.py but made more generic as a template for various REL parsing needs

import json
import random
import typer
from pathlib import Path

from spacy.tokens import DocBin, Doc
from spacy.vocab import Vocab
from wasabi import Printer

msg = Printer()

# TODO: define your labels used for annotation either as "symmetrical" or "directed"
SYMM_LABELS = ["Binds"]
DIRECTED_LABELS = ["REQUIREMENT", "SUBJECT", "DOCUMENTATION"]

# TODO: define splits for train/dev/test. What is not in test or dev, will be used as train.
test_portion = 0.2
dev_portion = 0.3

# TODO: set this bool to False if you didn't annotate all relations in all sentences.
# If it's true, entities that were not annotated as related will be used as negative examples.
is_complete = True


def main(json_loc: Path, train_file: Path, dev_file: Path, test_file: Path):
    """Creating the corpus from the Prodigy annotations."""
    Doc.set_extension("rel", default={})
    vocab = Vocab()

    docs = {"train": [], "dev": [], "test": []}
    count_all = {"train": 0, "dev": 0, "test": 0}
    count_pos = {"train": 0, "dev": 0, "test": 0}

    with json_loc.open("r", encoding="utf8") as jsonfile:
        for line in jsonfile:
            example = json.loads(line)
            span_starts = set()
            if example["answer"] == "accept":
                neg = 0
                pos = 0
                # Parse the tokens
                words = [t["text"] for t in example["tokens"]]
                spaces = [t["ws"] for t in example["tokens"]]
                doc = Doc(vocab, words=words, spaces=spaces)

                # Parse the entities
                spans = example["spans"]
                entities = []
                span_end_to_start = {}
                for span in spans:
                    entity = doc.char_span(
                        span["start"], span["end"], label=span["label"]
                    )
                    if entity is None:
                        msg.warn(f"Skipping span that doesn't align with token boundaries: {span}")
                        continue
                    span_end_to_start[span["token_end"]] = span["token_start"]
                    entities.append(entity)
                    span_starts.add(span["token_start"])
                if not entities:
                    msg.warn("Could not parse any entities from the JSON file.")
                doc.ents = entities

                # Parse the relations
                rels = {}
                for x1 in span_starts:
                    for x2 in span_starts:
                        rels[(x1, x2)] = {}
                relations = example["relations"]
                for relation in relations:
                    # Ignore relations whose head/child are not entity spans (they were annotated on the token level)
                    if relation["head"] not in span_end_to_start or relation["child"] not in span_end_to_start:
                        msg.warn("This script only supports relationships between annotated entities.")
                        continue
                    # the 'head' and 'child' annotations refer to the end token in the span
                    # but we want the first token
                    start = span_end_to_start[relation["head"]]
                    end = span_end_to_start[relation["child"]]
                    label = relation["label"]
                    if label not in SYMM_LABELS + DIRECTED_LABELS:
                        msg.warn(f"Found label '{label}' not defined in SYMM_LABELS or DIRECTED_LABELS - skipping")
                        continue
                    if label not in rels[(start, end)]:
                        rels[(start, end)][label] = 1.0
                        pos += 1
                    if label in SYMM_LABELS:
                        if label not in rels[(end, start)]:
                            rels[(end, start)][label] = 1.0
                            pos += 1

                # If the annotation is complete, fill in zeros where the data is missing
                if is_complete:
                    for x1 in span_starts:
                        for x2 in span_starts:
                            for label in SYMM_LABELS + DIRECTED_LABELS:
                                if label not in rels[(x1, x2)]:
                                    neg += 1
                                    rels[(x1, x2)][label] = 0.0
                doc._.rel = rels

                # only keeping documents with at least 1 positive case
                if pos > 0:
                    # create the train/dev/test split randomly
                    # Note that this is not good practice as instances from the same article
                    # may end up in different splits. Ideally, change this method to keep
                    # documents together in one split (as in the original parse_data.py)
                    # Draw once so the test/dev/train proportions come out as intended
                    draw = random.random()
                    if draw < test_portion:
                        docs["test"].append(doc)
                        count_pos["test"] += pos
                        count_all["test"] += pos + neg
                    elif draw < (test_portion + dev_portion):
                        docs["dev"].append(doc)
                        count_pos["dev"] += pos
                        count_all["dev"] += pos + neg
                    else:
                        docs["train"].append(doc)
                        count_pos["train"] += pos
                        count_all["train"] += pos + neg

    docbin = DocBin(docs=docs["train"], store_user_data=True)
    docbin.to_disk(train_file)
    msg.info(
        f"{len(docs['train'])} training sentences, "
        f"{count_pos['train']}/{count_all['train']} pos instances."
    )

    docbin = DocBin(docs=docs["dev"], store_user_data=True)
    docbin.to_disk(dev_file)
    msg.info(
        f"{len(docs['dev'])} dev sentences, "
        f"{count_pos['dev']}/{count_all['dev']} pos instances."
    )

    docbin = DocBin(docs=docs["test"], store_user_data=True)
    docbin.to_disk(test_file)
    msg.info(
        f"{len(docs['test'])} test sentences, "
        f"{count_pos['test']}/{count_all['test']} pos instances."
    )


if __name__ == "__main__":
    typer.run(main)

What's great with this example is that you have three "to-dos":

  1. Define the direction of your relations and the names of those labels. For example, I went ahead and put in your DIRECTED_LABELS. You can ignore SYMM_LABELS if you don't have any symmetric labels in your dataset.

  2. Define your splits. We have general splits of 50% train, 20% test, and 30% dev.

  3. Define an assumption on whether or not you annotated all relations in each sentence. By default we're assuming this is True.
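On to-do #2, the proportions assume a single random draw per document, which then falls into the test, dev, or train bucket. A minimal sketch of that split logic (seeded here for reproducibility; the function name is just illustrative):

```python
import random

def assign_split(rng: random.Random, test_portion=0.2, dev_portion=0.3) -> str:
    # One draw per document:
    # [0, test) -> test, [test, test + dev) -> dev, the rest -> train
    draw = rng.random()
    if draw < test_portion:
        return "test"
    if draw < test_portion + dev_portion:
        return "dev"
    return "train"

rng = random.Random(42)
counts = {"train": 0, "dev": 0, "test": 0}
for _ in range(10_000):
    counts[assign_split(rng)] += 1
# counts will be roughly {"train": 5000, "dev": 3000, "test": 2000}
```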

After this, you can now run:

python generic_parse_data.py my_annotations.jsonl train.spacy dev.spacy test.spacy

We also included warnings that if your .jsonl annotations do have examples of relations but without entities, it will skip those and provide you a warning:

⚠ This script only supports relationships between annotated entities.

Hopefully, this should be what you need at the moment :crossed_fingers:.

Last, if you don't mind, I want to take a moment to recommend a training video by Matt and several other Explosion docs to talk more about thinking carefully for applied NLP problems.

Matt's video has been incredibly helpful for me and changed how I thought about NLP problems when I first saw it in 2019:

NLP projects are like start-ups. They fail a lot. This isn't a bad thing, it's just you need a lot of iteration to better define your unique problem. It's easy to get caught with the State-of-the-Art models thrown around in academia and the press, but for many "rubber-to-road" NLP real world problems, the hardest part is defining clearly what your goal is.

This is at the heart of Prodigy's design. It's designed to test out ideas extremely fast, especially with a data scientist and domain expert working with annotators. Teams can quickly adapt their unique problems to find the best solution from an annotator, business, and ML/NLP perspective.

Matt talks more about this around 6:15 in the talk when he introduces the ML Hierarchy of Needs.

That is, it's important to start by clearly thinking about the business problem you want NLP to solve. This will help set up the problem so that it's an easier task for the ML algorithm to learn.

This is where I think your project may have gotten ahead of itself: doing a lot of annotations without realizing the complexity that annotating relations without entities would create.

With this knowledge, what's important is that you carefully construct an annotation scheme (e.g., by creating annotation guidelines) and iterating on these guidelines as you find examples that fit and don't fit your guidelines. This is especially important when you have multiple annotators as you need to make sure everyone is annotating consistently and not adding noise simply because there's miscommunication between annotators on how you're defining what each entity or relationship is.

One of the first things you would likely want is to require all annotators to include entities within their relations. If you think annotation guidelines aren't enough, Prodigy has a ton of quick tricks, like the validate_answer callback. This could be used if you're labeling both entities and relations at the same time: you can write a small script that checks, before accepting an annotation, that the user has only created relations between entities. There's also the --disable-patterns option, which disables tokens matching a pattern so they can't be annotated incorrectly.
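For example, a hypothetical validate_answer callback for a custom recipe could reject any submission where a relation isn't anchored on two labeled spans (raising an error is how the callback signals an invalid answer back to the annotator):

```python
def validate_answer(eg: dict) -> None:
    # Reject the annotation if any relation's head or child span
    # is not a labeled entity span. (Sketch only; adapt the check
    # to your own annotation scheme.)
    for rel in eg.get("relations", []):
        if rel["head_span"].get("label") is None or rel["child_span"].get("label") is None:
            raise ValueError(
                "Relations must connect two labeled entity spans - "
                "label both spans before adding the relation."
            )
```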

I helped to write a case study by the Guardian who did an amazing job of creating a fast iterative annotation process around their annotation guidelines for doing quote extraction with NER using Prodigy:

While you shouldn't expect to have robust guidelines on your first pass, starting small and iterating can go a very long way. That is, start with some basic working definitions and (ideally) examples of the entities and relations you want to annotate. Let annotators annotate some examples, and ideally have them flag those that may not fit the working definition or contradict the examples. Have the annotators discuss these with data scientists and domain experts, and iterate.

Sofie had a wonderful related recent post on this as well:

Feel free to let me know if you have questions. I can understand that all of these materials on our Applied NLP philosophy may be a lot at first. However I think now with the right expectations, learnings, resources, and tool (aka Prodigy), you and your team are ready to take on many applied NLP problems and are on the path for success! (And if you do hit bumps in the road, we'll be here to help :smiley: )


Great, awesome work, thank you! I didn't expect it to be so quick!

Thank you for your very detailed answer. Actually, I am not a beginner in the field of NLP (sorry if it looked like it). I am sure that the relationships I annotated exist only between named entities, so I am quite surprised... Maybe some issue occurred...

No worries about the time it took for me to annotate: I annotated only very few examples on purpose, as I wanted to try the whole workflow and see how it worked before spending time on the very time-consuming process of annotation. Until now, I've only used SpaCy to train NER models, so annotating relations was quite new to me.

I'll try the new component with brand new annotations tomorrow and tell you if it worked correctly on my side. Thanks, team !

I think I know why there are relations between spans that are not considered named entities... Actually, I followed this tutorial:

Consider the example "Obama was born in Hawaii". Obama and Hawaii are named entities. To model the relation existing between Obama and Hawaii, I did exactly the same as in the tutorial: you model the "subject" relation (Obama, born), then the "location" relation (born, Hawaii).

Could you please explain how I should proceed? Should "born" be a named entity / an annotated span? Or should I reject this annotation scheme and directly model one relation between Obama and Hawaii, which would be the location of birth?

It works! We still needed to replace parse_data with your version; the team's copy didn't include your edits. Without the meta information, it doesn't work otherwise.

Thank you very much for your efforts. :smile:

I just need to be sure of the appropriate way of annotating relations (with the "Obama was born in Hawaii" example) and then we're all good!

hi @stella!

That's great :tada:

Thanks for mentioning this post, because I can see how at first it can be confusing. That example is for dependency parsing, which, while it uses the same annotation UI, would be trained differently (e.g., prodigy train --parser). Here are more details on how that is trained (what's nice there is that it uses a built-in spaCy component).

I see now how, glancing at it (since it's under the relations extraction section), it can be a bit confusing. We'll see about making this distinction clearer in the docs.

So if your goal is relation extraction (especially using Sofie's training code), your last point is what you should focus on: "reject this annotating scheme and directly model one relation between Obama and Hawaii, which would be the location of birth".
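Concretely, a record under that scheme might look like the sketch below (the LOCATION_OF_BIRTH label name and the exact offsets are illustrative, following the relations format shown earlier in this thread):

```python
# Hypothetical annotation record: one relation directly between two
# labeled entity spans, with no token-level hop through "born".
annotation = {
    "text": "Obama was born in Hawaii",
    # Both spans are labeled entities...
    "spans": [
        {"start": 0, "end": 5, "token_start": 0, "token_end": 0, "label": "PERSON"},
        {"start": 18, "end": 24, "token_start": 4, "token_end": 4, "label": "GPE"},
    ],
    # ...and the single relation connects them directly.
    "relations": [
        {
            "head": 0, "child": 4,
            "head_span": {"start": 0, "end": 5, "token_start": 0, "token_end": 0, "label": "PERSON"},
            "child_span": {"start": 18, "end": 24, "token_start": 4, "token_end": 4, "label": "GPE"},
            "label": "LOCATION_OF_BIRTH",
        }
    ],
}
```

Note that every head_span and child_span carries an entity label, which is exactly what the parsing script above requires.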

Sorry to post again, but Sofie formalized this around 1:00 - 4:00 in her video where she specifies that her relation extraction component is strictly for relationships between two named entities:

Hope this answers your questions!

Great, it was pretty confusing indeed! Now I know why it felt so strange. For dependency parsing, it makes sense.

I forgot, but there is still the issue related to loading a trained model:

ValueError: [E002] Can't find factory for 'relation_extractor' for language English (en). This usually happens when spaCy calls `nlp.create_pipe` with a custom component name that's not registered on the current language class. If you're using a Transformer, make sure to install 'spacy-transformers'. If you're using a custom component, make sure you've added the decorator `@Language.component` (for function components) or `@Language.factory` (for class components).

Available factories: attribute_ruler, tok2vec, merge_noun_chunks, merge_entities, merge_subtokens, token_splitter, doc_cleaner, parser, beam_parser, lemmatizer, trainable_lemmatizer, entity_linker, ner, beam_ner, entity_ruler, tagger, morphologizer, senter, sentencizer, textcat, spancat, spancat_singlelabel, future_entity_ruler, span_ruler, textcat_multilabel, en.lemmatizer

This time, I reused the standard English model when training my own, so the issue is not related to this.