Training a relation extraction component

hi @stella!

With the help of Sofie and team, we created a more generic parse_data.py that we hope will work for you.

Thanks for the updated example.

It seems like your biggest problem is that you annotated relations without entities, using non-entity tokens as the relationship's head/child.

For example:

{
   "head":6,
   "child":7,
   "head_span":{
      "start":38,
      "end":42,
      "token_start":6,
      "token_end":6,
      "label":null
   },
   "child_span":{
      "start":43,
      "end":50,
      "token_start":7,
      "token_end":7,
      "label":null
   },
   "color":"#96e8ce",
   "label":"REQUIREMENT"
}

Were you aware of this? Did you set any annotation scheme (i.e., rules/strategy for annotating) to allow that?

Sofie's tutorial assumes that the relation's head/child are entities, so that any pair of entities can be classified as being in a relation or not. To predict relations without entities, you'd need to consider a combinatorial explosion of all plausible pairs of tokens.
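
To make that concrete, here's a tiny back-of-the-envelope sketch (the numbers are just illustrative assumptions, not taken from your data):

# Assume a 40-token sentence in which NER found 3 entities.
n_tokens = 40
n_entities = 3

# Candidate pairs if any token could be a head or child (ordered pairs, no self-pairs):
token_pairs = n_tokens * (n_tokens - 1)        # 1560 candidates to classify
# Candidate pairs if only entities can be heads/children:
entity_pairs = n_entities * (n_entities - 1)   # 6 candidates to classify

print(token_pairs, entity_pairs)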

This is a very common assumption in NLP. For example, a recent survey paper (from an ACL 2022 workshop) outlines relation extraction models and highlights that NER (aka Mention Detection) is the common earlier step to either relation identification (is there a relation between these two entities?) or relation classification (what is the relationship between these two entities?):

I know that you and your team did a lot of annotations, but unfortunately you'll likely need to either drop or relabel those that didn't use an entity within the relationship.
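
If it helps to quantify that first, here's a minimal sketch (my own, not part of the parsing script below) that counts how many of your annotated relations don't point at an entity span. It assumes a Prodigy-style .jsonl export; the filename is just a placeholder:

import json

n_total = 0
n_without_entities = 0

with open("my_annotations.jsonl", encoding="utf8") as jsonfile:
    for line in jsonfile:
        example = json.loads(line)
        if example.get("answer") != "accept":
            continue
        # the "head"/"child" of a relation refer to the token_end of a span,
        # mirroring the check in the parsing script below
        entity_ends = {span["token_end"] for span in example.get("spans", [])}
        for relation in example.get("relations", []):
            n_total += 1
            if relation["head"] not in entity_ends or relation["child"] not in entity_ends:
                n_without_entities += 1

print(f"{n_without_entities} of {n_total} relations are not between annotated entities")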

The good news is Sofie and team (big thank you!) have crafted an updated parse_data.py example that should work now for any examples that do include entities in the relations:

# This script was derived from parse_data.py but made more generic as a template for various REL parsing needs

import json
import random
import typer
from pathlib import Path

from spacy.tokens import DocBin, Doc
from spacy.vocab import Vocab
from wasabi import Printer

msg = Printer()

# TODO: define your labels used for annotation either as "symmetrical" or "directed"
SYMM_LABELS = ["Binds"]
DIRECTED_LABELS = ["REQUIREMENT", "SUBJECT", "DOCUMENTATION"]

# TODO: define splits for train/dev/test. Whatever is not in test or dev will be used as train.
test_portion = 0.2
dev_portion = 0.3

# TODO: set this bool to False if you didn't annotate all relations in all sentences.
# If it's true, entities that were not annotated as related will be used as negative examples.
is_complete = True


def main(json_loc: Path, train_file: Path, dev_file: Path, test_file: Path):
    """Creating the corpus from the Prodigy annotations."""
    Doc.set_extension("rel", default={})
    vocab = Vocab()

    docs = {"train": [], "dev": [], "test": []}
    count_all = {"train": 0, "dev": 0, "test": 0}
    count_pos = {"train": 0, "dev": 0, "test": 0}

    with json_loc.open("r", encoding="utf8") as jsonfile:
        for line in jsonfile:
            example = json.loads(line)
            span_starts = set()
            if example["answer"] == "accept":
                neg = 0
                pos = 0
                # Parse the tokens
                words = [t["text"] for t in example["tokens"]]
                spaces = [t["ws"] for t in example["tokens"]]
                doc = Doc(vocab, words=words, spaces=spaces)

                # Parse the entities
                spans = example["spans"]
                entities = []
                span_end_to_start = {}
                for span in spans:
                    entity = doc.char_span(
                        span["start"], span["end"], label=span["label"]
                    )
                    span_end_to_start[span["token_end"]] = span["token_start"]
                    entities.append(entity)
                    span_starts.add(span["token_start"])
                if not entities:
                    msg.warn("Could not parse any entities from the JSON file.")
                doc.ents = entities

                # Parse the relations
                rels = {}
                for x1 in span_starts:
                    for x2 in span_starts:
                        rels[(x1, x2)] = {}
                relations = example["relations"]
                for relation in relations:
                    # Ignore relations that are not between entity spans (i.e. annotated on the token level)
                    if relation["head"] not in span_end_to_start or relation["child"] not in span_end_to_start:
                        msg.warn("This script only supports relationships between annotated entities.")
                        continue
                    # the 'head' and 'child' annotations refer to the end token in the span
                    # but we want the first token
                    start = span_end_to_start[relation["head"]]
                    end = span_end_to_start[relation["child"]]
                    label = relation["label"]
                    if label not in SYMM_LABELS + DIRECTED_LABELS:
                        msg.warn(f"Found label '{label}' not defined in SYMM_LABELS or DIRECTED_LABELS - skipping")
                        continue
                    if label not in rels[(start, end)]:
                        rels[(start, end)][label] = 1.0
                        pos += 1
                    if label in SYMM_LABELS:
                        if label not in rels[(end, start)]:
                            rels[(end, start)][label] = 1.0
                            pos += 1

                # If the annotation is complete, fill in zeros where the data is missing
                if is_complete:
                    for x1 in span_starts:
                        for x2 in span_starts:
                            for label in SYMM_LABELS + DIRECTED_LABELS:
                                if label not in rels[(x1, x2)]:
                                    neg += 1
                                    rels[(x1, x2)][label] = 0.0
                doc._.rel = rels

                # only keeping documents with at least 1 positive case
                if pos > 0:
                    # create the train/dev/test split randomly
                    # Note that this is not good practice as instances from the same article
                    # may end up in different splits. Ideally, change this method to keep
                    # documents together in one split (as in the original parse_data.py)
                    # draw a single random number so the splits match the configured portions
                    split = random.random()
                    if split < test_portion:
                        docs["test"].append(doc)
                        count_pos["test"] += pos
                        count_all["test"] += pos + neg
                    elif split < (test_portion + dev_portion):
                        docs["dev"].append(doc)
                        count_pos["dev"] += pos
                        count_all["dev"] += pos + neg
                    else:
                        docs["train"].append(doc)
                        count_pos["train"] += pos
                        count_all["train"] += pos + neg

    docbin = DocBin(docs=docs["train"], store_user_data=True)
    docbin.to_disk(train_file)
    msg.info(
        f"{len(docs['train'])} training sentences, "
        f"{count_pos['train']}/{count_all['train']} pos instances."
    )

    docbin = DocBin(docs=docs["dev"], store_user_data=True)
    docbin.to_disk(dev_file)
    msg.info(
        f"{len(docs['dev'])} dev sentences, "
        f"{count_pos['dev']}/{count_all['dev']} pos instances."
    )

    docbin = DocBin(docs=docs["test"], store_user_data=True)
    docbin.to_disk(test_file)
    msg.info(
        f"{len(docs['test'])} test sentences, "
        f"{count_pos['test']}/{count_all['test']} pos instances."
    )


if __name__ == "__main__":
    typer.run(main)

What's great about this example is that you have just three "to-dos":

  1. Define the direction of your relations and the names of those labels. For example, I went ahead and put in your DIRECTED_LABELS. You can ignore SYMM_LABELS if you don't have any symmetrical relations labeled in your dataset.

  2. Define your splits. We set default splits of 50% train, 20% test, and 30% dev.

  3. Decide whether or not you annotated all relations in each sentence. By default we're assuming this is True.

After this, you can now run:

python generic_parse_data.py my_annotations.jsonl train.spacy dev.spacy test.spacy
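
If you want to sanity-check the output, a quick way (my own sketch, not part of the script) is to read one of the generated .spacy files back and inspect the stored entities and relations. The "rel" extension has to be registered first, and I'm assuming a blank English pipeline just to get a vocab:

import spacy
from spacy.tokens import Doc, DocBin

Doc.set_extension("rel", default={}, force=True)
nlp = spacy.blank("en")  # only used for its vocab here

doc_bin = DocBin().from_disk("train.spacy")
for doc in list(doc_bin.get_docs(nlp.vocab))[:3]:
    print(doc.text)
    print(doc.ents)
    print(doc._.rel)  # {(head_start_token, child_start_token): {label: 0.0 or 1.0}}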

We also included a warning: if your .jsonl annotations contain relations that aren't between entities, the script will skip those and warn you:

⚠ This script only supports relationships between annotated
entities.

Hopefully, this should be what you need at the moment :crossed_fingers:.

Last, if you don't mind, I want to take a moment to recommend a training video by Matt and several other Explosion resources that talk more about thinking carefully about applied NLP problems.

Matt's video has been incredibly helpful for me and changed how I thought about NLP problems when I first saw it in 2019:

NLP projects are like start-ups. They fail a lot. This isn't a bad thing; it just means you need a lot of iteration to better define your unique problem. It's easy to get caught up in the state-of-the-art models thrown around in academia and the press, but for many "rubber-meets-the-road" real-world NLP problems, the hardest part is defining clearly what your goal is.

This is at the heart of Prodigy's design. It's designed to test out ideas extremely fast, especially with a data scientist and domain expert working with annotators. Teams can quickly adapt to their unique problems and find the best solution from an annotator, business, and ML/NLP perspective.

Matt talks more about this around 6:15 in the talk when he introduces the ML Hierarchy of Needs.

That is, it's important to start by thinking clearly about the business problem you want NLP to solve. This helps set up the problem so that it's an easier task for the ML algorithm to learn.

This is where I think your project may have gotten ahead of itself: doing a lot of annotations without realizing the complexity you'd run into by annotating relations without entities.

With this knowledge, what's important is that you carefully construct an annotation scheme (e.g., by creating annotation guidelines) and iterate on those guidelines as you find examples that do and don't fit them. This is especially important when you have multiple annotators, as you need to make sure everyone is annotating consistently and not adding noise simply because there's miscommunication about how each entity or relationship is defined.

One of the first things you'll likely want is to require all annotators to include entities within their relations. If you think annotation guidelines aren't enough, Prodigy has a ton of quick tricks, like the validate_answer callback. This could be used if you're labeling both entities and relations at the same time: you can write a small check that runs before an annotation is accepted and verifies that the relations connect entities. There's also the --disable-patterns option, which can be used to disable tokens matching certain patterns so they can't be annotated by mistake.
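
To sketch what that could look like (this is my own example, not something shipped with Prodigy, so adjust it to your recipe and data): a validate_answer callback in a custom recipe receives the submitted example and can raise an error to reject it, e.g. when a relation's head or child isn't a labelled entity span:

def validate_answer(eg):
    # token_end of every labelled entity span in the submitted example
    entity_ends = {span["token_end"] for span in eg.get("spans", []) if span.get("label")}
    for relation in eg.get("relations", []):
        if relation["head"] not in entity_ends or relation["child"] not in entity_ends:
            raise ValueError(
                "Relations must connect two labelled entity spans - "
                "please annotate the entities first."
            )

# In a custom recipe you'd return this alongside the other components, roughly:
# return {"dataset": dataset, "stream": stream, "view_id": "relations",
#         "validate_answer": validate_answer}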

I helped write a case study with the Guardian, who did an amazing job of creating a fast, iterative annotation process around their annotation guidelines for quote extraction with NER using Prodigy:

While you shouldn't expect to have robust guidelines in your first pass, starting small and iterating can go a very long way. That is, start with some basic working definitions and (ideally) examples of the entities and relations you want to annotate. Let annotators annotate some examples, and ideally have them flag those that may not fit the working definitions or that contradict the examples. Have the annotators discuss these with data scientists and domain experts, and iterate.

Sofie had a wonderful related recent post on this as well:

Feel free to let me know if you have questions. I understand that all of these materials on our Applied NLP philosophy may be a lot at first. However, I think that with the right expectations, learnings, resources, and tool (aka Prodigy), you and your team are ready to take on many applied NLP problems and are on the path to success! (And if you do hit bumps in the road, we'll be here to help :smiley: )
