Training NER and relation extraction (RE) together

I'm fascinated by Sofie's tutorial on RE. I have a well-annotated dataset for both NER and RE, but I couldn't find out how to train a new model on both tasks.

Any pointers to possibly overlooked examples or documentation? Thanks!

Hi! Happy to hear the REL tutorial was useful to you :slight_smile:

The REL tutorial was meant as an example for implementing your own custom trainable component from scratch, and I think the provided implementation for relation extraction should really be taken as a baseline to start from. I can imagine a realistic application would benefit from additional features or a more complex network architecture.

That said - you can take the code from the example project and construct a config file that refers to the relation extraction component and the NER at the same time. You can extend the provided config with an NER component. If you need inspiration on how to define the NER component, you can run python -m spacy init config -p "ner" ner_config.cfg and merge that config with the REL one, so you'd have a pipeline including tok2vec, ner and relation_extractor. You can decide whether they should share the same tok2vec layer or not.
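As a rough sketch of what the merged pipeline section could look like, the snippet below parses a stripped-down config skeleton and checks that every name listed in the pipeline has a matching [components.*] block. The skeleton is an assumption for illustration only: it leaves out all the model sub-blocks from the REL project and from the generated ner_config.cfg, so it is not trainable as-is.

```python
from thinc.api import Config

# Illustrative skeleton only: the real config also needs the model blocks
# for each component (tok2vec, ner and relation_extractor).
CONFIG_SKELETON = """
[nlp]
lang = "en"
pipeline = ["tok2vec", "ner", "relation_extractor"]

[components]

[components.tok2vec]
factory = "tok2vec"

[components.ner]
factory = "ner"

[components.relation_extractor]
factory = "relation_extractor"
"""

config = Config().from_str(CONFIG_SKELETON)

# Every pipeline name should have its own [components.<name>] block.
missing = [
    name for name in config["nlp"]["pipeline"]
    if name not in config["components"]
]
print("Pipeline:", config["nlp"]["pipeline"])
print("Components without a block:", missing)
```

A check like this can catch a component that was added to the pipeline list but whose block was lost while merging the two files.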

If you feed in data that has both named entities annotated as well as the relations, it should train both simultaneously. You'll need to use the -c flag on the train command to make sure the custom functions and architectures from the REL code are imported, because these are not built-in in spaCy. In the example project, this is accomplished by doing -c custom_functions.py.
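For context, a file passed with -c is just a Python module; importing it registers functions that the config can then refer to by name. Here is a minimal sketch, assuming a hypothetical registry name demo_instance_generator.v1 (not the one used in the REL project) and a simplified rule for pairing entities:

```python
from typing import Callable, List, Tuple

import spacy
from spacy.tokens import Doc, Span


# Hypothetical name, chosen for this sketch only.
@spacy.registry.misc("demo_instance_generator.v1")
def create_instances(max_length: int) -> Callable[[Doc], List[Tuple[Span, Span]]]:
    """Return a function that lists candidate entity pairs for a Doc."""

    def get_instances(doc: Doc) -> List[Tuple[Span, Span]]:
        instances = []
        for ent1 in doc.ents:
            for ent2 in doc.ents:
                # Keep ordered pairs of distinct entities whose start
                # tokens are at most max_length tokens apart.
                if ent1 != ent2 and abs(ent2.start - ent1.start) <= max_length:
                    instances.append((ent1, ent2))
        return instances

    return get_instances


# Quick check outside of training: look the function up by its registered name.
nlp = spacy.blank("en")
doc = nlp("Alice works at Acme in Paris")
doc.ents = [Span(doc, 0, 1, label="PERSON"), Span(doc, 3, 4, label="ORG")]
generator = spacy.registry.misc.get("demo_instance_generator.v1")(max_length=20)
pairs = generator(doc)
print([(a.text, b.text) for a, b in pairs])
```

In a config, such a function would be referenced from a block with @misc = "demo_instance_generator.v1", and its arguments (here max_length) become settings in that block.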

To define your training data, you could follow the same conventions as the REL example project and store the information in the custom attribute doc._.rel; see parse_data.py in the explosion/projects repository on GitHub (v3 branch). After creating the appropriate Doc objects with the gold-standard data (entities + relations), you can serialize them to file with DocBin to create the binary .spacy files that you can feed into the spacy train command.
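A minimal sketch of that last step, with a made-up sentence and a hypothetical WORKS_AT label. It assumes the relations are stored as a dict keyed by the start tokens of the two entities, with a 1.0/0.0 value per label, so do double-check this against parse_data.py before relying on it:

```python
import tempfile
from pathlib import Path

import spacy
from spacy.tokens import Doc, DocBin, Span

# Custom attribute that holds the gold relations.
if not Doc.has_extension("rel"):
    Doc.set_extension("rel", default={})

nlp = spacy.blank("en")
doc = nlp("Alice works at Acme")

# Gold entities, given as token offsets.
doc.ents = [Span(doc, 0, 1, label="PERSON"), Span(doc, 3, 4, label="ORG")]

# Assumed layout: (start token of entity 1, start token of entity 2)
# mapped to {label: 1.0 or 0.0}. Both directions get an entry here.
doc._.rel = {
    (0, 3): {"WORKS_AT": 1.0},
    (3, 0): {"WORKS_AT": 0.0},
}

# store_user_data=True is needed so the custom attribute is saved too.
doc_bin = DocBin(docs=[doc], store_user_data=True)
out_path = Path(tempfile.mkdtemp()) / "train.spacy"
doc_bin.to_disk(out_path)

# Read the file back to confirm the entities survived the round trip.
loaded = list(DocBin(store_user_data=True).from_disk(out_path).get_docs(nlp.vocab))
print([(ent.text, ent.label_) for ent in loaded[0].ents])
```

The resulting file is what you would pass as --paths.train (and a second one as --paths.dev) on the train command.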

I think that's pretty much the general overview. Let me know if you run into specific issues!

Fab as usual! I am experimenting with training both components (NER and REL) independently, but it seems like training both should be beneficial. I will then delve into your example and get back with questions and results.

Thanks Sofie.

Hi @SofieVL, I am trying to combine both NER and REL, but it seems to fail.
I am following this blog.

The GitHub repo provides sample data in spaCy format, which includes ents and rels.
These are the changes I have made to the config:

%%writefile /content/relation_extraction_transformer/rel_component/configs/rel_trf.cfg
[paths]
train = null
dev = null
raw = null
init_tok2vec = null

[system]
seed = 342
gpu_allocator = "pytorch"

[nlp]
lang = "en"
pipeline = ["transformer", "ner","relation_extractor"]
disabled = []
before_creation = null
after_creation = null
after_pipeline_creation = null
tokenizer = {"@tokenizers":"spacy.Tokenizer.v1"}
batch_size = 1000

[components]

[components.transformer]
factory = "transformer"
max_batch_items = 4096
set_extra_annotations = {"@annotation_setters":"spacy-transformers.null_annotation_setter.v1"}

[components.transformer.model]
@architectures = "spacy-transformers.TransformerModel.v1"
name = "roberta-base"
tokenizer_config = {"use_fast": true}

[components.transformer.model.get_spans]
@span_getters = "spacy-transformers.strided_spans.v1"
window = 64
stride = 48

[components.ner]
factory = "ner"

[components.ner.model]
@architectures = "spacy.TransitionBasedParser.v2"
state_type = "ner"
extra_state_tokens = false
hidden_width = 64
maxout_pieces = 2
use_upper = false
nO = null

[components.ner.model.tok2vec]
@architectures = "spacy-transformers.TransformerListener.v1"
grad_factor = 1.0

[components.ner.model.tok2vec.pooling]
@layers = "reduce_mean.v1"

[components.relation_extractor]
factory = "relation_extractor"
threshold = 0.5

[components.relation_extractor.model]
@architectures = "rel_model.v1"

[components.relation_extractor.model.create_instance_tensor]
@architectures = "rel_instance_tensor.v1"

[components.relation_extractor.model.create_instance_tensor.tok2vec]
@architectures = "spacy-transformers.TransformerListener.v1"
grad_factor = 1.0

[components.relation_extractor.model.create_instance_tensor.tok2vec.pooling]
@layers = "reduce_mean.v1"

[components.relation_extractor.model.create_instance_tensor.pooling]
@layers = "reduce_mean.v1"

[components.relation_extractor.model.create_instance_tensor.get_instances]
@misc = "rel_instance_generator.v1"
max_length = 20

[components.relation_extractor.model.classification_layer]
@architectures = "rel_classification_layer.v1"
nI = null
nO = null

[initialize]

[initialize.components]

[corpora]

[corpora.dev]
@readers = "Gold_ents_Corpus.v1"
file = ${paths.dev}

[corpora.train]
@readers = "Gold_ents_Corpus.v1"
file = ${paths.train}

[training]
seed = ${system.seed}
gpu_allocator = ${system.gpu_allocator}
dropout = 0.1
accumulate_gradient = 1
patience = 1600000
max_epochs = 0
max_steps = 1000
eval_frequency = 100
frozen_components = []
dev_corpus = "corpora.dev"
train_corpus = "corpora.train"
before_to_disk = null
logger = {"@loggers":"spacy.ConsoleLogger.v1"}

[training.batcher]
@batchers = "spacy.batch_by_padded.v1"
discard_oversize = true
size = 2000
buffer = 256

[training.optimizer]
@optimizers = "Adam.v1"
beta1 = 0.9
beta2 = 0.999
L2_is_weight_decay = true
L2 = 0.01
grad_clip = 1.0
use_averages = false
eps = 0.00000001

[training.optimizer.learn_rate]
@schedules = "warmup_linear.v1"
warmup_steps = 250
total_steps = 20000
initial_rate = 5e-5

[training.score_weights]
rel_micro_p = 0.0
rel_micro_r = 0.0
rel_micro_f = 1.0

I have only changed the config file. Can you please tell me what other changes I have to make?

Hi @kbrajwani, if the question is about the repository at GitHub - walidamamou/relation_extraction_transformer, it might make more sense to contact the authors there specifically. If you're running into spaCy-specific issues, it would be good to elaborate a bit on what exactly the problem is (what code are you executing and which errors are you getting), and to post a new message detailing that over at Discussions · explosion/spaCy · GitHub. If you're running into Prodigy-specific issues, can you elaborate further?

Thanks!

I am only taking the data from that repo.
I get the following error:

================================= train_gpu =================================
Running command: /usr/bin/python3 -m spacy train configs/rel_trf.cfg --output training --paths.train /content/relation_extraction_transformer/relations_training.spacy --paths.dev /content/relation_extraction_transformer/relations_dev.spacy -c ./scripts/custom_functions.py --gpu-id 0
ℹ Saving to output directory: training
ℹ Using GPU: 0

=========================== Initializing pipeline ===========================
[2021-08-24 14:36:12,074] [INFO] Set up nlp object from config
[2021-08-24 14:36:12,083] [INFO] Pipeline: ['transformer', 'ner', 'relation_extractor']
[2021-08-24 14:36:12,087] [INFO] Created vocabulary
[2021-08-24 14:36:12,088] [INFO] Finished initializing nlp object
Downloading: 100% 481/481 [00:00<00:00, 441kB/s]
Downloading: 100% 899k/899k [00:00<00:00, 2.98MB/s]
Downloading: 100% 456k/456k [00:00<00:00, 1.50MB/s]
Downloading: 100% 1.36M/1.36M [00:00<00:00, 4.31MB/s]
Downloading: 100% 501M/501M [00:07<00:00, 63.2MB/s]
Some weights of the model checkpoint at roberta-base were not used when initializing RobertaModel: ['lm_head.bias', 'lm_head.layer_norm.bias', 'lm_head.dense.bias', 'lm_head.decoder.weight', 'lm_head.layer_norm.weight', 'lm_head.dense.weight']
- This IS expected if you are initializing RobertaModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
[2021-08-24 14:36:40,591] [INFO] Initialized pipeline components: ['transformer', 'ner', 'relation_extractor']
✔ Initialized pipeline

============================= Training pipeline =============================
ℹ Pipeline: ['transformer', 'ner', 'relation_extractor']
ℹ Initial learn rate: 0.0
E    #       LOSS TRANS...  LOSS NER  LOSS RELAT...  ENTS_F  ENTS_P  ENTS_R  REL_MICRO_P  REL_MICRO_R  REL_MICRO_F  SCORE 
---  ------  -------------  --------  -------------  ------  ------  ------  -----------  -----------  -----------  ------
⚠ Aborting and saving the final best model. Encountered exception:
KeyError("[E900] Could not run the full pipeline for evaluation. If you
specified frozen components, make sure they were already initialized and
trained. Full pipeline: ['transformer', 'ner', 'relation_extractor']")
Traceback (most recent call last):
  File "/usr/local/lib/python3.7/dist-packages/spacy/training/loop.py", line 281, in evaluate
    scores = nlp.evaluate(dev_corpus(nlp))
  File "/usr/local/lib/python3.7/dist-packages/spacy/language.py", line 1389, in evaluate
    results = scorer.score(examples)
  File "/usr/local/lib/python3.7/dist-packages/spacy/scorer.py", line 135, in score
    scores.update(component.score(examples, **self.cfg))
  File "/content/relation_extraction_transformer/rel_component/scripts/rel_pipe.py", line 201, in score
    return score_relations(examples, self.threshold)
  File "/content/relation_extraction_transformer/rel_component/scripts/rel_pipe.py", line 211, in score_relations
    gold_labels = [k for (k, v) in gold[key].items() if v == 1.0]
KeyError: (0, 1)

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/usr/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/usr/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/usr/local/lib/python3.7/dist-packages/spacy/__main__.py", line 4, in <module>
    setup_cli()
  File "/usr/local/lib/python3.7/dist-packages/spacy/cli/_util.py", line 69, in setup_cli
    command(prog_name=COMMAND)
  File "/usr/local/lib/python3.7/dist-packages/click/core.py", line 829, in __call__
    return self.main(*args, **kwargs)
  File "/usr/local/lib/python3.7/dist-packages/click/core.py", line 782, in main
    rv = self.invoke(ctx)
  File "/usr/local/lib/python3.7/dist-packages/click/core.py", line 1259, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/usr/local/lib/python3.7/dist-packages/click/core.py", line 1066, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/usr/local/lib/python3.7/dist-packages/click/core.py", line 610, in invoke
    return callback(*args, **kwargs)
  File "/usr/local/lib/python3.7/dist-packages/typer/main.py", line 497, in wrapper
    return callback(**use_params)  # type: ignore
  File "/usr/local/lib/python3.7/dist-packages/spacy/cli/train.py", line 63, in train_cli
    train(nlp, output_path, use_gpu=use_gpu, stdout=sys.stdout, stderr=sys.stderr)
  File "/usr/local/lib/python3.7/dist-packages/spacy/training/loop.py", line 122, in train
    raise e
  File "/usr/local/lib/python3.7/dist-packages/spacy/training/loop.py", line 105, in train
    for batch, info, is_best_checkpoint in training_step_iterator:
  File "/usr/local/lib/python3.7/dist-packages/spacy/training/loop.py", line 226, in train_while_improving
    score, other_scores = evaluate()
  File "/usr/local/lib/python3.7/dist-packages/spacy/training/loop.py", line 283, in evaluate
    raise KeyError(Errors.E900.format(pipeline=nlp.pipe_names)) from e
KeyError: "[E900] Could not run the full pipeline for evaluation. If you specified frozen components, make sure they were already initialized and trained. Full pipeline: ['transformer', 'ner', 'relation_extractor']"

Ok, right. Can you open a new issue on the spaCy discussion forum? That's a more suitable place, as this is not Prodigy-related. Thanks!

Sure, thanks.

Hi,
I want to train an NER and REL model for my use case. To do that, I need to annotate data and save it in spaCy's binary format. I have a question about the annotation: I am using Label Studio, which annotates both relations and named entities, but how should I convert its output to the .spacy binary format?
Basically, what is the expected structure of the annotated .txt file? In the output from Label Studio I can see a lot of unnecessary key-value pairs as well.
Please help me understand the structure or format the annotated data needs to have for conversion into the .spacy binary format.

Thanks.