NER model fails to detect if first word is an entity unless a non-whitespace character is added to the start of the text

AdirthaBorgohain · December 1, 2021, 1:47pm

I trained an NER model using spaCy with a transformer model, and I am facing a strange issue. Suppose my input text is:
Endothelial cells (HAECs) with nicotine resulted in NLRP3 ASC inflammasome activation.

In this case, the model detects the other entities as expected (e.g. nicotine as drug, NLRP3 as gene), but it fails to detect Endothelial cells as anatomy. However, if I add any non-whitespace character, such as a comma, to the start of the text (e.g. ,Endothelial cells (HAECs) with nicotine resulted in NLRP3 ASC inflammasome activation.), it correctly detects Endothelial cells as anatomy. What could be the reason for this? I have been looking around but have not been able to figure out a possible cause.
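
To make the behaviour easier to reproduce and report, here is a minimal sketch that runs the same text with and without a prefix character and lists the entities that differ. The helper names (`spacy_predictor`, `compare_with_prefix`) and the model path in the usage comment are hypothetical and not part of spaCy; only `nlp(text).ents`, `start_char`, `end_char`, and `label_` are real spaCy attributes.

```python
from typing import Callable, Dict, List, Tuple

Entity = Tuple[int, int, str]  # (start_char, end_char, label)
Predictor = Callable[[str], List[Entity]]


def spacy_predictor(nlp) -> Predictor:
    """Wrap a loaded spaCy pipeline so it returns (start_char, end_char, label) tuples."""
    def predict(text: str) -> List[Entity]:
        return [(ent.start_char, ent.end_char, ent.label_) for ent in nlp(text).ents]
    return predict


def compare_with_prefix(predict: Predictor, text: str, prefix: str = ",") -> Dict[str, List[Entity]]:
    """Compare predictions on `text` and on `prefix + text`.

    Offsets from the prefixed run are shifted back by len(prefix) so both
    runs refer to positions in the original text. Entities that overlap
    the prefix itself are ignored.
    """
    shift = len(prefix)
    plain = set(predict(text))
    prefixed = {
        (start - shift, end - shift, label)
        for start, end, label in predict(prefix + text)
        if start >= shift
    }
    return {
        "only_with_prefix": sorted(prefixed - plain),
        "only_without_prefix": sorted(plain - prefixed),
    }


# Usage with a trained pipeline (the path is a placeholder, not a real model name):
#   import spacy
#   nlp = spacy.load("path/to/trained/model")
#   text = "Endothelial cells (HAECs) with nicotine resulted in NLRP3 ASC inflammasome activation."
#   print(compare_with_prefix(spacy_predictor(nlp), text))
```

If the issue is as described, the missing anatomy entity should appear under `only_with_prefix` with a start offset of 0.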

The config file I am using:

[paths]
train = null
dev = null
vectors = null
init_tok2vec = null

[system]
gpu_allocator = "pytorch"
seed = 0

[nlp]
lang = "en"
pipeline = ["transformer","ner"]
batch_size = 128
disabled = []
before_creation = null
after_creation = null
after_pipeline_creation = null
tokenizer = {"@tokenizers":"spacy.Tokenizer.v1"}

[components]

[components.ner]
factory = "ner"
incorrect_spans_key = null
moves = null
update_with_oracle_cut_size = 100

[components.ner.model]
@architectures = "spacy.TransitionBasedParser.v2"
state_type = "ner"
extra_state_tokens = false
hidden_width = 64
maxout_pieces = 2
use_upper = false
nO = null

[components.ner.model.tok2vec]
@architectures = "spacy-transformers.TransformerListener.v1"
grad_factor = 1.0
pooling = {"@layers":"reduce_mean.v1"}
upstream = "*"

[components.transformer]
factory = "transformer"
max_batch_items = 4096
set_extra_annotations = {"@annotation_setters":"spacy-transformers.null_annotation_setter.v1"}

[components.transformer.model]
@architectures = "spacy-transformers.TransformerModel.v3"
name = "microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract-fulltext"
mixed_precision = false

[components.transformer.model.get_spans]
@span_getters = "spacy-transformers.strided_spans.v1"
window = 128
stride = 96

[components.transformer.model.grad_scaler_config]

[components.transformer.model.tokenizer_config]
use_fast = true

[components.transformer.model.transformer_config]

[corpora]

[corpora.dev]
@readers = "spacy.Corpus.v1"
path = ${paths.dev}
max_length = 0
gold_preproc = false
limit = 0
augmenter = null

[corpora.train]
@readers = "spacy.Corpus.v1"
path = ${paths.train}
max_length = 0
gold_preproc = false
limit = 0
augmenter = null

[training]
accumulate_gradient = 3
dev_corpus = "corpora.dev"
train_corpus = "corpora.train"
seed = ${system.seed}
gpu_allocator = ${system.gpu_allocator}
dropout = 0.1
patience = 1600
max_epochs = 0
max_steps = 20000
eval_frequency = 200
frozen_components = []
annotating_components = []
before_to_disk = null

[training.batcher]
@batchers = "spacy.batch_by_padded.v1"
discard_oversize = true
size = 2000
buffer = 256
get_length = null

[training.logger]
@loggers = "spacy.ConsoleLogger.v1"
progress_bar = false

[training.optimizer]
@optimizers = "Adam.v1"
beta1 = 0.9
beta2 = 0.999
L2_is_weight_decay = true
L2 = 0.01
grad_clip = 1.0
use_averages = false
eps = 0.00000001

[training.optimizer.learn_rate]
@schedules = "warmup_linear.v1"
warmup_steps = 250
total_steps = 20000
initial_rate = 0.00005

[training.score_weights]
ents_f = 1.0
ents_p = 0.0
ents_r = 0.0
ents_per_type = null

[pretraining]

[initialize]
vectors = ${paths.vectors}
init_tok2vec = ${paths.init_tok2vec}
vocab_data = null
lookups = null
before_init = null
after_init = null

[initialize.components]

[initialize.tokenizer]

12dmj · December 1, 2021, 3:03pm

I had similar problems with NER when training a model. I had to clean up my data to get it to work.
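
The reply does not say what the cleanup involved. As one possible check, assuming the training annotations are available as `(text, spans)` pairs with character offsets, the sketch below counts how many annotated entities start at offset 0 and flags spans whose text has leading or trailing whitespace. The function name `audit_spans` and the input format are assumptions for illustration, not a spaCy or Prodigy API.

```python
from typing import Dict, Iterable, List, Tuple

Span = Tuple[int, int, str]  # (start_char, end_char, label)


def audit_spans(examples: Iterable[Tuple[str, List[Span]]]) -> Dict[str, object]:
    """Collect simple statistics about annotated spans.

    Reports the total number of spans, how many begin at the very first
    character of their text, and which spans include whitespace at either
    edge (returned as (example_index, start, end, label) tuples).
    """
    report = {"total_spans": 0, "starts_at_zero": 0, "whitespace_edges": []}
    for idx, (text, spans) in enumerate(examples):
        for start, end, label in spans:
            report["total_spans"] += 1
            if start == 0:
                report["starts_at_zero"] += 1
            surface = text[start:end]
            # A span such as " nicotine" usually indicates misaligned offsets.
            if surface != surface.strip():
                report["whitespace_edges"].append((idx, start, end, label))
    return report
```

A very low `starts_at_zero` count relative to `total_spans` would suggest the model has seen few text-initial entities during training, though that alone would not prove it is the cause here.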

adriane · December 1, 2021, 3:49pm

This isn't really about Prodigy, so let's keep this discussion on the spaCy discussions board: NER model fails to detect if first word is an entity unless a non-whitespace character is added to the start of the text · Discussion #9785 · explosion/spaCy · GitHub
