Entity Linking (prodigy training)

Hi Team,

I'm trying to replicate nel_emerson on my own dataset. I already have prelabeled data and generated the corpus like this:

import spacy
from spacy.matcher import PhraseMatcher
from spacy.tokens import Span

docs = []

for obj in data:
    doc = nlp(obj['text'])

    s = skills[obj['meta']['listingId']]
    labels = filterSkillsByConfidence(s)
    
    matcher = PhraseMatcher(nlp.vocab, attr="LOWER")
    matcher.add("SKILL", [nlp(cls['value']) for cls in labels])
    matches = matcher(doc)
    
    entities = list()
    for match_id, start, end in matches:
        span = Span(doc, start, end, label='SKILL')
        match = next(filter(lambda x: x['value'] == span.text, s), None)
        if match:
            skill = match['skills'][0]
            span.kb_id_ = skill['id']
            entities.append(span)

    doc.ents = spacy.util.filter_spans(entities)
    
    docs.append(doc)

Then I split it into training and test sets:

from spacy.tokens import DocBin

train_docs = DocBin()
test_docs = DocBin()

test_index = int(len(data) * 0.2)

for index in range(0, len(docs)-test_index):
    train_docs.add(docs[index])

for index in range(len(docs)-test_index, len(docs)):
    test_docs.add(docs[index])

print(len(train_docs), len(test_docs))
    
train_docs.to_disk('corpus/train.spacy')
test_docs.to_disk('corpus/test.spacy')

When I run training:
python -m spacy train configs/nel.cfg --output training --paths.train corpus/train.spacy --paths.dev corpus/test.spacy --paths.kb tmp/kb --paths.base_nlp tmp/model -c scripts/custom_functions.py

I get this warning, and the accuracy stays constant at 0.17.

/Users/fed/Library/Caches/pypoetry/virtualenvs/nel-riFBMyAx-py3.9/lib/python3.9/site-packages/spacy/pipeline/entity_linker.py:276: UserWarning: [W093] Could not find any data to train the Entity Linker on. Is your input data correctly formatted?

Any ideas what it could be?

Also, the KB was created like this:

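import os

import spacy
from spacy.kb import KnowledgeBase

kb_loc = 'tmp/kb'
nlp_dir = 'tmp/model'

# exclude expects a list of component names, not one comma-separated string
nlp = spacy.load(vectors_model, exclude=["parser", "tagger", "lemmatizer"])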
nlp.add_pipe("sentencizer", first=True)

kb = KnowledgeBase(vocab=nlp.vocab, entity_vector_length=300)

for skill in skills:
    desc_doc = nlp(skill['description']) if skill['description'] is not None else nlp(skill['name'])
    desc_enc = desc_doc.vector
    kb.add_entity(entity=skill['id'], entity_vector=desc_enc, freq=342)
    kb.add_alias(alias=skill['name'], entities=[skill['id']], probabilities=[1])

print(f"Entities in the KB: {len(kb.get_entity_strings())}")
print(f"Aliases in the KB: {kb.get_alias_strings()}")
print()

kb.to_disk(kb_loc)
if not os.path.exists(nlp_dir):
    os.mkdir(nlp_dir)
nlp.to_disk(nlp_dir)

I removed the EntityRuler here because I wasn't sure what role it played :frowning: so that is probably where I made my mistake.

I would love to get any help here.

My assumption is that I need to write patterns (EntityRuler) into the KB for any spans that were matched (labeled as SKILL) and then map them back to a KB entity, and if an alias maps to multiple entities, reduce the probability value for each?
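
To spell out what I mean by reducing the probability, here is a rough sketch of how the prior for each alias could be derived from annotation counts. The function name `alias_priors` and the sample IDs are made up for illustration, and I am assuming the annotations are available as (alias, entity id) pairs:

```python
from collections import Counter, defaultdict


def alias_priors(annotations):
    """Estimate P(entity | alias) from (alias, entity_id) annotation pairs."""
    counts = defaultdict(Counter)
    for alias, entity_id in annotations:
        counts[alias][entity_id] += 1

    priors = {}
    for alias, entity_counts in counts.items():
        total = sum(entity_counts.values())
        # Relative frequencies, so the values sum to 1 for each alias.
        priors[alias] = {e: n / total for e, n in entity_counts.items()}
    return priors


# Hypothetical annotations for illustration only.
annotations = [
    ("Java", "skill_java_lang"),
    ("Java", "skill_java_lang"),
    ("Java", "skill_java_lang"),
    ("Java", "skill_java_ee"),
    ("SQL", "skill_sql"),
]

for alias, dist in alias_priors(annotations).items():
    print(alias, dist)
```

Each alias would then go into a single kb.add_alias call, with the entity IDs as entities and the matching values as probabilities. As far as I understand, the probabilities for one alias must not sum to more than 1.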


Ok, I think this change, Update NEL prodigy script (#40) · explosion/projects@9399cc1 · GitHub, was made specifically so that ORG is detected instead of PERSON, so I don't really need that EntityRuler.

============================ Data file validation ============================
✔ Pipeline can be initialized with data
✔ Corpus is loadable

=============================== Training stats ===============================
Language: en
Training pipeline: sentencizer, ner, entity_linker
Frozen components: sentencizer, ner
235 training docs
58 evaluation docs
✔ No overlap between training and evaluation data
⚠ Low number of examples to train a new pipeline (235)

============================== Vocab & Vectors ==============================
ℹ 155932 total word(s) in the data (9299 unique)
ℹ 20000 vectors (684830 unique keys, 300 dimensions)
⚠ 16028 words in training data without vectors (10%)

========================== Named Entity Recognition ==========================
ℹ 19 label(s)
0 missing value(s) (tokens with '-' label)
⚠ Some model labels are not present in the train data. The model
performance may be degraded for these labels after training: 'QUANTITY', 'TIME',
'GPE', 'ORG', 'LOC', 'PERSON', 'PRODUCT', 'LANGUAGE', 'PERCENT', 'ORDINAL',
'LAW', 'FAC', 'NORP', 'WORK_OF_ART', 'DATE', 'EVENT', 'MONEY', 'CARDINAL'.
✔ Good amount of examples for all labels
✔ Examples without occurrences available for all labels
✔ No entities consisting of or starting/ending with whitespace

================================== Summary ==================================
✔ 6 checks passed
⚠ 3 warnings

Instead of giving multiple spans with a kb_id for a large chunk of text (I found that assumption here: projects/create_corpus.py at e34a56ad5f22ef91a096e08b54481b69da657682 · explosion/projects · GitHub),
I created sentences with a single-span object and recreated the corpus.
This way I don't get any warnings during training, but the training results look really bad:

============================= Training pipeline =============================
ℹ Pipeline: ['sentencizer', 'ner', 'entity_linker']
ℹ Frozen components: ['sentencizer', 'ner']
ℹ Initial learn rate: 0.001
E    #       LOSS ENTIT...  SENTS_F  SENTS_P  SENTS_R  ENTS_F  ENTS_P  ENTS_R  NEL_MICRO_F  NEL_MICRO_R  NEL_MICRO_P  SCORE 
---  ------  -------------  -------  -------  -------  ------  ------  ------  -----------  -----------  -----------  ------
  0       0           0.99   100.00   100.00   100.00    0.00    0.00    0.00        40.00        25.00       100.00    0.46
  0     200          49.97   100.00   100.00   100.00    0.00    0.00    0.00        40.00        25.00       100.00    0.46
  0     400          28.01   100.00   100.00   100.00    0.00    0.00    0.00        40.00        25.00       100.00    0.46
  0     600          32.74   100.00   100.00   100.00    0.00    0.00    0.00        40.00        25.00       100.00    0.46

I think I found the issue: Knowledge Base aliases are mandatory and case-sensitive.
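
A quick way to check for this, as a rough sketch: compare the mention strings in the corpus against the alias strings in the KB using exact, case-sensitive comparison. The helper name `find_unlinkable_mentions` and the sample strings are made up for illustration. In the real setup the aliases would come from kb.get_alias_strings() and the mentions from the text of each annotated span.

```python
def find_unlinkable_mentions(mentions, aliases):
    """Return (mention, near_miss) pairs for mentions without an exact alias.

    near_miss is an alias that differs only by case, or None if there is none.
    """
    alias_set = set(aliases)
    # Map lowercased alias -> original alias to suggest near misses.
    lowered = {a.lower(): a for a in aliases}
    problems = []
    for mention in mentions:
        if mention in alias_set:
            continue  # exact match, so candidates can be looked up
        problems.append((mention, lowered.get(mention.lower())))
    return problems


# Hypothetical data for illustration only.
aliases = ["Python", "Project Management"]
mentions = ["Python", "python", "project management", "Kubernetes"]

for mention, near_miss in find_unlinkable_mentions(mentions, aliases):
    if near_miss:
        print(f"{mention!r} has no alias; differs only by case from {near_miss!r}")
    else:
        print(f"{mention!r} has no alias at all")
```

My PhraseMatcher ran with attr="LOWER", so the matched spans could differ in case from the skill names I registered as aliases, which is probably how the mismatch crept in.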

Hi Fedya, apologies for the late follow-up!

What the comment at projects/create_corpus.py at e34a56ad5f22ef91a096e08b54481b69da657682 · explosion/projects · GitHub refers to is not that we annotate the full sentence with 1 KB ID, but rather that at that point we create an instance with just a single span in it - i.e. exactly what you did when you said

I created sentences with a single-span object and recreated the corpus.

I just wanted to clarify that point.

But other than that, it looks like you were able to resolve your issue? Is training working well now, or do you have any remaining issues?

It looks like I had two issues that were causing the unexpected results, so if anybody arrives here, please check the following:

  • train on a single span at a time, each with its own kb_id

  • you MUST have aliases that match the mention text exactly (case-sensitive)
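
For the first point, this is roughly the logic I mean, written as a plain-Python sketch with character offsets. The function name `split_into_single_span_examples`, the record layout, and the sample IDs are my own assumptions for illustration, not the project's code:

```python
def split_into_single_span_examples(text, sentences, entities):
    """Create one training record per annotated span.

    sentences: list of (start_char, end_char) offsets into text
    entities:  list of (start_char, end_char, kb_id) offsets into text
    """
    examples = []
    for ent_start, ent_end, kb_id in entities:
        for sent_start, sent_end in sentences:
            # Keep the span only if it lies fully inside this sentence.
            if sent_start <= ent_start and ent_end <= sent_end:
                examples.append({
                    "text": text[sent_start:sent_end],
                    # Offsets are re-based to the start of the sentence.
                    "span": (ent_start - sent_start, ent_end - sent_start),
                    "kb_id": kb_id,
                })
                break
    return examples


# Hypothetical data for illustration only.
text = "We need Python and SQL. Docker is a plus."
sentences = [(0, 23), (24, 41)]
entities = [(8, 14, "skill_python"), (19, 22, "skill_sql"), (24, 30, "skill_docker")]

examples = split_into_single_span_examples(text, sentences, entities)
for ex in examples:
    start, end = ex["span"]
    print(ex["kb_id"], repr(ex["text"][start:end]), "in", repr(ex["text"]))
```

A sentence with two skills produces two records that share the same text but carry one span each. Each record would then be turned into its own Doc with a single entity (for example via doc.char_span with label and kb_id) before being added to the DocBin.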

Anyway, thanks @SofieVL for the support!
