Misaligned Entities automatic fix

Aziz · June 30, 2021, 10:04am

I'm trying to train an NER model using prodigy train ner with 'en_vectors_web_lg'. The annotation was done using ner.manual with 'en_core_web_sm', and some post-processing was applied afterwards. I'm running into misaligned tokens, which makes the model ignore a huge part of the data that I can't afford to lose. It's also tedious to fix them manually. Has anyone here managed to fix them automatically?

Example of the misaligned entities, found using Ines's script:

ines · June 30, 2021, 10:36am

Hi! There's not really an easy general-purpose solution because there are many different causes: sometimes it's just a preprocessing error like an off-by-one, and sometimes it points to the tokenization rules not being detailed enough, which would require customising the punctuation rules.

Could you share one of the examples and span annotations as plain text so it's easier to inspect them?

In the examples you posted, it doesn't look like the mismatches are caused by punctuation, though (e.g. something like "a_b" being tokenized as ["a_b"] instead of ["a", "_", "b"]). The relevant tokens all seem to refer to standalone, whitespace-delimited tokens.

So are you sure it's not an off-by-one error in the spans caused by your post-processing? You can check this by looking at doc.text[start:end] (i.e. a slice of the plain text) and see what those are referring to.
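A minimal sketch of that check, in plain Python so it runs without spaCy. The `text` and `spans` values are made up for illustration, and `show_span_texts` is a hypothetical helper name; with real data you would pass in the "text" and "spans" of each annotated example (or doc.text).

```python
def show_span_texts(text, spans):
    """Return (label, substring) pairs so each span can be inspected by eye."""
    return [(span["label"], text[span["start"]:span["end"]]) for span in spans]


# Made-up example: the first span is correct, the second is off by one.
text = "Mizuho Bank, Ltd., as Administrative Agent"
spans = [
    {"start": 0, "end": 17, "label": "ORG"},
    {"start": 21, "end": 41, "label": "ROLE"},
]

for label, substring in show_span_texts(text, spans):
    # repr() makes stray leading or trailing whitespace visible
    print(label, repr(substring))
```

Here the second span comes out with a leading space and without its final letter, which is the typical signature of an off-by-one in the offsets rather than a tokenization problem.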

Aziz · June 30, 2021, 10:55am

, Mizuho Bank, Ltd., : {'start': 754, 'end': 770, 'token_start': 178, 'token_end': 181, 'label': 'ORG'}
Mizuho
[('TOKEN', 'Mizuho')]
Ltd.
[('SPECIAL-1', 'Ltd.')]

(i) an IRS Form W-8BEN or IRS Form W‑8BEN-E, as applicable, or (ii) an IRS Form W-8IMY accompanied by an IRS Form W-8BEN or an IRS Form W-8BEN-E, : {'start': 1053, 'end': 1067, 'token_start': 201, 'token_end': 203, 'label': 'REG'}
IRS
[('TOKEN', 'IRS')]
W-8BEN
[('TOKEN', 'W-8BEN')]

and Swing Line Lenders named : {'start': 825, 'end': 842, 'token_start': 192, 'token_end': 194, 'label': 'ROLE'}
Swing
[('TOKEN', 'Swing')]
Lenders
[('TOKEN', 'Lenders')]

ines · June 30, 2021, 11:24am

Can you share an example that has the full text the span offsets are referring to? That's the interesting part because you want to be mapping the start and end character offsets back into the original text to see which exact string they describe. This way, you can find out what the problem is and whether it's an off-by-one error for example.

Aziz · June 30, 2021, 2:26pm

Misaligned tokens Metropolitan Edison Company, a Pennsylvania corporation ("Met-Ed"), Ohio Edison Company, an Ohio corporation ("OE"), Pennsylvania Power Company, a Pennsylvania corporation ("Penn"), The Toledo Edison Company, an Ohio corporation ("TE"), Jersey Central Power & Light Company, a New Jersey corporation ("JCP&L"), Monongahela Power Company, an Ohio corporation ("MP"), Pennsylvania Electric Company, a Pennsylvania corporation ("Penelec"), The Potomac Edison Company, a Maryland and Virginia corporation ("PE"), West Penn Power Company, a Pennsylvania corporation ("West-Penn", and together with FE, CEI, Met-Ed, OE, Penn, TE, JCP&L, MP, Penelec and PE, the "Borrowers" and each a "Borrower"), the Lenders named therein and party thereto from time to time, Mizuho Bank, Ltd., as Administrative Agent, and the Fronting Banks and Swing Line Lenders named therein and party thereto from time to time. Pursuant to the provisions of Section 2.16 of the Credit Agreement, the undersigned hereby certifies that (i) it is the sole record owner of the Advance(s) (as well as any Note(s) evidencing such Advance(s)) in respect of which it is providing this certificate {'start': 925, 'end': 936, 'token_start': 210, 'token_end': 211, 'label': 'REF'}

Section
[('TOKEN', 'Section')]
2.16
[('TOKEN', '2.16')]

ines · June 30, 2021, 2:35pm

It looks like your span annotations somehow ended up being off? If you look at the annotated span at text[925:936], the text span it refers to is "the provisi", while the real span you're looking for, "Section 2.16", only starts at character 943.
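The same arithmetic can be reproduced on a fragment of the text. This is only a sketch: the fragment below starts at "Pursuant", so its offsets are relative to the fragment and not to the 925, 936 and 943 offsets of the full document, and `expected` is an assumption about which string the annotation was meant to cover.

```python
# Fragment of the example text. Offsets are relative to this fragment,
# not to the full document.
text = "Pursuant to the provisions of Section 2.16 of the Credit Agreement"
expected = "Section 2.16"  # assumed intended span text

start, end = 12, 23               # same place in the sentence as 925:936
annotated = text[start:end]
true_start = text.find(expected)  # -1 if the string is not present
shift = true_start - start        # how far the annotated start is off

print(repr(annotated))
print(true_start, shift)
```

The shift of 18 characters matches 943 - 925 in the full text. Note that the annotated span is 11 characters long while "Section 2.16" is 12, so adding a single constant to both offsets would probably not be enough to repair it.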

How did you post-process your data? Maybe something went wrong here and you ended up with the text shifting in a way that made the original spans not align anymore?
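If the text from before the post-processing is still available, one possible way to carry the spans over is to align the two versions of the text and map the character offsets across. This is not a Prodigy or spaCy feature, just a rough sketch using difflib from the Python standard library. The helper names `build_offset_map` and `remap_span` and both texts are hypothetical, and the example assumes the post-processing collapsed double spaces.

```python
from difflib import SequenceMatcher


def build_offset_map(old_text, new_text):
    """Map every character offset in old_text to an offset in new_text."""
    matcher = SequenceMatcher(None, old_text, new_text, autojunk=False)
    mapping = [0] * (len(old_text) + 1)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        for i in range(i1, i2):
            # Unchanged characters keep their relative position; deleted or
            # replaced characters collapse onto the start of their block.
            mapping[i] = j1 + (i - i1) if tag == "equal" else j1
    mapping[len(old_text)] = len(new_text)
    return mapping


def remap_span(span, mapping):
    """Return a copy of the span with its offsets moved into the new text."""
    return {**span, "start": mapping[span["start"]], "end": mapping[span["end"]]}


# Made-up texts: the original has double spaces, the processed one does not.
old_text = "the  provisions  of  Section 2.16 of the Credit Agreement"
new_text = "the provisions of Section 2.16 of the Credit Agreement"
old_span = {"start": 21, "end": 33, "label": "REF"}  # "Section 2.16" in old_text

new_span = remap_span(old_span, build_offset_map(old_text, new_text))
print(new_text[new_span["start"]:new_span["end"]])
```

Spans that overlap an edited region can still come out wrong, so it would be worth re-checking every remapped span against its text slice before training.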

Aziz · July 2, 2021, 3:31pm

Yep, that was it, Ines. Thank you!
