Misaligned Entities automatic fix

I'm trying to train an NER model using prodigy train ner with en_vectors_web_lg. The annotation was done using ner.manual with en_core_web_sm, and some post-processing was also applied. I'm running into misaligned tokens, which makes the model ignore a large part of the data that I can't afford to lose. Fixing them manually is also tedious. Has anyone here managed to fix them automatically?

Example of the misaligned entities using Ines's script:

Hi! There's not really an easy general-purpose solution because there are many different causes: sometimes it's just a preprocessing error like an off-by-one, and sometimes it points to the tokenization rules not being detailed enough, which would require customising the punctuation rules.

Could you share one of the examples and span annotations as plain text so it's easier to inspect them?

In the examples you posted, it doesn't look like the mismatches are caused by punctuation, though (e.g. something like "a_b" being tokenized as ["a_b"] instead of ["a", "_", "b"]). The relevant tokens all seem to refer to standalone, whitespace-delimited tokens.

So are you sure it's not an off-by-one error in the spans caused by your post-processing? You can check this by looking at doc.text[start:end] (i.e. a slice of the plain text) and see what those are referring to.
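The check described above can be sketched in plain Python. This is only a minimal illustration, assuming each example is a dict with a "text" key and a "spans" list of character offsets, as in the annotations posted in this thread. The helper name inspect_spans and the "splits a word" heuristic are made up for this example and are not part of Prodigy or spaCy.

```python
def splits_word(text, i):
    """True if offset i falls between two alphanumeric characters."""
    return 0 < i < len(text) and text[i - 1].isalnum() and text[i].isalnum()


def inspect_spans(example):
    """Return (label, start, end, sliced_text, looks_ok) for every span.

    Hypothetical helper: it only slices the raw text and applies a rough
    heuristic, it does not run the real tokenizer.
    """
    text = example["text"]
    report = []
    for span in example.get("spans", []):
        start, end = span["start"], span["end"]
        snippet = text[start:end]
        looks_ok = (
            snippet != ""
            and snippet == snippet.strip()      # no leading/trailing whitespace
            and not splits_word(text, start)    # does not start mid-word
            and not splits_word(text, end)      # does not end mid-word
        )
        report.append((span["label"], start, end, snippet, looks_ok))
    return report


# Toy example: the first span is shifted, the second one is correct.
example = {
    "text": "Pursuant to the provisions of Section 2.16 of the Credit Agreement",
    "spans": [
        {"start": 12, "end": 23, "label": "REF"},
        {"start": 30, "end": 42, "label": "REF"},
    ],
}
report = inspect_spans(example)
for row in report:
    print(row)
```

A span that is flagged here is worth looking at by eye. With a spaCy pipeline loaded, doc.char_span(start, end) gives a stricter answer, since it returns None when the offsets don't line up with token boundaries.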

, Mizuho Bank, Ltd., : {'start': 754, 'end': 770, 'token_start': 178, 'token_end': 181, 'label': 'ORG'}
[('TOKEN', 'Mizuho')]
[('SPECIAL-1', 'Ltd.')]

(i) an IRS Form W-8BEN or IRS Form W‑8BEN-E, as applicable, or (ii) an IRS Form W-8IMY accompanied by an IRS Form W-8BEN or an IRS Form W-8BEN-E, : {'start': 1053, 'end': 1067, 'token_start': 201, 'token_end': 203, 'label': 'REG'}
[('TOKEN', 'IRS')]
[('TOKEN', 'W-8BEN')]

and Swing Line Lenders named : {'start': 825, 'end': 842, 'token_start': 192, 'token_end': 194, 'label': 'ROLE'}
[('TOKEN', 'Swing')]
[('TOKEN', 'Lenders')]

Can you share an example that has the full text the span offsets are referring to? That's the interesting part because you want to be mapping the start and end character offsets back into the original text to see which exact string they describe. This way, you can find out what the problem is and whether it's an off-by-one error for example.

Misaligned tokens Metropolitan Edison Company, a Pennsylvania corporation ("Met-Ed"), Ohio Edison Company, an Ohio corporation ("OE"), Pennsylvania Power Company, a Pennsylvania corporation ("Penn"), The Toledo Edison Company, an Ohio corporation ("TE"), Jersey Central Power & Light Company, a New Jersey corporation ("JCP&L"), Monongahela Power Company, an Ohio corporation ("MP"), Pennsylvania Electric Company, a Pennsylvania corporation ("Penelec"), The Potomac Edison Company, a Maryland and Virginia corporation ("PE"), West Penn Power Company, a Pennsylvania corporation ("West-Penn", and together with FE, CEI, Met-Ed, OE, Penn, TE, JCP&L, MP, Penelec and PE, the "Borrowers" and each a "Borrower"), the Lenders named therein and party thereto from time to time, Mizuho Bank, Ltd., as Administrative Agent, and the Fronting Banks and Swing Line Lenders named therein and party thereto from time to time. Pursuant to the provisions of Section 2.16 of the Credit Agreement, the undersigned hereby certifies that (i) it is the sole record owner of the Advance(s) (as well as any Note(s) evidencing such Advance(s)) in respect of which it is providing this certificate {'start': 925, 'end': 936, 'token_start': 210, 'token_end': 211, 'label': 'REF'}

[('TOKEN', 'Section')]
[('TOKEN', '2.16')]

It looks like your span annotations somehow ended up being off? If you look at the annotated span at text[925:936], the text span it refers to is "the provisi", while the real span you're looking for, "Section 2.16", only starts at character 943.

How did you post-process your data? Maybe something went wrong here and you ended up with the text shifting in a way that made the original spans not align anymore?
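If the intended surface string of each span is still known (for example, because it was saved before post-processing), the size of the shift can be measured and the offsets moved back. The sketch below assumes exactly that; the expected argument and the relocate_span helper are hypothetical and not something Prodigy provides. It is not a general-purpose fix, and it won't help when the expected text is unknown or ambiguous.

```python
def relocate_span(text, span, expected):
    """Move a span onto the occurrence of `expected` nearest to its old start.

    Returns (fixed_span, shift), or None if `expected` is not in the text.
    `expected` is an assumption: the string the annotator meant to mark.
    """
    best = None
    pos = text.find(expected)
    while pos != -1:
        # Keep the occurrence closest to the recorded start offset.
        if best is None or abs(pos - span["start"]) < abs(best - span["start"]):
            best = pos
        pos = text.find(expected, pos + 1)
    if best is None:
        return None
    fixed = dict(span)
    fixed["start"] = best
    fixed["end"] = best + len(expected)
    # Token indices are stale after moving the span; drop them so they
    # can be recomputed by the tokenizer.
    fixed.pop("token_start", None)
    fixed.pop("token_end", None)
    return fixed, best - span["start"]


text = "Pursuant to the provisions of Section 2.16 of the Credit Agreement"
span = {"start": 12, "end": 23, "label": "REF"}
result = relocate_span(text, span, "Section 2.16")
print(result)
```

If many spans in the same document report the same shift, that points to a single upstream cause, such as characters being added or removed before the annotated region, and fixing the post-processing step is the safer route than patching offsets afterwards.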

Yep, that was it, Ines. Thank you!
