Alignment of NER tokens when creating suggestions using Transformers

ENV

pandas==1.4.2
transformers==4.17.0
spacy==3.2.4
spacy-alignments==0.8.5
spacy-legacy==3.0.9
spacy-loggers==1.0.2
spacy-sentence-bert==0.1.2
spacy-transformers==1.1.5
cupy-cuda113==10.5.0

I am using the bert.ner.manual recipe.

@prodigy.recipe(
    "bert.ner.manual",
    # fmt: off
    dataset=("Dataset to save annotations to", "positional", None, str),
    source=("Data to annotate (file path or '-' to read from standard input)", "positional", None, str),
    loader=("Loader (guessed from file extension if not set)", "option", "lo", str),
    label=("Comma-separated label(s) to annotate or text file with one label per line", "option", "l", get_labels),
    tokenizer_vocab=("Tokenizer vocab file", "option", "tv", str),
    lowercase=("Set lowercase=True for tokenizer", "flag", "LC", bool),
    hide_special=("Hide SEP and CLS tokens visually", "flag", "HS", bool),
    hide_wp_prefix=("Hide wordpieces prefix like ##", "flag", "HW", bool),
    suggest_model=("Model to predict labels", "option", "sm", str)
    # fmt: on
)
    def add_tokens(stream):
        for eg in stream:
            eg_tokens = BertTokenizer(eg)
            eg["tokens"] = eg_tokens
            yield eg

    def BertTokenizer(eg):
        tokens = tokenizer.encode(eg["text"])
        eg_tokens = []
        idx = 0
        for (text, (start, end), tid) in zip(
                tokens.tokens, tokens.offsets, tokens.ids
            ):
                # If we don't want to see special tokens, don't add them
            if hide_special and text in special_tokens:
                continue
                # If we want to strip out word piece prefix, remove it from text
            if hide_wp_prefix and wp_prefix is not None:
                if text.startswith(wp_prefix):
                    text = text[len(wp_prefix) :]
            token = {
                    "text": text,
                    "id": idx,
                    "start": start,
                    "end": end,
                    # This is the encoded ID returned by the tokenizer
                    "tokenizer_id": tid,
                    # Don't allow selecting special SEP/CLS tokens
                    "disabled": text in special_tokens,
                }
            eg_tokens.append(token)
            idx += 1
        for i, token in enumerate(eg_tokens):
                # If the next start offset != the current end offset, we
                # assume there's whitespace in between
            if i < len(eg_tokens) - 1 and token["text"] not in special_tokens:
                next_token = eg_tokens[i + 1]
                token["ws"] = (
                        next_token["start"] > token["end"]
                        or next_token["text"] in special_tokens
                    )
            else:
                token["ws"] = True
        
        return eg_tokens

With this make_tasks function:

def make_tasks(nlp, stream, labels):
        """Add a 'spans' key to each example, with predicted entities."""
        # Process the stream using spaCy's nlp.pipe, which yields doc objects.
        # If as_tuples=True is set, you can pass in (text, context) tuples.
        texts = ((eg["text"], eg) for eg in stream)
        for doc, eg in nlp.pipe(texts, as_tuples=True):
            task = copy.deepcopy(eg)
            spans = []
            
            for ent in doc.ents:

                # Continue if predicted entity is not selected in labels
                if labels and ent.label_ not in labels:
                    continue

                # Create span dict for the predicted entity
                try:                    
                    if len(doc._.trf_data.align[ent.start].data) == 0:
                        continue
                    
                    spans.append(
                        {
                            "token_start": int(doc._.trf_data.align[ent.start].data[0][0]),
                            "token_end": int(doc._.trf_data.align[ent.end-1].data[-1][0]),
                            "start": ent.start_char,
                            "end": ent.end_char,
                            "text": ent.text,
                            "label": ent.label_,
                        }
                    )
                except Exception as e:
                    print(e)
                    #import code; code.interact(local=locals())
                    raise e

            task["spans"] = spans            
            # Rehash the newly created task so that hashes reflect added data
            task = set_hashes(task)
            yield task

It seems like the alignment is off. Each labeled token appears to be shifted about two characters to the right.

I think it has something to do with the word piece prefix, but I'm drawing a blank trying to figure out how to adjust for it.

Do you have an example so that we may be able to reproduce the error locally? Also, is this the standard recipe described here?

It's based on the standard recipe, yes. I'll put together an example for y'all.

You mention that it's "based" on the standard recipe, could you elaborate on the differences and why you added them?

I don't mind sharing the whole thing. The only things I added were a dedup on the stream and the suggest_model parts. Dedup because I only want to see each example once, ever, and suggest_model to use a pretrained model to hint at annotations and speed up manual annotation, so you can skip the examples it got right.

"""This recipe requires Prodigy v1.10+."""
from logging import exception
import random
import prodigy
from typing import List, Optional, Union, Iterable, Dict, Any
from regex import E
from tokenizers import BertWordPieceTokenizer
from prodigy.components.loaders import get_stream
from prodigy.util import get_labels
from prodigy.components.filters import filter_inputs
from prodigy.components.db import connect
from prodigy.components.preprocess import add_tokens
from prodigy.util import split_string, set_hashes
from spacy.util import registry, compile_suffix_regex
import copy
import spacy
from spacy.training import Example
import code
import pickle
from spacy.tokens import Doc

@prodigy.recipe(
    "bert.ner.manual",
    # fmt: off
    dataset=("Dataset to save annotations to", "positional", None, str),
    source=("Data to annotate (file path or '-' to read from standard input)", "positional", None, str),
    loader=("Loader (guessed from file extension if not set)", "option", "lo", str),
    label=("Comma-separated label(s) to annotate or text file with one label per line", "option", "l", get_labels),
    tokenizer_vocab=("Tokenizer vocab file", "option", "tv", str),
    lowercase=("Set lowercase=True for tokenizer", "flag", "LC", bool),
    hide_special=("Hide SEP and CLS tokens visually", "flag", "HS", bool),
    hide_wp_prefix=("Hide wordpieces prefix like ##", "flag", "HW", bool),
    suggest_model=("Model to predict labels", "option", "sm", str)
    # fmt: on
)
def ner_manual_tokenizers_bert(
    dataset: str,
    source: Union[str, Iterable[dict]],
    loader: Optional[str] = None,
    label: Optional[List[str]] = None,
    tokenizer_vocab: Optional[str] = None,
    lowercase: bool = False,
    hide_special: bool = False,
    hide_wp_prefix: bool = False,
    suggest_model: str = None,
) -> Dict[str, Any]:
    """Example recipe that shows how to use model-specific tokenizers like the
    BERT word piece tokenizer to preprocess your incoming text for fast and
    efficient NER annotation and to make sure that all annotations you collect
    always map to tokens and can be used to train and fine-tune your model
    (even if the tokenization isn't that intuitive, because word pieces). The
    selection automatically snaps to the token boundaries and you can double-click
    single tokens to select them.
    Setting "honor_token_whitespace": true will ensure that whitespace between
    tokens is only shown if whitespace is present in the original text. This
    keeps the text readable.
    Requires Prodigy v1.10+ and uses the HuggingFace tokenizers library."""
    stream = get_stream(source, loader=loader, input_key="text")

    input_hashes = connect().get_input_hashes(dataset)

    # You can replace this with other tokenizers if needed
    tokenizer = BertWordPieceTokenizer(tokenizer_vocab, lowercase=lowercase)
    sep_token = tokenizer._parameters.get("sep_token")
    cls_token = tokenizer._parameters.get("cls_token")
    special_tokens = (sep_token, cls_token)
    wp_prefix = tokenizer._parameters.get("wordpieces_prefix")

    def add_tokens(stream):
        for eg in stream:
            eg_tokens = BertTokenizer(eg)
            eg["tokens"] = eg_tokens
            yield eg

    def BertTokenizer(eg):
        tokens = tokenizer.encode(eg["text"])
        eg_tokens = []
        idx = 0
        for (text, (start, end), tid) in zip(
                tokens.tokens, tokens.offsets, tokens.ids
            ):
                # If we don't want to see special tokens, don't add them
            if hide_special and text in special_tokens:
                continue
                # If we want to strip out word piece prefix, remove it from text
            if hide_wp_prefix and wp_prefix is not None:
                if text.startswith(wp_prefix):
                    text = text[len(wp_prefix) :]
            token = {
                    "text": text,
                    "id": idx,
                    "start": start,
                    "end": end,
                    # This is the encoded ID returned by the tokenizer
                    "tokenizer_id": tid,
                    # Don't allow selecting special SEP/CLS tokens
                    "disabled": text in special_tokens,
                }
            eg_tokens.append(token)
            idx += 1
        for i, token in enumerate(eg_tokens):
                # If the next start offset != the current end offset, we
                # assume there's whitespace in between
            if i < len(eg_tokens) - 1 and token["text"] not in special_tokens:
                next_token = eg_tokens[i + 1]
                token["ws"] = (
                        next_token["start"] > token["end"]
                        or next_token["text"] in special_tokens
                    )
            else:
                token["ws"] = True
        
        return eg_tokens

    def make_tasks(nlp, stream, labels):
        """Add a 'spans' key to each example, with predicted entities."""
        # Process the stream using spaCy's nlp.pipe, which yields doc objects.
        # If as_tuples=True is set, you can pass in (text, context) tuples.
        texts = ((eg["text"], eg) for eg in stream)
        for doc, eg in nlp.pipe(texts, as_tuples=True):
            task = copy.deepcopy(eg)
            spans = []

            for ent in doc.ents:

                # Continue if predicted entity is not selected in labels
                if labels and ent.label_ not in labels:
                    continue

                # Create span dict for the predicted entity
                try:                    
                    if len(doc._.trf_data.align[ent.start].data) == 0:
                        continue
                    # import code; code.interact(local=locals())
                    spans.append(
                        {
                            "token_start": int(doc._.trf_data.align[ent.start].data[0][0]),
                            "token_end": int(doc._.trf_data.align[ent.end-1].data[-1][0]),
                            "start": ent.start_char,
                            "end": ent.end_char,
                            "text": ent.text,
                            "label": ent.label_,
                        }
                    )
                except Exception as e:
                    print(e)
                    #import code; code.interact(local=locals())
                    raise e

            task["spans"] = spans            
            # Rehash the newly created task so that hashes reflect added data
            task = set_hashes(task)
            yield task

    def dedup(stream):
        for eg in stream:
            eg = prodigy.set_hashes(eg)
            if eg["_input_hash"] not in input_hashes:
                yield eg

    stream = add_tokens(stream)
    stream = dedup(stream)

    if suggest_model:
        spacy.prefer_gpu()
        nlp = spacy.load(suggest_model)
        stream = make_tasks(nlp, stream, label)
    
    return {
        "dataset": dataset,
        "stream": stream,
        "view_id": "ner_manual",
        "config": {
            "honor_token_whitespace": True,
            "labels": label,
            "exclude_by": "input",   
            "force_stream_order": True,
     
        }
    }

I bet my logic is wrong for any labels using any model pretrained with said labels.

test inputs

{"text":"Will see matt in two weeks", "meta":{"source":["database"]}}
{"text":"William ran to the store to feltch a pale of water", "meta":{"source":["database"]}}

annotate

prodigy bert.ner.manual matts_ds docs.jsonl --label NAME --tokenizer-vocab vocab.txt --hide-wp-prefix --lowercase -F transformers_tokenizers.py

train

prodigy train ./ner_model --ner matts_ds --config config.cfg --gpu-id=1 --eval-split 0.1

annotate with suggestions

prodigy bert.ner.manual matts_ds new_docs.jsonl --label NAME --tokenizer-vocab vocab.txt --hide-wp-prefix --lowercase -F transformers_tokenizers.py --suggest-model ner_model/model-best

Could you also attach the source of your vocab.txt? Without it, I won't be able to reproduce locally.

Sorry it took so long. The file is big, so I provided a direct link to the vocab.txt. The problem is kind of intermittent too; it seems to be related to special characters, or something else.

hi @washcycle!

Thanks for the details!

Vincent is off on parental duties but I can try to debug.

I was able to reproduce your problem -- thanks for the code and files. Very helpful!

One initial thought on a potential cause is a known issue with the ner_manual interface. Per the documentation, the span token_end values in Prodigy and spaCy differ by one token.

Note that the "token_end" value of the spans is inclusive and not exclusive (like spaCy’s token indices for Span objects or list indices in Python). So a span with start 5 and end 6 will include the tokens 5 and 6 and the token span in spaCy would be doc[token_start : token_end + 1] . We’re hoping to make this consistent in the future, but it’d be a breaking change and require a new version of Prodigy’s data format.

One way I've seen to convert the spans from ner_manual to spaCy is to use the Prodigy command data-to-spacy, which takes the Prodigy spans (inclusive range) and converts them to the spaCy span format (exclusive range).
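To make the inclusive/exclusive difference concrete, here is a minimal sketch in plain Python. The helper names are hypothetical (they are not part of Prodigy or spaCy), and the token list is written by hand for illustration only.

```python
def prodigy_span_to_slice(span):
    """Convert a Prodigy span (inclusive token_end) to a Python-style
    (start, end) pair with an exclusive end, usable as doc[start:end]."""
    return span["token_start"], span["token_end"] + 1


def slice_to_prodigy_token_range(start, end):
    """Convert an exclusive-end (start, end) pair, such as a spaCy
    ent.start / ent.end, to Prodigy's inclusive token_start / token_end."""
    return {"token_start": start, "token_end": end - 1}


# Hand-written token list, for illustration only
tokens = ["will", "see", "matt", "in", "two", "weeks"]

# A Prodigy span covering tokens 4 and 5 ("two weeks")
span = {"token_start": 4, "token_end": 5}
start, end = prodigy_span_to_slice(span)
selected = tokens[start:end]

# Round trip back to the Prodigy convention
round_trip = slice_to_prodigy_token_range(start, end)
```

Note that this only covers the inclusive-versus-exclusive convention. It does not by itself explain a shift at the start of a span.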

I'll keep playing around with it some more, but wanted to mention this as it may be a factor and/or help.
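In the meantime, one thing that might be worth trying (an untested idea, not a confirmed fix) is to derive token_start and token_end from the character offsets instead of from doc._.trf_data.align. The tokens the recipe adds to each example already carry "start", "end", "id" and "disabled" keys, so the entity's start_char and end_char can be matched against them directly. That way the indices always refer to the tokens Prodigy actually displays, whether or not the special tokens are hidden. The helper name below is hypothetical, and the token dicts (including the word piece split of "matt") are written by hand for illustration.

```python
def char_span_to_token_range(tokens, start_char, end_char):
    """Return inclusive (token_start, token_end) ids of the tokens that
    overlap the character span [start_char, end_char), or None if no
    token overlaps. Disabled (special) tokens are ignored."""
    covered = [
        t["id"]
        for t in tokens
        if not t.get("disabled")
        and t["start"] < end_char
        and t["end"] > start_char
    ]
    if not covered:
        return None
    return covered[0], covered[-1]


# Hand-written tokens for "Will see matt in two weeks" (illustrative only;
# the real split depends on the vocab file)
tokens = [
    {"text": "[CLS]", "id": 0, "start": 0, "end": 0, "disabled": True},
    {"text": "will", "id": 1, "start": 0, "end": 4, "disabled": False},
    {"text": "see", "id": 2, "start": 5, "end": 8, "disabled": False},
    {"text": "mat", "id": 3, "start": 9, "end": 12, "disabled": False},
    {"text": "t", "id": 4, "start": 12, "end": 13, "disabled": False},
    {"text": "in", "id": 5, "start": 14, "end": 16, "disabled": False},
    {"text": "two", "id": 6, "start": 17, "end": 20, "disabled": False},
    {"text": "weeks", "id": 7, "start": 21, "end": 26, "disabled": False},
    {"text": "[SEP]", "id": 8, "start": 0, "end": 0, "disabled": True},
]

# Character span of "matt" in the text is 9..13
matt_range = char_span_to_token_range(tokens, 9, 13)
```

Inside make_tasks, the result could replace the two trf_data.align lookups, using ent.start_char and ent.end_char as input and skipping the entity when the helper returns None.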