Hi @Fangjian,
You're definitely right in thinking that my previous answer was valid for spaCy pipelines only! That's why I asked at the end of it whether you're planning to train a spaCy pipeline: if not, then indeed you'll have to take care of the alignment yourself.
Btw, if you're working with transformer models, I do recommend checking out the spacy-transformers extension in case it's more convenient to stay in the spaCy environment. You can train any transformer model with spaCy, and there are convenient wrappers for HF transformer models - it's not limited to the built-in models.
In fact, there already exists a spaCy version of SciBERT in case it's of interest - I'm not sure it ships the full vocabulary, though.
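To give you an idea, here's a minimal sketch of plugging an arbitrary HF checkpoint into a blank spaCy pipeline with spacy-transformers (the SciBERT model name below is just an example - any Hugging Face model name should work):

```python
import spacy

# A minimal sketch, assuming spacy-transformers is installed; the model name
# is just an example of an HF checkpoint (here: SciBERT).
nlp = spacy.blank("en")
nlp.add_pipe(
    "transformer",
    config={
        "model": {
            "@architectures": "spacy-transformers.TransformerModel.v3",
            "name": "allenai/scibert_scivocab_uncased",
        }
    },
)
nlp.initialize()  # downloads the HF weights on first use
doc = nlp("Induction of tumor necrosis factor.")
print(doc._.trf_data)  # raw transformer output stored on the Doc
```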
To answer your question though, I think the easiest way to be able to use `ner.llm.fetch` output in `bert.ner.manual` would be to modify the spaCy `nlp` object in `ner.llm.fetch` to use the WordPiece tokenizer (or whatever tokenizer you need for your model).
In order to use a third-party tokenizer inside a spaCy pipeline, you'd first have to wrap it in a class that exposes the spaCy tokenizer API. The spaCy docs have an example of such a wrapper here. You should be able to adapt it to use the SciBERT tokenizer directly.
Then, you'd need to modify the `ner.llm.fetch` recipe so that it uses the tokenizer defined in the previous step. This could be as simple as adding the wrapper class to the recipe file and setting the tokenizer on the `nlp` object:
```python
# ner.llm.fetch
def llm_fetch_ner(
    config_path: Path,
    source: Union[str, Iterable[dict]],
    output: Union[str, Path],
    loader: Optional[str] = None,
    resume: bool = False,
    segment: bool = False,
    component: str = "llm",
    _extra: List[str] = [],  # cfg overrides
):
    """Get bulk zero- or few-shot suggestions with the aid of a large language model.

    This recipe allows you to get LLM queries upfront, which can help if you want
    multiple annotators or want to reduce the waiting time between API calls.
    """
    from tokenizers import BertWordPieceTokenizer
    from spacy.tokens import Doc

    class BertTokenizer:
        """Wrapper exposing the HF WordPiece tokenizer via the spaCy tokenizer API."""

        def __init__(self, vocab, vocab_file, lowercase=True):
            self.vocab = vocab
            self._tokenizer = BertWordPieceTokenizer(vocab_file, lowercase=lowercase)

        def __call__(self, text):
            tokens = self._tokenizer.encode(text)
            words = []
            spaces = []
            for i, (text, (start, end)) in enumerate(zip(tokens.tokens, tokens.offsets)):
                words.append(text)
                if i < len(tokens.tokens) - 1:
                    # If next start != current end we assume a space in between
                    next_start, next_end = tokens.offsets[i + 1]
                    spaces.append(next_start > end)
                else:
                    spaces.append(True)
            return Doc(self.vocab, words=words, spaces=spaces)

    log("RECIPE: Starting recipe ner.llm.fetch", locals())
    config_overrides = parse_config_overrides(list(_extra)) if _extra else {}
    config_overrides[f"components.{component}.save_io"] = True
    nlp = assemble(config_path, overrides=config_overrides)
    nlp.tokenizer = BertTokenizer(nlp.vocab, "bert-base-uncased-vocab.txt")  # overwrite the tokenizer
    (...)
```
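If you want to sanity-check the override, you can exercise the wrapper on its own, mirroring the example from the spaCy docs (this assumes the `BertTokenizer` class from the snippet above and a local `bert-base-uncased-vocab.txt`; the output shown is illustrative):

```python
import spacy

# Quick sanity check: with the tokenizer overridden, the pipeline should now
# produce WordPiece tokens, including the [CLS]/[SEP] special tokens.
nlp = spacy.blank("en")
nlp.tokenizer = BertTokenizer(nlp.vocab, "bert-base-uncased-vocab.txt")
doc = nlp("Justin Drew Bieber is a Canadian singer.")
print([token.text for token in doc])
# e.g. ['[CLS]', 'justin', 'drew', 'bi', '##eber', 'is', 'a',
#       'canadian', 'singer', '.', '[SEP]']
```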
Please note that the path to the vocabulary file `bert-base-uncased-vocab.txt` is hardcoded here.
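If you don't have the vocabulary file locally yet, one way to get it is via the `transformers` package (an assumption on my side that you have it installed):

```python
# A small sketch for fetching the vocab file locally; swap in your own model
# name if you're not using bert-base-uncased.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
tokenizer.save_vocabulary(".")  # writes ./vocab.txt; point the recipe at this file
```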
Now, when you run this modified `ner.llm.fetch`, it should produce BERT tokens, and spans that are already aligned with these tokens.
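Concretely, each task in the output JSONL should roughly have this shape, with the spans carrying `token_start`/`token_end` indices into the WordPiece tokens (the values below are purely illustrative, and the `[CLS]`/`[SEP]` tokens the wrapper produces are omitted for brevity):

```python
# Purely illustrative: the shape of a fetched task, with spans aligned to the
# WordPiece tokens via token_start/token_end (offsets and label are made up).
task = {
    "text": "tumor necrosis factor",
    "tokens": [
        {"text": "tumor", "start": 0, "end": 5, "id": 0},
        {"text": "ne", "start": 6, "end": 8, "id": 1},
        {"text": "##cro", "start": 8, "end": 11, "id": 2},
        {"text": "##sis", "start": 11, "end": 14, "id": 3},
        {"text": "factor", "start": 15, "end": 21, "id": 4},
    ],
    "spans": [
        {"start": 0, "end": 21, "token_start": 0, "token_end": 4, "label": "PROTEIN"},
    ],
}
```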
You should be able to use `bert.ner.manual` with the file produced by the modified `ner.llm.fetch` directly. Since `bert.ner.manual` retokenizes the text, you need to make sure the settings of the tokenizer are exactly the same as in your modified `ner.llm.fetch`, e.g. the `lowercase` setting. If you don't care about hiding the special symbols, you can even disable the `add_tokens` function of `bert.ner.manual` by commenting out line 94, because your input file will have the tokens and spans already in the BERT format.
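For completeness, the final invocation could then look something like `prodigy bert.ner.manual your_dataset ./llm-output.jsonl --label YOUR_LABEL --tokenizer-vocab bert-base-uncased-vocab.txt --lowercase -F bert_ner_manual.py` - the flags follow the `bert.ner.manual` example from the Prodigy docs, and the dataset, label and file names are placeholders for your setup.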
Finally, the source code for the `ner.llm.fetch` recipe is available at `your_prodigy_installation_path/recipes/llm/ner.py`. You can double-check your Prodigy installation path by running the `prodigy stats` command.