I'm not sure how I can label just the money amount without the mistaken prefix.
What about manually correcting the text back in the JSONL file? Is that too clunky of an approach?
I'd recommend modifying your tokenizer so that "of$23,318.20" gets tokenized to ["of", "$", "23,318.20"], which is not what the default English tokenizer does.
import spacy
from spacy.util import compile_infix_regex

def create_custom_tokenizer(nlp):
    # Add "$" to the infix patterns so tokens are split on a "$"
    # appearing in the middle of a string like "of$23,318.20"
    infixes = nlp.Defaults.infixes + (r"[$]",)
    tokenizer = nlp.Defaults.create_tokenizer(nlp)
    tokenizer.infix_finditer = compile_infix_regex(infixes).finditer
    return tokenizer

nlp = spacy.load("en_core_web_sm")  # or whichever model you're using
nlp.tokenizer = create_custom_tokenizer(nlp)
assert [t.text for t in nlp("of$23,318.20")] == ["of", "$", "23,318.20"]
Then you can save the model with nlp.to_disk and use it later with the custom tokenizer included.
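To sketch the save/load round trip: the tokenizer's prefix/suffix/infix patterns are serialized along with the rest of the pipeline, so the custom infix survives nlp.to_disk and spacy.load. This is a minimal sketch using a blank English pipeline and setting infix_finditer directly; the directory name is just an example.

```python
import spacy
from spacy.util import compile_infix_regex

nlp = spacy.blank("en")
# Add "$" to the infix patterns so it splits inside tokens like "of$23,318.20"
infixes = list(nlp.Defaults.infixes) + [r"[$]"]
nlp.tokenizer.infix_finditer = compile_infix_regex(infixes).finditer

# Save the pipeline, then load it back: the custom infix comes along
nlp.to_disk("custom_tokenizer_model")  # example path
nlp2 = spacy.load("custom_tokenizer_model")
assert [t.text for t in nlp2("of$23,318.20")] == ["of", "$", "23,318.20"]
```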
Note I am not from Explosion AI so don't take this as an "official answer".
Yes, that's exactly what I would have recommended. If the entities you want to extract don't map to valid token boundaries, your model wouldn't learn anything from those annotations anyway, even if you were able to create them and highlight partial entities. So if this is a common occurrence, you want to make sure your tokenization rules produce the tokens you need.
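You can see the token-boundary constraint directly with Doc.char_span, which returns None when a character span doesn't line up with token boundaries. A minimal sketch, assuming a blank English pipeline with the default tokenization rules:

```python
import spacy

nlp = spacy.blank("en")  # default English tokenization rules
doc = nlp("a payment of$23,318.20 was made")

# With the default rules "of$23,318.20" stays a single token, so a
# character span covering only "$23,318.20" has no valid token
# boundaries and char_span returns None: the annotation can't be made.
span = doc.char_span(12, 22, label="MONEY")
print(span)  # None
```

This is exactly why fixing the tokenizer, rather than the annotations, is the right place to solve it.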