I use Prodigy to annotate Chinese sentences. Does the Token Matcher support character-based matching?
Hi! If you're using --patterns (powered by spaCy's Matcher under the hood), the spans here will refer to tokens produced by the model/tokenizer, so you can't match substrings.
If you're annotating with ner.manual and --highlight-chars, you can select individual characters, and pre-defined spans don't have to map to tokens. So you can add "spans" to your incoming examples that refer to characters – you'd just have to implement the matching yourself, e.g. using regex. Each span you add needs a start and end offset and a label – you can see an example of the format here.
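For instance, a character-based span on an incoming task could look roughly like this (start and end are character offsets into the task's "text", and the label is just an illustration):

{"text": "我爱北京", "spans": [{"start": 2, "end": 4, "label": "GPE"}]}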
In your custom recipe, you could then do something like this:
from prodigy.components.preprocess import add_tokens

# Your custom regex etc. that adds matched character spans
stream = add_your_character_spans(stream)
# Add "tokens" key to the tasks, either with words or characters
stream = add_tokens(nlp, stream, use_chars=highlight_chars)
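For illustration, a minimal sketch of what add_your_character_spans could look like (the PATTERNS mapping and the regex are placeholders, not anything Prodigy ships with):

import re

# Hypothetical label-to-regex mapping, adapt this to your own matching logic
PATTERNS = {"DATE": re.compile(r"\d{4}年")}

def add_your_character_spans(stream):
    for eg in stream:
        spans = []
        for label, pattern in PATTERNS.items():
            for match in pattern.finditer(eg["text"]):
                # start/end are character offsets into the task text
                spans.append({"start": match.start(), "end": match.end(), "label": label})
        eg["spans"] = spans
        yield eg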
Following your guidance, I solved the problem in my custom recipe by writing my own add_your_character_spans function. Thanks a lot for your help.
Now I have a further question: can I customize spaCy's Tokenizer class so that I can use --patterns (with the Token Matcher) to get the same result?
That would be convenient when a sentence contains both Chinese and English, because English could then be tokenized word by word and Chinese character by character within the same sentence.
As long as you just want to split off all Chinese characters (rather than doing any kind of real word segmentation), I think you can modify the English tokenizer settings to do this.
The following treats any character in the predefined CJK character class as a single token by adding it to the prefix, suffix, and infix patterns:
import spacy
from spacy.lang.char_classes import _cjk
nlp = spacy.blank("en")
# alternative: some English model with existing components/vectors
# nlp = spacy.load("en_core_web_sm")
# Add single CJK characters to the prefix, suffix and infix patterns
prefixes = list(nlp.Defaults.prefixes) + [r"[{c}]".format(c=_cjk)]
suffixes = list(nlp.Defaults.suffixes) + [r"[{c}]".format(c=_cjk)]
infixes = list(nlp.Defaults.infixes) + [r"[{c}]".format(c=_cjk)]
nlp.tokenizer.prefix_search = spacy.util.compile_prefix_regex(prefixes).search
nlp.tokenizer.suffix_search = spacy.util.compile_suffix_regex(suffixes).search
nlp.tokenizer.infix_finditer = spacy.util.compile_infix_regex(infixes).finditer
# Save the customized pipeline to use as the base model
nlp.to_disk("/path/to/en_zhchar_model")
Then use /path/to/en_zhchar_model as the base model when calling Prodigy. You can also provide a different character range here if the _cjk one isn't quite what you want.
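For example, if you only care about CJK Unified Ideographs, a minimal variation would be (the \u4e00-\u9fff range is just one possible choice):

# Use an explicit character range instead of the predefined _cjk class
custom_chars = r"\u4e00-\u9fff"
prefixes = list(nlp.Defaults.prefixes) + [r"[{c}]".format(c=custom_chars)]
suffixes = list(nlp.Defaults.suffixes) + [r"[{c}]".format(c=custom_chars)]
infixes = list(nlp.Defaults.infixes) + [r"[{c}]".format(c=custom_chars)]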
And for the tokenizer to handle the English tokens like it would in plain English text (mainly in terms of the exceptions), they need to be separated from the Chinese characters by whitespace. If you have something like can't without separating whitespace, it might not get split into ca n't, depending on the context.
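As a quick sanity check (assuming the modified model from above), you can inspect how a mixed sentence is tokenized:

doc = nlp("I can't 读中文 today")
# With whitespace around the Chinese span, the English exceptions still apply,
# so you'd expect something like: I, ca, n't, 读, 中, 文, today
print([t.text for t in doc])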
To do this more consistently, you would want to identify the spans in each writing system and tokenize them separately. This is totally possible if you'd like to do it, but would require a bit more effort than just modifying the tokenizer as above.
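A rough sketch of that idea, assuming a simple regex-based split into CJK and non-CJK runs (the character range and helper name are placeholders):

import re

# Hypothetical helper: split text into CJK and non-CJK runs so that each run
# can be tokenized separately. A real implementation would also need to keep
# track of character offsets to map tokens back to the original text.
CJK_RUN = re.compile(r"[\u4e00-\u9fff]+")

def split_by_script(text):
    runs = []
    last = 0
    for match in CJK_RUN.finditer(text):
        if match.start() > last:
            runs.append(("latin", text[last:match.start()]))
        runs.append(("cjk", match.group()))
        last = match.end()
    if last < len(text):
        runs.append(("latin", text[last:]))
    return runs

print(split_by_script("I can't 读中文 today"))
# [('latin', "I can't "), ('cjk', '读中文'), ('latin', ' today')]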