Character-based matching

I use Prodigy to annotate Chinese sentences. Does the token Matcher support character-based matching?

Hi! If you're using --patterns (powered by spaCy's Matcher under the hood), the spans here will refer to tokens produced by the model's tokenizer, so you can't match arbitrary substrings.

If you're annotating with ner.manual and --highlight-chars, you can select individual characters and pre-defined spans don't have to map to tokens. So you can add "spans" to your incoming examples that refer to characters – you'd just have to implement the matching yourself, e.g. using regex. Each span you add needs a start and end offset and a label – you can see an example of the format here.
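For instance, a pre-annotated incoming example with a character-based span might look like this (the text and label are just made-up illustrations):

```python
# One incoming example with a character-based span: "start" and "end" are
# character offsets into "text", not token indices
example = {
    "text": "我喜欢苹果",  # "I like apples"
    "spans": [{"start": 3, "end": 5, "label": "FRUIT"}],  # covers "苹果"
}

print(example["text"][3:5])  # 苹果
```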

In your custom recipe, you could then do something like this:

from prodigy.components.preprocess import add_tokens

# Your custom regex etc. that adds matched character spans
stream = add_your_character_spans(stream)
# Add "tokens" key to the tasks, either with words or characters
stream = add_tokens(nlp, stream, use_chars=highlight_chars)
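A minimal sketch of what add_your_character_spans could look like (the function name comes from the snippet above; the regex pattern and label here are made-up placeholders for your own match rules):

```python
import re

def add_your_character_spans(stream, pattern=r"苹果|香蕉", label="FRUIT"):
    # For each incoming task, attach character-offset spans for every
    # regex match found in the text
    for task in stream:
        task["spans"] = [
            {"start": m.start(), "end": m.end(), "label": label}
            for m in re.finditer(pattern, task["text"])
        ]
        yield task

tasks = list(add_your_character_spans([{"text": "我喜欢苹果"}]))
print(tasks[0]["spans"])  # [{'start': 3, 'end': 5, 'label': 'FRUIT'}]
```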

Following your guidance, I solved the problem in my custom recipe by writing the add_your_character_spans function myself. Thanks a lot for your help.

Now I have a further question: can I customize spaCy's Tokenizer class so that I can use --patterns (with the token Matcher) to get the same result?

If so, that would be convenient for sentences that contain both Chinese and English: English could be tokenized word-based and Chinese character-based within the same sentence.

As long as you just want to split off all Chinese characters (rather than doing any kind of real word segmentation), I think you can modify the English tokenizer settings to do this.

This will treat any character in the predefined CJK character class as a single token by matching it as a prefix, suffix, and infix:

import spacy
from spacy.lang.char_classes import _cjk

nlp = spacy.blank("en")
# alternative: some English model with existing components/vectors
# nlp = spacy.load("en_core_web_sm")

# Add a single-CJK-character pattern to the prefix, suffix, and infix rules
prefixes = list(nlp.Defaults.prefixes) + [r"[{c}]".format(c=_cjk)]
suffixes = list(nlp.Defaults.suffixes) + [r"[{c}]".format(c=_cjk)]
infixes = list(nlp.Defaults.infixes) + [r"[{c}]".format(c=_cjk)]
nlp.tokenizer.prefix_search = spacy.util.compile_prefix_regex(prefixes).search
nlp.tokenizer.suffix_search = spacy.util.compile_suffix_regex(suffixes).search
nlp.tokenizer.infix_finditer = spacy.util.compile_infix_regex(infixes).finditer

# Save the customized pipeline so it can be used as a base model
nlp.to_disk("/path/to/en_zhchar_model")
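As a quick sanity check (assuming spaCy is installed; the mixed string is just an illustration), each CJK character should come out as its own token while Latin-script words stay whole:

```python
import spacy
from spacy.lang.char_classes import _cjk

# Same tokenizer customization as above
nlp = spacy.blank("en")
char_pattern = r"[{c}]".format(c=_cjk)
nlp.tokenizer.prefix_search = spacy.util.compile_prefix_regex(
    list(nlp.Defaults.prefixes) + [char_pattern]).search
nlp.tokenizer.suffix_search = spacy.util.compile_suffix_regex(
    list(nlp.Defaults.suffixes) + [char_pattern]).search
nlp.tokenizer.infix_finditer = spacy.util.compile_infix_regex(
    list(nlp.Defaults.infixes) + [char_pattern]).finditer

print([t.text for t in nlp("ABC你好DEF")])  # ['ABC', '你', '好', 'DEF']
```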


Then use /path/to/en_zhchar_model as the base model when calling Prodigy. You can also provide a different character range here if the _cjk one isn't quite what you want.

And for the tokenizer to handle the English tokens as it would in plain English text (mainly in terms of the tokenizer exceptions), they need to be separated from the Chinese characters by whitespace. If you have something like can't without separating whitespace, it might not get split into ca n't, depending on the context.

To do this more consistently, you would want to identify the spans in each writing system and tokenize them separately. This is totally possible if you'd like to do it, but would require a bit more effort than just modifying the tokenizer as above.
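As a hedged sketch of the first step, you could split the text into runs per writing system before tokenizing each run with the appropriate rules (script_spans is a made-up helper, and the CJK range here is simplified compared to spaCy's _cjk class):

```python
import re

# Simplified CJK range; spaCy's _cjk character class covers more blocks
CJK_RUN = re.compile(r"[\u4e00-\u9fff]+")

def script_spans(text):
    """Split text into ("cjk", ...) and ("other", ...) runs."""
    spans, pos = [], 0
    for m in CJK_RUN.finditer(text):
        if m.start() > pos:
            spans.append(("other", text[pos:m.start()]))
        spans.append(("cjk", m.group()))
        pos = m.end()
    if pos < len(text):
        spans.append(("other", text[pos:]))
    return spans

print(script_spans("I can't eat苹果today"))
# [('other', "I can't eat"), ('cjk', '苹果'), ('other', 'today')]
```

Each "other" run could then go through the regular English tokenizer (so exceptions like can't work), and each "cjk" run could be split into characters.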