Is it possible to let a model learn segmentation?

ines · January 7, 2019, 10:32pm

What are you using as the base model? CJK languages definitely require different tokenization, since a "word" is not defined as a whitespace-delimited unit. spaCy currently supports Chinese and Japanese via third-party libraries (see here), so you can use those language classes as the base model. See this thread for more details:

If you use a base model that supports tokenization for the given language, you'll be able to annotate the tokens accordingly. (This is also one of the reasons Prodigy always asks for a base model – it lets you supply language-specific or even your own custom tokenization rules.)
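To illustrate why the tokenization matters for annotation, here is a minimal plain-Python sketch. It does not use spaCy or Prodigy, and the helper names (`whitespace_tokens`, `char_tokens`, `span_aligns`) are made up for this example. It only checks whether an entity string lines up with token boundaries under two simple tokenization schemes.

```python
def whitespace_tokens(text):
    # Naive tokenization: split on whitespace only.
    return text.split()


def char_tokens(text):
    # Fallback tokenization: every non-whitespace character is a token.
    return [ch for ch in text if not ch.isspace()]


def span_aligns(tokens, target):
    # True if `target` equals a run of consecutive tokens joined together.
    for start in range(len(tokens)):
        joined = ""
        for tok in tokens[start:]:
            joined += tok
            if joined == target:
                return True
            if len(joined) >= len(target):
                break
    return False


text = "我住在北京市"   # "I live in Beijing"
entity = "北京市"       # the span we would like to highlight

ws = whitespace_tokens(text)
chars = char_tokens(text)

print(ws, span_aligns(ws, entity))
print(chars, span_aligns(chars, entity))
```

With whitespace splitting the whole sentence is one token, so the entity can't be selected as a token-aligned span. With character tokens it can.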

Btw, speaking of learning tokenization: In the upcoming version of spaCy, the parser will be able to learn merging tokens, which will be very useful for training CJK models. The rule-based tokenizer then only needs to split the text into individual characters, and the model predicts which adjacent characters should be merged into one token. Depending on the data, this can improve accuracy by a lot.
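The split-then-merge idea can be sketched roughly like this in plain Python. This is not spaCy's implementation: the function names are hypothetical, and the hard-coded `decisions` list just stands in for what a trained model would predict for each pair of adjacent characters.

```python
def split_characters(text):
    # Rule-based step: one token per non-whitespace character.
    return [ch for ch in text if not ch.isspace()]


def merge_tokens(chars, merge_with_next):
    # merge_with_next[i] is True if chars[i] and chars[i + 1]
    # belong to the same token (one decision per adjacent pair).
    expected = max(len(chars) - 1, 0)
    if len(merge_with_next) != expected:
        raise ValueError("need exactly one decision per adjacent character pair")
    tokens = []
    current = ""
    for i, ch in enumerate(chars):
        current += ch
        is_last = i == len(chars) - 1
        if is_last or not merge_with_next[i]:
            tokens.append(current)
            current = ""
    return tokens


chars = split_characters("我喜欢自然语言处理")

# Assumed stand-in for model predictions (hand-written for this example).
decisions = [False, True, False, True, True, True, False, True]

tokens = merge_tokens(chars, decisions)
print(tokens)  # ['我', '喜欢', '自然语言', '处理']
```

The nice part of framing it this way is that the rule-based step can't make segmentation mistakes, since it never commits to word boundaries. All of the segmentation decisions become something the model can learn from data.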
