Add tokenization rule

Hi! I am trying to train an NER model using Prodigy for annotations, but I have a problem during the tokenization stage. I am using the scispacy tokenizer.

I am trying to detect kinetic parameters in scientific text, which typically look like the following:
text = "The AUC0-24h was 0.33. The AUC(0-inf) was 0.15."

AUC0-24h and AUC(0-inf) are both named entities that I would like to recognise. However, during the tokenization step, they are both divided into:

["AUC0", "-", "24h"] and ["AUC(0-inf", ")"]

Very often we will find other variations such as AUC0-12 or AUC(0-12). So I was trying to modify the tokenizer such that:

(1) Whenever it finds "AUC0-", it matches until the next whitespace and treats the whole block AUC0-? as a single token.

(2) Whenever it finds "AUC(", it matches until ")" and treats the whole block as a single token.

I have been reading the documentation at https://spacy.io/usage/linguistic-features#section-tokenization but did not find an easy solution since adding special cases only works for exact matches. Do you have any advice on how this could be implemented?

Many thanks,

Ferran

Hi! If those types of constructions are common in your data, it might be a good idea to just make your tokenization a lot "stricter" overall and add rules that always split on "-", "(", ")" etc. This means you'll end up with more fine-grained tokens per entity span, but that typically doesn't matter that much. What matters is that you're able to represent the entities you're after as token-based tags :slightly_smiling_face:
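For example, a rough sketch of what that could look like, adding "-", "(" and ")" as extra infix rules on top of the existing ones (the model name here is just a placeholder, you'd load your scispacy model instead, and the exact splits you get will also depend on the rules that model already ships with):

```python
import spacy
from spacy.util import compile_infix_regex

nlp = spacy.load("en_core_web_sm")  # placeholder; swap in your scispacy model

# Always split on "-", "(" and ")" by adding them to the existing infix rules
infixes = list(nlp.Defaults.infixes) + [r"[-()]"]
nlp.tokenizer.infix_finditer = compile_infix_regex(infixes).finditer

doc = nlp("The AUC0-24h was 0.33. The AUC(0-inf) was 0.15.")
print([t.text for t in doc])
```

You'd then end up with something like ["AUC0", "-", "24h"] and ["AUC", "(", "0", "-", "inf", ")"], which you can still cover with token-based NER annotations.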

Hi Ines, thank you very much for the quick reply.

Your suggestion makes a lot of sense for the NER task I mentioned.

I guess the reason behind trying to represent AUC(0-inf) and AUC0-24 as single tokens is that I was going to train my own word2vec model and look at clusters of kinetic parameters within my vocabulary. This would be difficult if AUC0-24 is split into 3 tokens, since it wouldn't appear as a single token in the vocabulary and it would be hard to look for "most similar" kinetic parameters. But I understand this is a different task, and perhaps it would be better to use different tokenization rules for each task.

I was just wondering if there is an easy way in spaCy to define a regex pattern that, if it matches, keeps the whole span as a single token.

Thanks once again and congratulations on your work

Ah okay, that makes a lot of sense :slightly_smiling_face:

Yes, there is the token_match setting, actually. You can see an example here: https://spacy.io/usage/linguistic-features#native-tokenizers

By default, this is used to match URLs and preserve them as one token. When you implement your regex, just keep an eye on performance: if your regex is inefficient, it can potentially make tokenization very slow (and that may be confusing to debug if you don't know what's happening).
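Roughly following the custom-tokenizer example from that page, a sketch could look like this. The regex is just a quick guess at the patterns you described (not tuned for real data), the model name is a placeholder, and note that passing your own token_match also replaces the default URL matching unless you fold it into the pattern:

```python
import re
import spacy
from spacy.tokenizer import Tokenizer
from spacy.util import compile_prefix_regex, compile_suffix_regex, compile_infix_regex

nlp = spacy.load("en_core_web_sm")  # placeholder; swap in your scispacy model

# Rough pattern for the kinetic parameters described above:
# "AUC0-" up to the next whitespace, or "AUC(" up to the closing ")"
auc_re = re.compile(r"AUC0-\S+|AUC\([^)]*\)")

prefix_re = compile_prefix_regex(nlp.Defaults.prefixes)
suffix_re = compile_suffix_regex(nlp.Defaults.suffixes)
infix_re = compile_infix_regex(nlp.Defaults.infixes)

# Recreate the tokenizer with the default rules, plus our own token_match
nlp.tokenizer = Tokenizer(
    nlp.vocab,
    rules=nlp.Defaults.tokenizer_exceptions,
    prefix_search=prefix_re.search,
    suffix_search=suffix_re.search,
    infix_finditer=infix_re.finditer,
    token_match=auc_re.match,
)

doc = nlp("The AUC0-24h was 0.33. The AUC(0-inf) was 0.15.")
print([t.text for t in doc])
```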

Thanks a lot, Ines! This seems like a very useful feature, and I ended up implementing it that way.

As far as I understand, the token_match setting is applied after the suffixes and prefixes have been checked. I presume this works well for most examples, but the trailing parenthesis of AUC(0-inf), for instance, would already have been split off, since ")" is specified as a suffix. The same happens with other kinetic parameters sometimes written as C(max) or T(max) in the scientific literature. This is not a problem for my application, since those entities will always consist of 2 tokens. But I thought there might be a benefit in having the option to specify immutable substrings matching a regular expression right after whitespace splitting in the Tokenizer class.
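As a side note, in case it helps anyone reading this later: newer spaCy versions (v2.3+) have nlp.tokenizer.explain(), which shows which rule (PREFIX, SUFFIX, TOKEN_MATCH, special case, ...) produced each token, so you can check this kind of interaction directly (model name is again just a placeholder):

```python
import spacy

nlp = spacy.load("en_core_web_sm")  # placeholder; swap in your scispacy model

# Each entry is (rule that produced the token, token text)
for rule, token_text in nlp.tokenizer.explain("The C(max) was 2.1."):
    print(rule, repr(token_text))
```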

In any case, your advice is more than enough for my purposes, so thanks a lot. I very much appreciate it.