Add tokenization rule

Hi! I am trying to train an NER model using Prodigy for annotations, but I have a problem during the tokenization stage. I am using the scispacy tokenizer.

I am trying to detect kinetic parameters in scientific text, which typically look like the following:
text = "The AUC0-24h was 0.33. The AUC(0-inf) was 0.15."

AUC0-24h and AUC(0-inf) are both named entities that I would like to recognise. However, during the tokenization step, they are both divided into:

["AUC0", "-", "24h"] and ["AUC(0-inf", ")"]

Very often we will find other variations such as AUC0-12 or AUC(0-12). So I was trying to modify the tokenizer such that:

(1) Whenever it finds "AUC0-", it matches until the next whitespace and treats the whole block AUC0-? as a single token.

(2) Whenever it finds "AUC(", it matches until ")" and treats the whole block as a single token.

I have been reading the documentation at https://spacy.io/usage/linguistic-features#section-tokenization but did not find an easy solution since adding special cases only works for exact matches. Do you have any advice on how this could be implemented?

Many thanks,

Ferran

Hi! If those types of constructions are common in your data, it might be a good idea to just make your tokenization a lot "stricter" overall and add rules that always split on "-", "(", ")" etc. This means you'll end up with more fine-grained tokens per entity span, but that typically doesn't matter that much. What matters is that you're able to represent the entities you're after as token-based tags :slightly_smiling_face:
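For example, a rough sketch of what that could look like, adding "-", "(" and ")" as extra infix rules on top of the existing ones (the model name here is just a placeholder, you'd load your scispacy model instead, and the exact splits you get will also depend on the rules that model already ships with):

```python
import spacy
from spacy.util import compile_infix_regex

nlp = spacy.load("en_core_web_sm")  # placeholder; swap in your scispacy model

# Always split on "-", "(" and ")" by adding them to the existing infix rules
infixes = list(nlp.Defaults.infixes) + [r"[-()]"]
nlp.tokenizer.infix_finditer = compile_infix_regex(infixes).finditer

doc = nlp("The AUC0-24h was 0.33. The AUC(0-inf) was 0.15.")
print([t.text for t in doc])
```

You'd then end up with something like ["AUC0", "-", "24h"] and ["AUC", "(", "0", "-", "inf", ")"], which you can still cover with token-based NER annotations.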

Hi Ines, thank you very much for the quick reply.

Your suggestion makes a lot of sense for the NER task I mentioned.

I guess the reason behind trying to represent AUC(0-inf) and AUC0-24 as single tokens is that I was going to train my own word2vec model and look at clusters of kinetic parameters within my vocabulary. This would be difficult if AUC0-24 is split into 3 tokens, since it wouldn't appear as a single token in the vocabulary and it would be hard to look for "most similar" kinetic parameters. But I understand this is a different task, and perhaps it would be better to use different tokenization rules for each task.

I was just wondering if there is an easy way in spaCy to define a regex pattern that, if it matches, keeps the whole span as a single token.

Thanks once again and congratulations on your work

Ah okay, that makes a lot of sense :slightly_smiling_face:

Yes, there is the token_match setting, actually. You can see an example here: https://spacy.io/usage/linguistic-features#native-tokenizers

By default, this is used to match URLs and preserve them as one token. When you implement your regex, just keep an eye on performance: if your regex is inefficient, it can potentially make tokenization very slow (and that may be confusing to debug if you don't know what's happening).
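Roughly following the custom-tokenizer example from that page, a sketch could look like this. The regex is just a quick guess at the patterns you described (not tuned for real data), the model name is a placeholder, and note that passing your own token_match also replaces the default URL matching unless you fold it into the pattern:

```python
import re
import spacy
from spacy.tokenizer import Tokenizer
from spacy.util import compile_prefix_regex, compile_suffix_regex, compile_infix_regex

nlp = spacy.load("en_core_web_sm")  # placeholder; swap in your scispacy model

# Rough pattern for the kinetic parameters described above:
# "AUC0-" up to the next whitespace, or "AUC(" up to the closing ")"
auc_re = re.compile(r"AUC0-\S+|AUC\([^)]*\)")

prefix_re = compile_prefix_regex(nlp.Defaults.prefixes)
suffix_re = compile_suffix_regex(nlp.Defaults.suffixes)
infix_re = compile_infix_regex(nlp.Defaults.infixes)

# Recreate the tokenizer with the default rules, plus our own token_match
nlp.tokenizer = Tokenizer(
    nlp.vocab,
    rules=nlp.Defaults.tokenizer_exceptions,
    prefix_search=prefix_re.search,
    suffix_search=suffix_re.search,
    infix_finditer=infix_re.finditer,
    token_match=auc_re.match,
)

doc = nlp("The AUC0-24h was 0.33. The AUC(0-inf) was 0.15.")
print([t.text for t in doc])
```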

Thanks a lot, Ines! This seems like a very useful feature, and I ended up implementing it that way.

As far as I understand, the token_match setting is applied after the suffixes and prefixes have been checked. I presume this works well for most examples, but the trailing parenthesis of AUC(0-inf), for instance, would already have been split off, since ")" is specified as a suffix. The same happens with other kinetic parameters sometimes written as C(max) or T(max) in the scientific literature. This is not a problem for my application, since those entities will always consist of 2 tokens. But I thought there might be a benefit in having the option to specify immutable substrings matching a regular expression right after whitespace splitting in the Tokenizer class.
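As a side note, in case it helps anyone reading this later: newer spaCy versions (v2.3+) have nlp.tokenizer.explain(), which shows which rule (PREFIX, SUFFIX, TOKEN_MATCH, special case, ...) produced each token, so you can check this kind of interaction directly (model name is again just a placeholder):

```python
import spacy

nlp = spacy.load("en_core_web_sm")  # placeholder; swap in your scispacy model

# Each entry is (rule that produced the token, token text)
for rule, token_text in nlp.tokenizer.explain("The C(max) was 2.1."):
    print(rule, repr(token_text))
```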

In any case, your advice is more than enough for my purposes, so thanks a lot. I very much appreciate it.