Hi! I am trying to train an NER model, using Prodigy for annotations, but I have run into a problem at the tokenization stage. I am using the scispaCy tokenizer.
I am trying to detect kinetic parameters in scientific text, which typically look like the following:
text = "The AUC0-24h was 0.33. The AUC(0-inf) was 0.15."
AUC0-24h and AUC(0-inf) are both named entities that I would like to recognise. However, during the tokenization step, they are both divided into:
["AUC0", "-", "24h"] and ["AUC(0-inf", ")"]
Other variations such as AUC0-12 or AUC(0-12) are also very common. So I was trying to modify the tokenizer so that:
(1) whenever it finds "AUC0-", it matches up to the next whitespace and treats the whole block AUC0-? as a single token;
(2) whenever it finds "AUC(", it matches up to the closing ")" and treats the whole block as a single token.
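In plain regex terms, the two rules above would look roughly like this (the pattern names are just placeholders I made up):

```python
import re

# Rule (1): "AUC0-" followed by anything up to the next whitespace
AUC_DASH = re.compile(r"AUC0-\S+")
# Rule (2): "AUC(" followed by anything up to the closing ")"
AUC_PAREN = re.compile(r"AUC\([^)\s]*\)")

for s in ["AUC0-24h", "AUC0-12", "AUC(0-inf)", "AUC(0-12)"]:
    print(s, bool(AUC_DASH.fullmatch(s) or AUC_PAREN.fullmatch(s)))  # each prints True
```

The question is where to plug these patterns into the tokenizer.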
I have been reading the documentation at https://spacy.io/usage/linguistic-features#section-tokenization but did not find an easy solution, since adding special cases (tokenizer exceptions) only works for exact strings. Do you have any advice on how this could be implemented?
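One idea I had, which I am not sure is the intended approach, is to override the tokenizer's `token_match` hook, which spaCy consults for substrings that should never be split (in spaCy v2.3+/v3 it takes priority over the prefix, suffix and infix rules). A minimal sketch, using a blank English pipeline for illustration rather than my scispaCy model, and with a regex I made up for the two rules:

```python
import re
import spacy

# Combined pattern for rules (1) and (2); the name is my own placeholder.
AUC_RE = re.compile(r"AUC0-\S+|AUC\([^)\s]*\)")

# Blank pipeline used here for illustration; with scispacy you would load
# your model instead and patch its tokenizer the same way.
nlp = spacy.blank("en")

# Chain to any existing token_match hook so the defaults are preserved.
default_token_match = nlp.tokenizer.token_match
nlp.tokenizer.token_match = lambda s: AUC_RE.fullmatch(s) or (
    default_token_match(s) if default_token_match is not None else None
)

doc = nlp("The AUC0-24h was 0.33. The AUC(0-inf) was 0.15.")
print([t.text for t in doc])
```

With this in place I would expect AUC0-24h and AUC(0-inf) to survive as single tokens. Using `fullmatch` rather than `match` should also keep trailing punctuation (e.g. a sentence-final period attached directly to the entity) from being swallowed into the token, since the suffix rules can still peel it off first. I have not verified how this interacts with scispaCy's customized tokenizer, though.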