Change the tokenizer after the annotation?

I have already annotated some data for NER. Is it possible to change the tokenizer after the annotation is done and still keep the annotations?

This is tricky territory because it might be the case that some of your NER annotations no longer fit the new tokens. Could you elaborate more on your situation? Why do you need to change your tokeniser?

I would like to use NER in the chemistry domain. An adapted tokenizer that, for example, does not split chemical formulas apart and instead treats each one as a single unit should perform better. However, I am quite new to the field and only know this from the literature.

In general, you can't assume that you can change the tokenizer without consequences. However, if the retokenization is well understood, you might be able to write a Python script to handle it.
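As a rough illustration of what such a script could check first: the sketch below assumes the annotations are stored as character offsets (`start`, `end`, `label`) and reports which spans still line up with the new token boundaries. The `whitespace_offsets` function is only a hypothetical stand-in for the new tokenizer, and the example sentence is made up.

```python
import re


def whitespace_offsets(text):
    """Hypothetical stand-in for the new tokenizer.

    Returns (start, end) character offsets of whitespace-separated tokens.
    """
    return [(m.start(), m.end()) for m in re.finditer(r"\S+", text)]


def split_by_alignment(text, spans, tokenize=whitespace_offsets):
    """Separate spans that match the new token boundaries from those that don't."""
    offsets = tokenize(text)
    token_starts = {start for start, _ in offsets}
    token_ends = {end for _, end in offsets}
    aligned, misaligned = [], []
    for span in spans:
        fits = span["start"] in token_starts and span["end"] in token_ends
        (aligned if fits else misaligned).append(span)
    return aligned, misaligned


text = "We dissolved NaCl in water."
spans = [
    {"start": 13, "end": 17, "label": "CHEM"},     # "NaCl"
    {"start": 21, "end": 26, "label": "SOLVENT"},  # "water" without the final "."
]
aligned, misaligned = split_by_alignment(text, spans)
```

Here the "NaCl" span survives, while "water" is flagged because the stand-in tokenizer keeps the trailing period attached. Flagged spans are the ones that need a manual look or an explicit repair rule.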

Do you have an illustrative example by any chance?

This publication discusses the topic in more detail, and it also includes an example. Tokenizing according to the rules of ChemDataExtractor would be very helpful because it is often used in the literature. Here are suitable links to it:

Thanks for your help!

I glanced at the examples, and I agree it's an interesting problem. A big part of the issue seems to be that there's no perfect standard for writing chemical bonds in text.

If you have a spaCy tokenizer at the ready, I might be able to help think along, but I think it would be good to collect a dataset to evaluate your custom tokenizer. You can set up ner.manual to use character-based annotation, which can help a lot here.
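To make the evaluation idea concrete, here is a minimal sketch that compares a tokenizer against a small hand-written gold set. The gold examples are invented for illustration and are not taken from ChemDataExtractor, and `str.split` is used only as a placeholder for the tokenizer under test.

```python
def evaluate_tokenizer(tokenize, gold_examples):
    """Return exact-match accuracy and the list of examples that differ."""
    failures = []
    for text, expected in gold_examples:
        predicted = tokenize(text)
        if predicted != expected:
            failures.append(
                {"text": text, "expected": expected, "predicted": predicted}
            )
    if not gold_examples:
        return 0.0, failures
    accuracy = 1 - len(failures) / len(gold_examples)
    return accuracy, failures


# Invented gold tokenizations, purely illustrative.
GOLD = [
    ("Add NaCl slowly", ["Add", "NaCl", "slowly"]),
    ("Heat to 80 °C.", ["Heat", "to", "80", "°C", "."]),
]

# str.split is a placeholder; pass any callable that maps text to a token list.
accuracy, failures = evaluate_tokenizer(str.split, GOLD)
```

A spaCy pipeline could be plugged in by passing a function that returns `[token.text for token in nlp(text)]`. The failure list is usually more useful than the score, since it shows exactly where a formula was split.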

I'm also wondering if it might be more useful to have a system where you first detect the chemical reference in the text and that you then parse it separately. I'm not familiar enough with the domain to judge if this is practical though.

Here is one way to implement the rules with a spaCy tokenizer: cde2.1-ner-supplementary/alt_tokenizers.py at master · ti250/cde2.1-ner-supplementary · GitHub

However, I have not been able to apply it to my already annotated data. My text does not contain many chemical formulas, but it would still be very good if the rules of this tokenizer could be applied.

@koaning Do you think this is possible with the GitHub code from my post above?

hi @yllwpr!

@koaning is out for a few weeks on parental duties.

I glanced at the problem and I think that GitHub link looks promising for implementing the rules of ChemDataExtractor. You mentioned you haven't been able to try it out since your data doesn't have many chemical formulas. Have you tried to see if it works on a few test cases (e.g., examples provided in the link you previously sent)? Are you having any other problems?

If you're having issues thinking about spaCy tokenizers, you may also want to consider posting on the spaCy community discussion site. The spaCy team answers posts from that site and they have a lot of experience with tokenizers.

Thanks for your feedback. Yes, I was able to test it. It's now a matter of converting the already annotated data to the new tokenizer. Maybe I will ask about this in the spaCy community. Thanks for the help!
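For anyone landing here with the same conversion step, one possible approach is sketched below. It assumes each record has a `text` plus character-offset `spans`, rebuilds the `tokens` list with the new tokenizer, and expands every span to the tokens it overlaps so that `token_start` and `token_end` (inclusive) can be recomputed. The `regex_offsets` tokenizer is a hypothetical placeholder, and the exact keys your Prodigy version expects should be checked against its documentation.

```python
import re


def regex_offsets(text):
    """Hypothetical placeholder for the new tokenizer: (start, end) offsets."""
    return [(m.start(), m.end()) for m in re.finditer(r"\S+", text)]


def convert_task(task, tokenize=regex_offsets):
    """Rebuild tokens and snap each span outward to the new token boundaries."""
    text = task["text"]
    tokens = [
        {"text": text[start:end], "start": start, "end": end, "id": i}
        for i, (start, end) in enumerate(tokenize(text))
    ]
    new_spans = []
    for span in task.get("spans", []):
        # Tokens whose character range overlaps the annotated range.
        covered = [
            t for t in tokens
            if t["start"] < span["end"] and t["end"] > span["start"]
        ]
        if not covered:
            continue  # nothing to attach the span to; skip it
        first, last = covered[0], covered[-1]
        new_spans.append({
            "start": first["start"],
            "end": last["end"],
            "label": span["label"],
            "token_start": first["id"],
            "token_end": last["id"],  # inclusive index of the last token
        })
    return {**task, "tokens": tokens, "spans": new_spans}


task = {
    "text": "Add H2O to NaCl.",
    "spans": [
        {"start": 4, "end": 7, "label": "CHEM"},    # "H2O"
        {"start": 11, "end": 15, "label": "CHEM"},  # "NaCl" without the "."
    ],
}
converted = convert_task(task)
```

Note that expanding changes what was annotated: with this placeholder tokenizer the second span grows from "NaCl" to "NaCl.". If that is not acceptable, spans that needed expansion could be routed back for manual review instead. With a real spaCy tokenizer, `Doc.char_span` with its `alignment_mode` argument covers the same strict-or-expand choice.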