Hi,
Had a search but couldn't find the answer to my query.
I am using ner.simple to try and bring out different expressions of timings in my enterprise. My hunch is that training through Prodigy will be faster than a massive nightmarish regex chain.
Sadly I have hit a snag. The date portion in a string like:
( 18015)(20180323)_Form_Acceso
is inside the second set of parentheses, but the UI only lets me annotate the entire string.
Any advice on how to be able to highlight substrings?
Thanks in advance,
Niko
Hi! By default, Prodigy will pre-tokenize your text, so your selection can "snap" to the token boundaries for faster annotation, and to make sure that the spans you annotate always map to the tokenization. In Prodigy v1.10+, you can enable character-based highlighting by setting the --highlight-chars flag. You can read more about this here: https://prodi.gy/docs/named-entity-recognition#highlight-chars
However, keep in mind that if your goal is to train a named entity recognizer, your model will be predicting token-based tags later on – so your annotations need to line up with the tokens your model's tokenizer actually produces, otherwise it won't be able to learn anything from them. So if the tokenization is often too coarse for the spans you want (as with the date inside the parentheses here), it might be worth tweaking the tokenization rules a bit.
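To make the idea concrete, here is a minimal sketch of a pre-processing step in plain Python. It is not Prodigy's or spaCy's own API: the helper name pad_delimiters and the choice of delimiters (parentheses and underscores) are assumptions based on the example string above. It pads those characters with spaces so that a whitespace-aware tokenizer sees the date as its own token. In spaCy, the analogous change would be extending the tokenizer's infix patterns instead of rewriting the text.

```python
import re

# Assumption: parentheses and underscores are the only delimiters that
# need to become standalone tokens in this kind of string.
SPLIT_CHARS = re.compile(r"([()_])")


def pad_delimiters(text: str) -> str:
    """Hypothetical helper: surround each delimiter with spaces,
    then collapse runs of whitespace into single spaces."""
    padded = SPLIT_CHARS.sub(r" \1 ", text)
    return " ".join(padded.split())


example = "( 18015)(20180323)_Form_Acceso"
cleaned = pad_delimiters(example)

print(cleaned)
# Whitespace splitting now isolates the date as its own token.
print(cleaned.split())
```

If you go this route, the same pre-processing has to be applied to the text the model sees at prediction time, so that training and runtime tokenization stay consistent.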
Perfect, thanks a lot @ines! That makes sense - I need to have a think about the tokenization pipeline. If I understand you right, then on second thought it makes more sense to be more exact on the pre-processing side of things rather than offload the problem onto Prodigy and end up training a model on tokens that might not occur in the actual data.
Thanks again and have a great weekend!