Substring Selection in Front End

niko · July 10, 2020, 7:29am

Hi,
Had a search but couldn't find the answer to my query.

I am using ner.simple to try and bring out different expressions of timings in my enterprise. My hunch is that training through prodigy will be faster than a massive nightmarish regex chain.

Sadly I have hit a snag. The date portion in a string like:
( 18015)(20180323)_Form_Acceso
is in between the second set of parentheses, but I can only annotate the entire string in the UI.

Any advice on how to be able to highlight substrings?
Thanks in advance,
Niko

ines · July 10, 2020, 10:46am

Hi! By default, Prodigy pre-tokenizes your text so that your selection can "snap" to token boundaries for faster annotation, and so that the spans you annotate always map to the tokenization. In Prodigy v1.10+, you can enable character-based highlighting by setting --highlight-chars. You can read more about this here: https://prodi.gy/docs/named-entity-recognition#highlight-chars

However, keep in mind that if your goal is to train a named entity recognizer, your model will be predicting token-based tags later on, so your annotations need to match valid tokens that your model produces; otherwise it won't be able to learn anything from them. So if it happens a lot that the tokenization isn't precise enough, it might be worth tweaking the tokenization rules a bit.
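
To make the alignment point concrete, here is a minimal sketch in plain Python. The two regex "tokenizers" are simplified stand-ins chosen for illustration, not Prodigy's or spaCy's actual tokenization rules, and the helper names `tokenize` and `aligns` are hypothetical. It only shows why a character-level span is usable for token-based training when its start and end fall on token boundaries.

```python
import re

def tokenize(text, pattern):
    """Return (start, end, token) triples for each regex match."""
    return [(m.start(), m.end(), m.group()) for m in re.finditer(pattern, text)]

def aligns(span, tokens):
    """True if the span's start and end both fall on token boundaries."""
    start, end = span
    starts = {s for s, _, _ in tokens}
    ends = {e for _, e, _ in tokens}
    return start in starts and end in ends

text = "( 18015)(20180323)_Form_Acceso"
date = "20180323"
date_span = (text.index(date), text.index(date) + len(date))

# Stand-in 1: split on whitespace only, so the date is buried in one long token.
coarse = tokenize(text, r"\S+")
# Stand-in 2: also treat parentheses and underscores as separate tokens.
fine = tokenize(text, r"[()_]|[^\s()_]+")

print(aligns(date_span, coarse))  # False
print(aligns(date_span, fine))    # True
```

With the coarser rules, a character-level highlight of the date would not correspond to any token sequence, which is the case where adjusting the tokenization rules is likely to pay off.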

niko · July 10, 2020, 11:07am

Perfect, thanks a lot @ines! That makes sense - I need to have a think about the tokenization pipeline. If I understand you right, then on second thought it makes more sense to be a bit more exact on the pre-processing side of things rather than offload the problem into prodigy, and end up training a model on tokens that might not occur in the actual data.
Thanks again and have a great weekend!
