I still have a few questions though, @magdaaniol:
1. Side-effects of more tokens
During labeling I see examples that would need even finer tokens, especially for numbers.
Does producing even more fine-grained tokens have any side effects, especially for numbers? I imagine that when I break a number like "12345"
into ["1", "2", "3", "4", "5"],
this somehow loses information? (See the small sketch below for what I mean.)
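Just to make it concrete, here is roughly the kind of split I mean. This only contrasts the default tokenization with the per-digit split I'm producing; my actual custom rules are not shown here:

```python
import spacy

# Default English tokenizer keeps the number as one token
nlp = spacy.blank("en")
print([t.text for t in nlp("12345")])  # ["12345"]

# What my custom tokenizer produces instead: one token per digit
per_digit = list("12345")
print(per_digit)                       # ["1", "2", "3", "4", "5"]
```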
2. tok2vec shows loss
In this post you said that the tok2vec layer is not trained. I didn't fully understand why, but my training output looks like this:
So I do see a loss in the tok2vec column. Is this expected?
3. Using custom tokenizer in ner.correct
After I trained and saved the model with the custom tokenizer as above, should I be able to use it in an ner.correct
recipe without problems, like this?
prodigy ner.correct dataset_net ./models_custom_tok/model-best/ corpus/my_examples.jsonl --label LABEL1,LABEL2,LABEL3
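Before running that, this is roughly how I'd sanity-check on my side that the custom tokenizer was actually saved with model-best (just a hedged check, assuming a standard spaCy load of the path above):

```python
import spacy

# Load the trained pipeline that should contain the custom tokenizer
nlp = spacy.load("./models_custom_tok/model-best")

# Should report my custom tokenizer class, not the default Tokenizer
print(type(nlp.tokenizer))

# Should show the fine-grained, per-digit split from my custom rules
print([t.text for t in nlp("12345")])
```

Is that check sufficient, or does ner.correct need anything else to pick up the tokenizer from the model?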