dep.teach doesn't use the same tokenization as the pretrained model

@kak-to-tak How is your custom tokenizer implemented? Prodigy will use the model's nlp.make_doc method to create a tokenized Doc from the string of text. By default, this will call into nlp.tokenizer. So your custom tokenization should be implemented via the model's tokenizer.
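For instance, a custom tokenizer can be attached by overwriting `nlp.tokenizer`, so that `nlp.make_doc` (and therefore Prodigy) picks it up. This is just a minimal sketch using a hypothetical whitespace-only tokenizer, not your actual implementation:

```python
import spacy
from spacy.tokens import Doc

class WhitespaceTokenizer:
    """Hypothetical example tokenizer: splits on single spaces only."""

    def __init__(self, vocab):
        self.vocab = vocab

    def __call__(self, text):
        words = text.split(" ")
        # Build a Doc directly from the pre-split words
        return Doc(self.vocab, words=words)

nlp = spacy.blank("en")
# Overwrite the model's tokenizer so nlp.make_doc uses it
nlp.tokenizer = WhitespaceTokenizer(nlp.vocab)

doc = nlp.make_doc("hello world")
print([t.text for t in doc])
```

If your tokenizer is attached like this, any recipe that calls `nlp.make_doc` will produce the same token boundaries you trained with.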

Alternatively, you can feed in pre-tokenized data that has a "tokens" property. See here for an example of the format: https://prodi.gy/docs/api-interfaces#dep
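To illustrate, a pre-tokenized task in that format might look like the following. The exact field set is documented at the link above; the token texts and offsets here are made up for the example:

```python
# A minimal pre-tokenized task: each token carries its text,
# character offsets into "text", and an index ("id").
task = {
    "text": "hello world",
    "tokens": [
        {"text": "hello", "start": 0, "end": 5, "id": 0},
        {"text": "world", "start": 6, "end": 11, "id": 1},
    ],
}

# The offsets must line up with the raw text
for token in task["tokens"]:
    assert task["text"][token["start"]:token["end"]] == token["text"]
```

Since the tokens are provided explicitly, Prodigy will respect these boundaries instead of re-tokenizing the text.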