source and pattern file size for ner.teach

az1373 · November 25, 2020, 9:41am

I have a corpus of about 2 million sentences in JSONL format. Since a larger corpus is recommended for the ner.teach recipe, is this a suitable size, or should I increase or decrease it?

There are other corpora with about 200,000 sentences that I could add to this one or use instead.

In addition, I have about 55,000 patterns (converted directly from a list of words). Should I drop some of these, or is more better for the ner.teach active learning model?

I also have smaller patterns files with about 2,000 patterns that I could add to this one or use instead.

ines · November 26, 2020, 12:19am

Yes, this sounds good. If the file can be read in line-by-line (e.g. JSONL), Prodigy will process it as a stream, so you never have to load the whole thing into memory. This makes it easy to work with large corpora and potentially infinite streams. If your file gets too big, you can always split it into multiple chunks if needed.
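To illustrate the idea of line-by-line streaming and chunking, here is a minimal sketch in plain Python. It does not use Prodigy's own loaders; `stream_jsonl`, `split_jsonl` and the `chunk_NNNN.jsonl` file naming are hypothetical helpers made up for this example, and it assumes one JSON object per line.

```python
import json
from itertools import islice
from pathlib import Path


def stream_jsonl(path):
    """Yield one parsed record per line, so the file is never fully in memory."""
    with open(path, encoding="utf8") as f:
        for line in f:
            line = line.strip()
            if line:  # skip blank lines
                yield json.loads(line)


def split_jsonl(path, out_dir, chunk_size):
    """Write consecutive chunks of up to `chunk_size` records to numbered files.

    Hypothetical helper: the output naming scheme is an assumption.
    Only one chunk is held in memory at a time.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    records = stream_jsonl(path)
    written = []
    index = 0
    while True:
        chunk = list(islice(records, chunk_size))
        if not chunk:
            break
        out_path = out_dir / f"chunk_{index:04d}.jsonl"
        with open(out_path, "w", encoding="utf8") as out:
            for record in chunk:
                out.write(json.dumps(record, ensure_ascii=False) + "\n")
        written.append(out_path)
        index += 1
    return written
```

Because `stream_jsonl` is a generator, memory use depends on `chunk_size` rather than on the size of the corpus, so the same code should work for 200,000 or 2 million sentences.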

55k patterns are okay, too, and the matching shouldn't have a big impact on speed and memory consumption at this point. That said, you could consider pruning the patterns a bit (e.g. by frequency) or start with the smaller file first and see how you go. The patterns are only going to be used to find more positive examples, in addition to what the model already suggests.
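As a rough sketch of what pruning by frequency could look like: the code below counts how often each word occurs in a sample of the corpus and keeps only the single-word patterns seen at least `min_count` times. The helpers `term_of` and `prune_patterns` are hypothetical, and the tokenization (lowercasing, whitespace splitting, stripping surrounding punctuation) is a simplifying assumption that will not match spaCy's tokenizer exactly. It assumes pattern entries of the form `{"label": ..., "pattern": ...}`, where the pattern is either a string or a list of token dicts with a `"lower"` key. Anything it cannot interpret, such as multi-token patterns, is kept unchanged.

```python
import string
from collections import Counter


def term_of(entry):
    """Return the lowercased word of a single-word pattern, else None."""
    spec = entry.get("pattern")
    if isinstance(spec, str) and len(spec.split()) == 1:
        return spec.lower()
    if (
        isinstance(spec, list)
        and len(spec) == 1
        and isinstance(spec[0], dict)
        and "lower" in spec[0]
    ):
        return spec[0]["lower"]
    return None  # multi-token or unrecognized pattern


def prune_patterns(patterns, texts, min_count=2):
    """Keep patterns whose word occurs at least `min_count` times in `texts`."""
    counts = Counter()
    for text in texts:
        for token in text.lower().split():
            token = token.strip(string.punctuation)
            if token:
                counts[token] += 1
    kept = []
    for entry in patterns:
        term = term_of(entry)
        # Patterns we cannot interpret are kept rather than silently dropped.
        if term is None or counts[term] >= min_count:
            kept.append(entry)
    return kept
```

`texts` can be any iterable of strings, so it can be fed from a stream over the corpus or from a smaller sample. The right `min_count` depends on the data, and very rare terms may still be worth keeping if they are exactly the entities you want to find.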

az1373 · November 26, 2020, 8:55am

Okay, I'll do so. Many thanks.
