- I have a corpus of about 2 million sentences in JSONL format. Since a larger corpus is recommended for the ner.teach recipes, is this a suitable size, or should I increase or decrease it?
I also have other corpora with about 200,000 sentences, which I could add to this one or use instead.
- In addition, I have about 55,000 patterns (converted directly from a list of words). Should I drop some of these, or is more better for the ner.teach active learning model?
I also have smaller patterns files with about 2,000 patterns, which I could add to this one or use instead.
Yes, this sounds good. If the file can be read in line by line (e.g. JSONL), Prodigy will process it as a stream, so you never have to load the whole thing into memory. This makes it easy to work with large corpora and potentially infinite streams. If your file gets too big, you can always split it into multiple chunks if needed.
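To illustrate the idea, here is a minimal sketch in plain Python (not Prodigy's own loader) of reading a JSONL file lazily and splitting it into smaller chunk files. The function names `stream_jsonl` and `split_jsonl`, the chunk size and the output file naming are all made up for this example.

```python
import json
from itertools import islice


def stream_jsonl(path):
    """Yield one parsed record per non-empty line, without loading the whole file."""
    with open(path, encoding="utf8") as f:
        for line in f:
            line = line.strip()
            if line:
                yield json.loads(line)


def split_jsonl(path, chunk_size, prefix):
    """Write the records of `path` into numbered files of at most `chunk_size` lines.

    Returns the list of chunk file paths that were written.
    """
    stream = stream_jsonl(path)
    paths = []
    while True:
        # Only `chunk_size` records are held in memory at any time.
        chunk = list(islice(stream, chunk_size))
        if not chunk:
            break
        out_path = f"{prefix}_{len(paths):03d}.jsonl"
        with open(out_path, "w", encoding="utf8") as out:
            for record in chunk:
                out.write(json.dumps(record, ensure_ascii=False) + "\n")
        paths.append(out_path)
    return paths
```

Because `stream_jsonl` is a generator, memory use stays roughly constant regardless of how many lines the file has. Something like `split_jsonl("corpus.jsonl", 200000, "corpus_part")` would give you chunks you can annotate one at a time.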
55k patterns are okay, too, and the matching shouldn't have a big impact on speed and memory consumption at this point. That said, you could consider pruning the patterns a bit (e.g. by frequency) or start with the smaller file first and see how you go. The patterns are only going to be used to find more positive examples, in addition to what the model already suggests.
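If you want to try pruning by frequency, a rough sketch could look like the following. It assumes that every pattern is a single word stored as `{"label": ..., "pattern": [{"lower": "word"}]}` (which may not match your file exactly), that lowercased whitespace splitting is a good enough stand-in for real tokenization, and that `min_freq` is a cutoff you pick yourself. The helper names are hypothetical.

```python
from collections import Counter


def count_words(texts):
    """Count lowercased, whitespace-separated words across an iterable of texts."""
    counts = Counter()
    for text in texts:
        counts.update(text.lower().split())
    return counts


def prune_patterns(patterns, counts, min_freq=5):
    """Keep only single-word patterns whose word occurs at least `min_freq` times.

    Assumes each entry looks like {"label": "X", "pattern": [{"lower": "word"}]}.
    """
    kept = []
    for entry in patterns:
        word = entry["pattern"][0]["lower"]
        if counts[word] >= min_freq:
            kept.append(entry)
    return kept
```

Counting over a sample of the corpus (say, the first few hundred thousand sentences) is usually enough to spot patterns that never or almost never occur in your data. Punctuation attached to words will skew the counts a little with this crude splitting, so treat the result as a starting point.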
Okay, I'll do so. Many thanks.