Hi,
For an NER project with only 2 entities, starting from scratch (not using existing entities), I have so far created 3 datasets:
1: 480 annotations using ner.manual of ONLY entity A
2: 300 annotations using ner.manual of ONLY entity B (only 80 sentences overlap with dataset 1)
3: 1150 binary annotations using ner.teach of ONLY entity B, using the model created by training on datasets 1 and 2 (all settings default, without any flags)
Needless to say, with so few annotations I want to spend more time annotating. I also noticed that binary annotation was significantly (>5x) faster than manual annotation (even with patterns).
My questions are:
- How should I proceed with labeling? It seems most time-efficient to collect more binary annotations using ner.teach for entity A, then more for B (but not both at the same time). Should they ideally be on the same sentences?
- Can having such large separate binary datasets for A and B be the end point, or should I convert them to gold via silver-to-gold or ner.correct (which destroys the efficiency gain of binary annotations)?
- How should I perform both final and intermediate training?
** Should I simply train the base (NER-less) model with all datasets (binary and manual) at once, or first train a model on the manual datasets and then update it with the binary one?
** Given that I want to do binary annotation of only one entity at a time, should I periodically retrain and switch entities to prevent the model from becoming one-sided?
** Which flags should I use, and when? (I can at least choose -B, -NM, -E, or a combination.)
Sorry if these questions seem basic. I could figure out the answers for the simple case of a pure GOLD + binary dataset, but my datasets are quite separate (no true GOLD set exists, since datasets 1 and 2 barely overlap) and I want to do binary labeling for A and B separately, so the answers might be different.
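To make the overlap issue concrete, here is a minimal sketch of how the sentences shared by datasets 1 and 2 could be combined into examples that carry both labels. It assumes the exported examples are dicts with "text" and "spans" keys (spans having "start", "end" and "label"); the function name and the toy sentences are made up for illustration and are not my real data.

```python
def merge_shared(examples_a, examples_b):
    """Return examples whose text occurs in both datasets, with spans from both.

    Hypothetical helper: matches sentences on their exact text.
    """
    spans_a = {eg["text"]: eg.get("spans", []) for eg in examples_a}
    merged = []
    for eg in examples_b:
        text = eg["text"]
        if text in spans_a:
            # Only sentences annotated for A *and* B are complete for both labels.
            spans = spans_a[text] + eg.get("spans", [])
            merged.append({"text": text, "spans": sorted(spans, key=lambda s: s["start"])})
    return merged


# Toy stand-ins for dataset 1 (entity A only) and dataset 2 (entity B only).
dataset_a = [
    {"text": "Acme hired Bob.", "spans": [{"start": 0, "end": 4, "label": "A"}]},
]
dataset_b = [
    {"text": "Acme hired Bob.", "spans": [{"start": 11, "end": 14, "label": "B"}]},
    {"text": "Bob left.", "spans": [{"start": 0, "end": 3, "label": "B"}]},
]

merged = merge_shared(dataset_a, dataset_b)
```

With my data this would yield only about 80 fully annotated sentences; everything else has annotations for one entity but says nothing about the other.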
Thanks a lot in advance!