This tool has been awesome for getting started quickly, experimenting, and iterating on a fun information extraction project - thanks for your hard work!
First, I’m curious what to expect as we add more and more entities. For example, we started with PERSON, and trained it up to really good results on its own. Adding additional entities (pre-trained or not) hasn’t produced results as good yet, but I have some ideas that I’m addressing separately, with some very helpful advice from this thread. I’m also wondering whether NER should be expected to get better or worse as more entities are added. I’m sure a lot depends on the quality of the annotations and how well the entities train independently, but I’m curious what to expect here at a high level.
And secondly, I think I saw some advice (sorry, can’t find where) to keep all of a project’s annotations together in one dataset, which has been my approach so far. But in my ideal scenario (from a data collection perspective), I’m starting to feel like I want to build up a separate annotation dataset for each entity type, and merge them together (via some workflow like `db-out` -> concatenate the JSONL files -> `db-in`, sketched below) to create a composite dataset for training. This way, if (well, when) we change our mind about how to annotate entity XYZ, that decision (and the resulting re-annotation work) would only impact a portion of the annotations (vs. all of them). And I think what I’m reading here is that this incremental approach makes sense?