Hi @jiebei,
Great question! I don't have a perfect solution but let me provide a few additional resources that help provide more context on Matt's points above.
First, this thread from Ines and Sofie has some great words of wisdom:
Also, you may find some additional inspiration from the textcat documentation on Dealing with very large label sets or hierarchical labels. This provides context on the philosophy of breaking the task up into the smallest parts possible:
If you’re working on a task that involves more than 10 or 20 labels, it’s often better to break the annotation task up a bit more, so that annotators don’t have to remember the whole annotation scheme. Remembering and applying a complicated annotation scheme can slow annotation down a lot, and lead to much less reliable annotations. Because Prodigy is programmable, you don’t have to approach the annotations the same way you want your models to work. You can break up the work so that it’s easy to perform reliably, and then merge everything back later when it’s time to train your models.
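To make the "break it up" idea concrete, here's a minimal sketch of splitting a large label set into several smaller annotation passes, each of which could become its own Prodigy session. The label names are hypothetical, just for illustration:

```python
# Sketch: split a large label set into smaller annotation passes
# so annotators only juggle a few labels at a time.
# These label names are made-up examples.

ALL_LABELS = [
    "PERSON", "ORG", "TEAM", "PRODUCT", "EVENT",
    "LOCATION", "DATE", "LAW", "POLICY", "DEPARTMENT",
]

def chunk_labels(labels, max_per_pass=4):
    """Group labels into small batches, one batch per annotation pass."""
    return [labels[i:i + max_per_pass] for i in range(0, len(labels), max_per_pass)]

passes = chunk_labels(ALL_LABELS)
for i, batch in enumerate(passes, start=1):
    # Each pass could become its own Prodigy session, e.g.:
    #   prodigy ner.manual my_dataset_pass1 blank:en data.jsonl --label PERSON,ORG,TEAM,PRODUCT
    print(f"Pass {i}: --label {','.join(batch)}")
```

After all passes, you'd merge the datasets back together (e.g., with `db-merge`) before training.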
If this is your first time training any NER model, it's important to know that your NER entities will very likely change as you label, i.e., you'll modify your annotation scheme for the entities.
Let's take an example from the first time I tried to use Prodigy NER: labeling a company's internal policies to identify different internal teams/organizations. When I first started, I thought I would label groups as "ORGANIZATIONS", but as I went through more documents, I realized they contained more groups that I considered "TEAMS" (that is, sub-groups of the Organizations). I needed to modify my mental model as I went. The key was that my annotation scheme changed as I labeled more, and my prior beliefs, while helpful, weren't necessarily aligned with what was actually in the data.
That's why I really like Matt and Ines' suggestion to focus on the top level of the hierarchy first, recognizing that after a few hundred annotations your scheme may change. Only after you've built your initial model on the high-level labels should you consider more specific/narrow entities.
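As a small sketch of what "top level first" can look like in practice: if you've already annotated (or drafted) finer-grained labels, you can collapse them to their top-level parents before training the initial model. The hierarchy mapping and example below are hypothetical:

```python
# Sketch: collapse a finer annotation scheme into top-level labels
# before training an initial model. The mapping here is a made-up example.

COARSE_MAP = {
    "TEAM": "ORGANIZATION",        # sub-groups roll up to the parent label
    "DEPARTMENT": "ORGANIZATION",
    "ORGANIZATION": "ORGANIZATION",
}

def to_top_level(example, mapping=COARSE_MAP):
    """Rewrite span labels in a Prodigy-style example dict to their top-level parent."""
    spans = []
    for span in example.get("spans", []):
        label = mapping.get(span["label"])
        if label is not None:  # drop spans that fall outside the hierarchy
            spans.append({**span, "label": label})
    return {**example, "spans": spans}

example = {
    "text": "The Platform team reports to Engineering.",
    "spans": [
        {"start": 4, "end": 17, "label": "TEAM"},
        {"start": 29, "end": 40, "label": "ORGANIZATION"},
    ],
}
print(to_top_level(example)["spans"])
```

Later, once the high-level model is working, you could train on the original fine-grained labels instead of the collapsed ones.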
It's worth watching Matt's 2018 PyData talk, where he discusses why many NLP projects fail -- for example, teams make overly ambitious plans early on and don't test/iterate enough.
He then goes into a great discussion of the "ML hierarchy of needs": first focus on the problem, then develop your annotation guidelines (aka annotation schemes), the written instructions that ensure annotators have clear definitions and examples of what they're annotating. The guidelines can simply live in a Word/Google Doc; the key is to be explicit about how you're defining your entities and to iterate on the document (especially through group discussion).
Matt also walks through a good example of framing a similar entity task and then builds up to great recommendations (around 22:40) on why it's important to "worry less about scaling up (e.g., getting so many labels), and more about scaling down".
We're actively working on content to provide better case studies of how to evolve annotation guidelines to help reduce the risk of project failure. I would encourage reading my colleague Damian Romero's wonderful Tweet thread on Annotation Guidelines 101.
Also, if you want a great example of annotation guidelines, check out this GitHub repo from the Guardian newspaper, who wrote a great post about how they used Prodigy for an NER task to identify quote-related items. They iterated on their guidelines in each round of annotation. You can see their guidelines are detailed even for just three entities, because they found quotes can be very complex. This is all the more reason why attempting many entities your first time will drastically increase your likelihood of project failure.
If you make some progress, be sure to document it and feel free to share any learnings back here! I know the community would greatly appreciate it.