You've sort of hit an edge-case here that we didn't expect. You can provide the labels as either a text file, or on the command line. To figure out which is which, we check whether the file exists. Normally this works fine, but in your case, the is-a-file check fails because the string is too long.
We can fix this in Prodigy, but you can work around it in the meantime by putting the labels into a text file (one label per line), and then pointing Prodigy at that.
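To illustrate the workaround, here's a minimal sketch. The label names and filenames are hypothetical; the point is just that the labels go into a plain text file, one per line, which you then pass to the recipe's `--label` argument instead of a long comma-separated string.

```python
# Hypothetical label set -- substitute your own labels.
labels = ["WHAT", "WHY", "HOW", "WHEN"]

# Write one label per line so Prodigy can read them from a file
# instead of parsing a very long command-line string.
with open("labels.txt", "w", encoding="utf8") as f:
    f.write("\n".join(labels) + "\n")
```

You'd then point the recipe at the file, e.g. something like `prodigy ner.manual my_dataset blank:en data.jsonl --label labels.txt` (dataset and input names here are placeholders).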
Taking a step back though, I think you might want to consider annotating with far fewer labels initially. Annotating with a lot of labels will give the annotators a much more difficult experience, because they have to select the label from a long list, and remember a lot of detailed definitions. There are a few solutions:
1. You could have one generic category for the span-annotation phase, and divide the extracted texts into types later.
2. You could have one annotation task per entity type, and merge the annotations later.
3. You could make the label scheme hierarchical, and do the top-level categories first.
4. You could make the annotation sentence-based, rather than span-based.
I think option 3 is one you should especially consider. You could at least have a sentence level task where you annotate whether the sentence contains any entities. If your data has a low density of entities, this will allow you to move through the data much quicker. You could then annotate only the "has at least one entity" sentences in a subsequent pass, probably working on one entity-type at a time.
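The sentence-level pre-filter described above could be sketched like this, using spaCy's rule-based sentencizer so no trained model is needed (the task format and texts here are simplified placeholders):

```python
# Sketch: stream one task per sentence into a quick yes/no
# "does this sentence contain an entity?" pass, instead of
# showing annotators whole documents.
import spacy

nlp = spacy.blank("en")          # blank pipeline, no model download
nlp.add_pipe("sentencizer")      # rule-based sentence splitting

def sentence_stream(texts):
    # Yield one Prodigy-style task dict per sentence.
    for doc in nlp.pipe(texts):
        for sent in doc.sents:
            yield {"text": sent.text}

tasks = list(sentence_stream(["First sentence. Second one here."]))
```

Sentences accepted in this pass could then be re-annotated span-by-span, one entity type at a time.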
Amazing, thanks! I wasn't aware we could use a text file for the labels, my bad!
I 100% agree re the number of entities: we have way too many, and I know that some will not get annotated as no data will exist for them... I'll be able to prove this to "management" within a few days thanks to your tool!
For the time being we have gone down the hierarchical route, so: what, why, how, and when. Then I think we will look at the sentence-based approach, as that sounds perfect for when we go into prod with this!
Great question! I don't have a perfect solution but let me provide a few additional resources that help provide more context on Matt's points above.
First, this chain from Ines and Sofie has some great words of wisdom:
Also, you may find some additional inspiration in the textcat documentation on "Dealing with very large label sets or hierarchical labels". This provides context on the philosophy of breaking the task up into the smallest parts possible:
If you’re working on a task that involves more than 10 or 20 labels, it’s often better to break the annotation task up a bit more, so that annotators don’t have to remember the whole annotation scheme. Remembering and applying a complicated annotation scheme can slow annotation down a lot, and lead to much less reliable annotations. Because Prodigy is programmable, you don’t have to approach the annotations the same way you want your models to work. You can break up the work so that it’s easy to perform reliably, and then merge everything back later when it’s time to train your models.
If this is your first time training an NER model, it's important to know that your entities will very likely change as you label, i.e., you'll modify your annotation scheme along the way.
Let's take an example from the first time I tried to use Prodigy NER, labeling a company's internal policies to identify different internal teams/organizations. When I first started, I thought I would label groups as "ORGANIZATIONS", but as I went through more documents, I realized that they contained more groups that I considered "TEAMS" (that is, sub-groups of the organizations). I needed to modify my mental model as I went. The key was that my annotation scheme changed as I labeled more, and my prior beliefs, while helpful, weren't necessarily aligned with what was actually in the data.
That's why I really like Matt/Ines' suggestion to focus on the top level of the hierarchy first, recognizing that after a few hundred annotations your scheme may change. Then, after you build your initial model on the high-level categories, only then consider more specific/narrow entities.
It's worth watching Matt's 2018 PyData talk where he discusses why many NLP projects fail -- for example, they may have overly ambitious plans early on and don't test/iterate enough.
He then goes into a great discussion of the "ML hierarchy of needs": first focus on the problem, then understand the annotation guidelines (aka annotation schemes). These are written instructions, important for NLP projects, that ensure annotators have clear definitions and examples of what they are annotating. The guidelines could simply be a Word/Google Doc; the key is to be explicit about how you're defining your entities and to iterate on the document (especially through group discussion).
Matt also works through a good example of framing a similar entity task and then builds up to a great recommendation (around 22:40) on why it's important to "worry less about scaling up (e.g., getting so many labels), and more about scaling down".
We're actively working on content to provide better case studies of how to evolve annotation guidelines to help reduce the risk of project failure. I would encourage reading my colleague Damian Romero's wonderful Tweet thread on Annotation Guidelines 101.
Also, if you want a great example of annotation guidelines, check out this GitHub repo from the Guardian newspaper, who wrote a great post about how they used Prodigy for an NER task to identify quote-related items. They iterated on their guidelines in each round of annotation to refine their list. You can see the guidelines are detailed even for just three entities, because they found quotes can be very complex. This is all the more reason why trying to handle many entities on your first attempt will drastically increase your likelihood of project failure.
If you make some progress, be sure to document it, and feel free to share any learnings back here! I know the community would greatly appreciate it.
Thank you so much, Ryan! I guess a related question: should we use the text classification recipe instead of ner if we are going to implement a hierarchical label structure?
You can create a hierarchical label structure for either textcat or ner. For textcat your labels would be categories, while for ner your labels would be entity types.
Choosing textcat vs. ner depends on what problem you're trying to solve. If you're starting out in Prodigy for the first time and want to do a pilot (a model you may not use, but from which you'll learn the Prodigy workflow), I would recommend starting with textcat and labeling the first level of your hierarchy. Try to classify documents into a few (say 3-6) useful categories. You can choose whether documents can have only one category (mutually exclusive) or multiple categories (multi-label).
Label a few hundred examples, train a model, and learn a lot about your data in a day or so. Along the way, make small observations (e.g., writing down on paper) to think about the entity types you're interested in for ner. Hopefully after this first round, you may have a better mental model of appropriate entity types for ner.
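To make the two-pass idea concrete, here's a tiny sketch of a hierarchical scheme. The top level echoes the "what/why/how/when" categories mentioned earlier in the thread; the sub-labels are invented placeholders, not a recommendation:

```python
# Hypothetical two-level scheme: annotate only the four top-level
# labels in the first pass, then sub-labels in later passes.
scheme = {
    "WHAT": ["WHAT_PRODUCT", "WHAT_PROCESS"],
    "WHY": ["WHY_CAUSE", "WHY_GOAL"],
    "HOW": ["HOW_METHOD"],
    "WHEN": ["WHEN_DATE", "WHEN_DURATION"],
}

# First pass: just the top-level categories.
top_level = list(scheme)

# Later pass, one top-level category at a time.
def sub_labels(top):
    return scheme[top]
```

Keeping each pass small means annotators only need to hold a handful of definitions in their head at once.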
I'm not sure if NER is a good fit or if I should train a text classifier?
Named entity recognition models work best at detecting relatively short phrases that have fairly distinct start and end points. A good way to think about how easy the model will find the task is to imagine you had to look at only the first word of the entity, with no context. How accurately would you be able to tell how that word should be labelled? Now imagine you had one word of context on either side, and ask yourself the same question.
With spaCy’s current defaults (as of v2.2), the model gets to see four words on either side of each token (it uses a convolutional neural network with four layers). You don’t have to use spaCy, and even if you do, you can reconfigure the model so that it has a wider contextual window. However, if you find you’re working on a task that requires distant information to make the decisions, you should consider learning at least part of the information you need with a text classification approach.
Entity recognition models will especially struggle on problems where the annotators disagree about the exact end points of the phrases. This is a common problem if your category is somewhat vague, like “cause of problem” or “complaint”. These meanings can be expressed in a variety of ways, and you can’t always pin down the part of a sentence that expresses a particular opinion. If you find that annotators can’t agree on exactly which words should be tagged as an entity, that’s probably a sign that you’re trying to mark something that’s a bit too semantic, in which case text classification would be a better approach.
There are also several support issues that ask related questions:
Also, if in doubt, you may be tempted to ask: why not create a custom recipe that does both ner and textcat at the same time, since that would save a lot of time?
It’s recommended to only use the blocks interface for annotation tasks that absolutely require the information to be collected at the same time – for instance, comments or answers about the current annotation decision. While it may be tempting to create one big interface that covers all of your labelling needs like text classification and NER at the same time, this can often lead to worse results and data quality, since it makes it harder for annotators to focus. It also makes it more difficult to iterate and make changes to the label scheme of one of the components. You can always merge annotations of different types and create a single corpus later on, for instance using the data-to-spacy recipe.
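To show what "merge annotations of different types later" means in practice, here's a minimal sketch that joins separately collected NER spans and textcat labels on their shared text. Prodigy's data-to-spacy recipe does this properly for real datasets; the example records below are made up:

```python
# Annotations from two separate passes over the same texts.
ner_anns = [
    {"text": "Acme hired Bob in March.",
     "spans": [{"start": 0, "end": 4, "label": "ORG"}]},
]
textcat_anns = [
    {"text": "Acme hired Bob in March.", "label": "HIRING"},
]

# Join on the text so each example ends up with both kinds of labels.
merged = {}
for eg in ner_anns:
    merged.setdefault(eg["text"], {})["spans"] = eg["spans"]
for eg in textcat_anns:
    merged.setdefault(eg["text"], {})["cats"] = {eg["label"]: 1.0}
```

Because the passes stay separate until this final merge, you can revise the label scheme of one component without re-annotating the other.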