You can try to create a hierarchical label structure for either `textcat` or `ner`. For `textcat` your labels would be categories, while for `ner` your labels would be entity types.
Choosing `textcat` vs. `ner` depends on what problem you're trying to solve. If you're using Prodigy for the first time and want to do a pilot (a model you may not end up using, but that helps you learn the Prodigy workflow), I would recommend starting with `textcat` and labelling the first level of your hierarchy. Try to classify documents into a few (say, 3-6) useful categories. You can choose whether documents can have only one category (mutually exclusive) or multiple categories (multi-label).
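As a sketch, you could start that first annotation pass with the built-in `textcat.manual` recipe. The dataset name, input file, and label names below are placeholders; drop the `--exclusive` flag if documents can belong to multiple categories:

```bash
# Label the top level of the hierarchy with mutually exclusive categories.
# "pilot_textcat", news.jsonl, and the label names are hypothetical.
prodigy textcat.manual pilot_textcat ./news.jsonl \
  --label BILLING,SHIPPING,PRODUCT,OTHER \
  --exclusive
```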
Label a few hundred examples, train a model, and you can learn a lot about your data in a day or so. Along the way, make small observations (e.g., written down on paper) about the entity types you're interested in for `ner`. Hopefully, after this first round you'll have a better mental model of appropriate entity types for `ner`.
Prodigy's documentation has a good FAQ entry on the pros/cons of each:
> **I'm not sure if NER is a good fit or if I should train a text classifier?**
>
> Named entity recognition models work best at detecting relatively short phrases that have fairly distinct start and end points. A good way to think about how easy the model will find the task is to imagine you had to look at only the first word of the entity, with no context. How accurately would you be able to tell how that word should be labelled? Now imagine you had one word of context on either side, and ask yourself the same question.
>
> With spaCy’s current defaults (as of v2.2), the model gets to see four words on either side of each token (it uses a convolutional neural network with four layers). You don’t have to use spaCy, and even if you do, you can reconfigure the model so that it has a wider contextual window. However, if you find you’re working on a task that requires distant information to make the decisions, you should consider learning at least part of the information you need with a text classification approach.
>
> Entity recognition models will especially struggle on problems where the annotators disagree about the exact end points of the phrases. This is a common problem if your category is somewhat vague, like “cause of problem” or “complaint”. These meanings can be expressed in a variety of ways, and you can’t always pin down the part of a sentence that expresses a particular opinion. If you find that annotators can’t agree on exactly which words should be tagged as an entity, that’s probably a sign that you’re trying to mark something that’s a bit too semantic, in which case text classification would be a better approach.
There are also several other threads on this forum that ask related questions.
Also, if in doubt, you may be tempted to ask: why not create a custom recipe that does both `ner` and `textcat` at the same time, since that would save a lot of time? While it is possible to do this using `blocks`, we warn against it in the Combining Interfaces with Blocks docs:
> It’s recommended to only use the `blocks` interface for annotation tasks that absolutely require the information to be collected at the same time – for instance, comments or answers about the current annotation decision. While it may be tempting to create one big interface that covers all of your labelling needs like text classification and NER at the same time, this can often lead to worse results and data quality, since it makes it harder for annotators to focus. It also makes it more difficult to iterate and make changes to the label scheme of one of the components. You can always merge annotations of different types and create a single corpus later on, for instance using the `data-to-spacy` recipe.
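For that last point, here's a sketch of the merge step, assuming you've collected `ner` and `textcat` annotations in two separate datasets (the names below are placeholders). Note that depending on your Prodigy version, the output argument is a file (v1.10) or a directory (v1.11+):

```bash
# Merge NER and textcat annotations into a single training corpus for spaCy.
# "pilot_ner" and "pilot_textcat" are hypothetical dataset names.
prodigy data-to-spacy ./corpus --ner pilot_ner --textcat pilot_textcat
```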