Custom ner.manual with different label sets

Is it possible to display different sets of labels based on a meta value in the stream?

I have a number of paragraphs that I would like to annotate. Each paragraph belongs to one of 8 different classes. Is it possible to display a different set of labels based on the class to which the paragraph belongs? Or would I need to create 8 different recipes?

Also, if the above is possible, is there a way to save the answers into 8 different datasets or does it always have to save to the same one?

Thanks

I think in your case, it'd probably be better and more efficient to start multiple sessions. Each session can have its own source file, label set and dataset that the annotations are saved to. For example:

prodigy ner.manual dataset1 en_core_web_sm data1.jsonl --label A,B,C
prodigy ner.manual dataset2 en_core_web_sm data2.jsonl --label D,E,F
# and so on...

This will likely also make the annotation process more pleasant – if both the text and the label scheme change on each example, your brain has to refocus constantly, and might introduce more human errors, too.
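
If all paragraphs currently live in one file, something like the following could produce the per-class source files. This is only a rough sketch: it assumes each task is a JSON line whose class is stored under `meta["class"]`, and the function name `split_by_class` and the output file naming are made up for illustration.

```python
import json
from collections import defaultdict
from pathlib import Path


def split_by_class(source_path, out_dir, class_key="class"):
    """Write one JSONL file per class and return {class_name: output_path}."""
    groups = defaultdict(list)
    with open(source_path, encoding="utf-8") as f:
        for line in f:
            if not line.strip():
                continue  # skip blank lines
            task = json.loads(line)
            # Assumption: the class is stored under task["meta"][class_key].
            name = task.get("meta", {}).get(class_key, "unknown")
            groups[name].append(task)

    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    paths = {}
    for name, tasks in groups.items():
        path = out_dir / f"{name}.jsonl"
        with open(path, "w", encoding="utf-8") as f:
            for task in tasks:
                f.write(json.dumps(task) + "\n")
        paths[name] = path
    return paths
```

Each resulting file can then be passed to its own `ner.manual` session with the label set for that class.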

Thank you for your reply.

Is there any way I can start all of them at the same time? One person is writing the recipes and some other team members are doing labeling so it would be useful if it could all be started at the same time, or combined in one recipe.

Thanks!

Yes, if you want to do it all on one machine, you could just spin up each session on a different port, by setting the PRODIGY_PORT environment variable. For example:

PRODIGY_PORT=8080 prodigy ner.manual dataset1 en_core_web_sm data1.jsonl --label A,B,C
PRODIGY_PORT=1234 prodigy ner.manual dataset2 en_core_web_sm data2.jsonl --label D,E,F

The first Prodigy server will then be started on localhost:8080, and the second session on localhost:1234. If you’re running it yourself in your terminal, you’ll either need a new terminal session per command (e.g. a new window or tab), or use something like tmux. You can also make the whole thing more elegant by wrapping it in a shell script so you only have to run one command to start everything.
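
A wrapper along those lines might look like the sketch below, written in Python rather than shell. It reuses the two example commands and the `PRODIGY_PORT` variable from above; the `SESSIONS` table and the helper names `build_command` and `start_all` are hypothetical, and it assumes `prodigy` is on your PATH.

```python
import os
import subprocess

# Hypothetical session table: dataset name, source file, labels, port.
SESSIONS = [
    ("dataset1", "data1.jsonl", ["A", "B", "C"], 8080),
    ("dataset2", "data2.jsonl", ["D", "E", "F"], 1234),
]


def build_command(dataset, source, labels, model="en_core_web_sm"):
    """Return the ner.manual command for one session as an argument list."""
    return ["prodigy", "ner.manual", dataset, model, source,
            "--label", ",".join(labels)]


def start_all(sessions):
    """Start one Prodigy server per session, each on its own port."""
    procs = []
    for dataset, source, labels, port in sessions:
        # Copy the current environment and set the port for this session only.
        env = dict(os.environ, PRODIGY_PORT=str(port))
        cmd = build_command(dataset, source, labels)
        procs.append(subprocess.Popen(cmd, env=env))
    return procs
```

Calling `start_all(SESSIONS)` and then `wait()` on each returned process would keep the launcher running until the servers exit. Adding a session is then just one more row in the table.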

Thank you for your reply.

I’m still wondering if there is any way it can all be combined into one recipe, e.g. with the labels generated based on the properties of each task in the stream?

No, by design, Prodigy expects you to define one label set per annotation session.

The label scheme is one of the most important parts of an NLP application and model, so allowing too much arbitrary variance during annotation can easily lead to many other problems down the line. Similarly, changing the annotation objective completely on a per-task basis really goes against Prodigy's UX philosophy, and I can't think of many cases where this would be beneficial compared to an approach that uses multiple, dedicated sessions per label scheme.

That said, I understand that there are always exceptions and special use cases where you might want to do things differently. I still think doing 8 concurrent sessions makes the most sense here – especially since you do want to save the annotations to different datasets as well. Running concurrent sessions is no problem out-of-the-box and can be easily automated.

Do you have an example of one of those classes and the corresponding labels? (If you can't share the exact details, maybe you can come up with a similar-ish example?)

Hi, I have a similar use case, the difference being that I have about 2000 text snippets that could contain entities from 1000 classes, but I know the subset of labels (<=10 labels) that is possible for each snippet. Spinning up a separate session for each of them is not feasible. In this case, could you please suggest good practices with Prodigy?

hi @Nikhil-Kasukurthi!

Thanks for your question and welcome to the Prodigy community :wave:.

I'm not sure I fully understand your terminology.

  • What's the difference between classes and labels?
  • How do you know the subset of labels? Does it come in a flat file?
  • Does each document (aka "snippet") have its own set of entity types?
  • Are there any groups of documents that have the same entity types?
  • Can you give an example?

Unfortunately, as Ines mentioned, Prodigy wasn't designed to switch between label sets within a single annotation session. My first recommendation would be to partition your data into groups of documents that share the same entity types (e.g., X docs with the same entities) and run one session per group.
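
To get a feel for how many groups (and therefore sessions) that partitioning would produce, a quick sketch like this could help. It assumes each task carries its candidate labels in a `labels` field, which is a guess about your data format, and `group_by_label_set` is just an illustrative helper name.

```python
from collections import defaultdict


def group_by_label_set(tasks, labels_key="labels"):
    """Group tasks that share exactly the same set of candidate labels."""
    groups = defaultdict(list)
    for task in tasks:
        # Sort so that ["B", "A"] and ["A", "B"] end up in the same group.
        key = tuple(sorted(task[labels_key]))
        groups[key].append(task)
    return dict(groups)
```

Each key is a label set you could pass to `--label`, and the number of keys tells you how many sessions you would need. If that number is still close to the number of documents, partitioning alone won't solve the problem.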

But let's step back for a minute. Even if you were able to label 1000 entity types -- what's your expectation for model training? Do you expect a model to train 1000 entity types with only about 2000 documents?

At minimum we'd expect a model to need a few hundred instances (say 500) of each entity type. If we assume each document contains 2 entity types, 2000 documents would give roughly 4000 instances spread over 1000 entity types, or something like 4 examples per entity type, which is nowhere near sufficient.
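
Here is that back-of-the-envelope estimate spelled out. The numbers are the ones from this thread; the even spread across entity types and one instance per entity type per document are simplifying assumptions, and the 500 figure is only a rough rule of thumb.

```python
def instances_per_type(n_docs, types_per_doc, n_types):
    """Average number of examples per entity type, assuming an even spread."""
    return n_docs * types_per_doc / n_types


available = instances_per_type(n_docs=2000, types_per_doc=2, n_types=1000)
needed = 500  # rough rule of thumb, not a hard threshold

print(available)           # 4.0
print(needed / available)  # 125.0, i.e. about 125x more data would be needed
```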

I may be missing something, so please correct me on your use case and we can try to figure out an alternative.