I was looking on the internet for a text classification and annotation tool that can help my classifiers perform better when tagging the text with multiple labels.
My dataset is a CSV with each row representing a sample that needs to be classified according to text that resides in multiple columns. Up until now we did this with Excel (horribly inefficient), and now we want to move forward and use a tool that can present every sample on the screen and enable classifying with multiple labels, plus annotation of the specific words in the text that are indicative of each label we choose.
I hope I explained my objective clearly enough. I am here for support because I tried your demo on named entities (manual), which at first seemed exactly like what I need, but I am not sure if I can use multiple labels that aren't “Accept”, “Reject” etc., but a more verbal kind of label.
I was looking at other tools like Dataturks and Knowtator for reference.
Sure, that should be no problem! Prodigy supports CSV out-of-the-box, and the textcat recipes should be exactly what you need. Here are some more details and examples:
You can also write your own custom recipes and build different types of interfaces – for example, using the "choice" view, which lets you add multiple choice options to your data. See here for an example with code.
The --patterns argument of textcat.teach lets you provide examples of phrases that are likely indicative of a label. This can help you pre-select examples if you're working with a large corpus of text.
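As a rough sketch of what such a patterns file could contain: each line is a JSON object with a "label" and a "pattern", where the pattern is a list of token descriptions. The labels and phrases below are made up for illustration, and splitting phrases on whitespace is a simplifying assumption that may not match the real tokenization for text with punctuation.

```python
import json


def build_patterns(phrases_by_label):
    """Turn {label: [phrase, ...]} into a list of pattern entries."""
    patterns = []
    for label, phrases in phrases_by_label.items():
        for phrase in phrases:
            # One token description per whitespace-separated word,
            # matched on the lowercase form (assumption: simple phrases only).
            tokens = [{"lower": word.lower()} for word in phrase.split()]
            patterns.append({"label": label, "pattern": tokens})
    return patterns


def to_jsonl(entries):
    """Serialize entries as newline-delimited JSON, one object per line."""
    return "\n".join(json.dumps(entry) for entry in entries)


# Hypothetical labels and phrases, only for illustration.
patterns = build_patterns({"SUNNY": ["clear skies"], "HUMID": ["muggy"]})
print(to_jsonl(patterns))
```

The printed lines could be saved to a .jsonl file and passed via --patterns.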
In general, I'd recommend trying to break down the task into smaller pieces wherever possible. So, for example, you usually don't want to do NER and text classification annotations at the same time. Instead, focus on one simple and ideally binary decision at a time.
The Prodigy philosophy is also very experiment-focused (especially during development) – so after an annotation session, it's often very useful to train a model, compare it to previous results and see whether the new data improves the model.
By applying this feature, can I choose to label with multiple choices, or is it only one selected out of many? In your example you choose from “happy”, “sad”, “angry”, “neutral”, but my labels can overlap, like “sunny day”, “very humid”, “cold temperature”, where the weather can be sunny and humid at the same time.
We are actually not using spaCy, and we expect to get the output of every NER and text annotation in the form of JSON. By combining the two, I want to mark the data that helped me reach the decision. For example, in your example for classifying insults, if the word “asshole” is in the text that I classify, and it is the reason I classified the text as an insult, I would like to tag it as indicative but not necessarily use it to influence the model right away.
I hope my explanation is clear enough.
Yes, you can set "choice_style": "multiple" in your recipe config to allow multiple selections. The collected annotations will then contain a list of all accepted option IDs, for example: "accept": ["SUNNY", "HUMID"].
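Here's a minimal sketch of the data side of such a recipe, assuming the usual choice task format where each task carries an "options" list of {"id", "text"} entries. The helper name add_options, the third label and the example text are hypothetical, and a real recipe would also return a dataset name and the stream alongside the view ID and config.

```python
def add_options(stream, labels):
    """Attach the same multiple-choice options to every incoming task."""
    options = [
        {"id": label, "text": label.replace("_", " ").title()}
        for label in labels
    ]
    for task in stream:
        # Build a new dict so the incoming task is not modified in place.
        yield {**task, "options": options}


# Hypothetical labels based on the weather example.
LABELS = ["SUNNY", "HUMID", "COLD"]
tasks = list(add_options([{"text": "Bright sun, but the air felt sticky."}], LABELS))

# The interface-related parts a custom recipe would return (sketch only).
recipe_components = {
    "view_id": "choice",
    "config": {"choice_style": "multiple"},
}
print(tasks[0]["options"])
```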
Yes, that makes sense! If you’re planning on exporting the data and training afterwards, you can focus on using Prodigy as an “annotation workflow builder”, to create different interfaces to collect the exact data you need as efficiently as possible. Once you’re done annotating, you can export a dataset to a JSONL file:
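Since you're not using spaCy, here's a rough sketch of how the exported records could be consumed with plain Python. It assumes one JSON object per line, with the selected option IDs under "accept" and the annotator's decision under "answer"; the sample records are invented for illustration.

```python
import json
from collections import Counter


def parse_jsonl(text):
    """Parse newline-delimited JSON into a list of dicts, skipping blank lines."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


def count_selected(records):
    """Count how often each option ID was selected in accepted records."""
    counts = Counter()
    for record in records:
        if record.get("answer") == "accept":
            counts.update(record.get("accept", []))
    return counts


# Invented sample standing in for the contents of an exported file.
sample = "\n".join([
    json.dumps({"text": "Hot and sticky.", "accept": ["SUNNY", "HUMID"], "answer": "accept"}),
    json.dumps({"text": "Clear but freezing.", "accept": ["SUNNY", "COLD"], "answer": "accept"}),
    json.dumps({"text": "Unreadable row.", "accept": [], "answer": "ignore"}),
])
label_counts = count_selected(parse_jsonl(sample))
print(label_counts)
```

For a real export, you'd read the file contents and pass them to parse_jsonl instead of the sample string.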
Regarding multiple selection in the choice interface: if I have 25 different choices, do you think the UI can handle it?
And for that matter, I think we misunderstood each other. I think that if I can make an annotation recipe that enables me to annotate word entities (like the NER (manual) in your demo) and add multiple choices, that would work.
Basically, what I want is to classify a textual sample as “Insult” or “Praise”, and also annotate the words that specifically helped me, as a human classifier, know how to do it. Then I can use these words later to teach the model. For a word such as “asshole”, which cannot be interpreted as “not insult”, I am 100% sure it is used as an insult.
It can definitely handle it – but if possible, I’d always recommend working with smaller label sets. Annotation becomes more difficult and less efficient if the annotator has to focus on too many different concepts at the same time. Instead, it’s often better to start with a smaller scheme and expand it later, or make several passes over the data, one for each label or concept.
For example, let’s say you’re classifying news articles and you have the categories politics, football, tennis and basketball. You could start by classifying whether the text is about SPORTS or not, and afterwards, take all texts labelled as SPORTS and classify the type of sports they’re about. You’ll have to make more than one pass over the data, but it can significantly reduce human error and it’ll allow you to refine the label scheme as you develop the data (so you won’t have to decide on all the specifics upfront and annotate the whole dataset again if you make a change).
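To sketch the second pass: assuming the first-pass records are binary annotations with a "label" and an "answer" field, you could filter for the accepted SPORTS texts and attach the finer-grained categories as choice options. The function name and the example texts are hypothetical.

```python
def second_pass_stream(first_pass, parent_label, child_labels):
    """Yield choice tasks for texts accepted as parent_label in the first pass."""
    options = [{"id": label, "text": label.title()} for label in child_labels]
    for record in first_pass:
        if record.get("label") == parent_label and record.get("answer") == "accept":
            yield {"text": record["text"], "options": options}


# Invented first-pass annotations for illustration.
first_pass = [
    {"text": "The striker scored twice.", "label": "SPORTS", "answer": "accept"},
    {"text": "Parliament passed the budget.", "label": "SPORTS", "answer": "reject"},
    {"text": "She won the final set 6-4.", "label": "SPORTS", "answer": "accept"},
]
second_pass = list(
    second_pass_stream(first_pass, "SPORTS", ["FOOTBALL", "TENNIS", "BASKETBALL"])
)
print([task["text"] for task in second_pass])
```

Only the two accepted texts make it into the second pass, so the annotator never sees the politics article again.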
Thanks, I think I understand! You should definitely do this in two steps, though: first, label whether a text is an insult, and then take all insult texts and highlight the trigger words using the manual interface.
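Once the trigger words are highlighted, pulling them out of the exported records could look roughly like this. The sketch assumes each record has a "text" plus a "spans" list with character offsets under "start" and "end"; the TRIGGER label and the example sentences are made up.

```python
from collections import Counter


def trigger_words(record):
    """Return the highlighted substrings of one annotated record."""
    text = record["text"]
    return [text[span["start"]:span["end"]] for span in record.get("spans", [])]


def trigger_counts(records):
    """Count lowercased highlighted words across all accepted records."""
    counts = Counter()
    for record in records:
        if record.get("answer") == "accept":
            counts.update(word.lower() for word in trigger_words(record))
    return counts


# Invented annotations; offsets are character positions into "text".
annotated = [
    {"text": "You are such an idiot, honestly.",
     "spans": [{"start": 16, "end": 21, "label": "TRIGGER"}],
     "answer": "accept"},
    {"text": "What an Idiot.",
     "spans": [{"start": 8, "end": 13, "label": "TRIGGER"}],
     "answer": "accept"},
]
print(trigger_counts(annotated))
```

The resulting word list stays separate from the classification labels, so you can decide later whether and how to feed it to your model.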