Hello! I am working on text classification with a hierarchical schema. Now I am trying to pass the texts classified at the top level on to the second level with a custom recipe. I am following this post,
The problems that I have now are:
I need to use textcat.manual, with options to select for the second-level classification
No labels from the top level are passed, and no options are shown for the second level
I am not sure what I have missed in my scripts. Thank you so much for your thoughts!
hierarchy = {'RESEARCH': ['find a book/article', 'Citations', 'Databases/electronic resources', 'ILL/EZborrow', 'Data'], 'NON-RESEARCH': ['Room reserve/Building spaces/directions', 'Printer/Scanner/Technical']}

def get_stream(examples):
    for eg in examples:  # the examples with top-level categories
        top_labels = eg['accept']  # ['A'] or ['B', 'C'] if multiple choice
        for label in top_labels:
            sub_labels = hierarchy[label]
            options = [{'id': opt, 'text': opt} for opt in sub_labels]
            # create new example with text and sub labels as options
            new_eg = {'text': eg['text'], 'options': options}
            yield new_eg
import prodigy
from prodigy.components.db import connect
from prodigy.components.loaders import JSONL

@prodigy.recipe(
    "custom_textcat_2ndlevel",
    dataset=("Dataset loader TEXTCAT annotations from", "positional", None, str),
    source=("The source data as a JSONL file", "positional", None, str),
)
def custom_recipe(dataset, source):
    stream = JSONL(source)
    stream = get_stream(stream)
    return {
        'view_id': 'classification',  # Annotation interface to use
        'dataset': dataset,  # Name of dataset to save annotations
        'stream': stream,  # Incoming stream of examples
    }
I get the text without any labels from either the top or the second level.
My Prodigy command is
Hello @jiebei,
Thank you for your question!
The problem you describe results from choosing the wrong view_id. The classification interface is a binary interface: Prodigy renders a single label at the top of the task, and the annotator decides whether that label applies to the text or image shown. To get the options displayed, you need the choice interface instead. The dictionary your recipe returns should therefore look like this:
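A minimal sketch of that dictionary, assuming the rest of the recipe stays as posted. `build_components` is a hypothetical helper used here only so the snippet runs without Prodigy installed; in the actual recipe the same dictionary would be returned directly from `custom_recipe`. The `choice_style` setting is optional, and using "single" here is an assumption that each text gets exactly one second-level label.

```python
def build_components(dataset, stream):
    """Return recipe components that use the multiple-choice interface."""
    return {
        "view_id": "choice",  # renders each task's "options" as selectable choices
        "dataset": dataset,   # name of the dataset to save annotations to
        "stream": stream,     # incoming stream of examples with "options"
        # Assumption: one sub-label per text; use "multiple" to allow several.
        "config": {"choice_style": "single"},
    }


components = build_components("my_dataset", iter([]))
```

The only change that matters for the problem described is `view_id`: with "choice", the `options` list that `get_stream` attaches to each example is what gets displayed.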
I do have a follow-up question. After I complete textcat.manual and want to move on to textcat.correct with the same hierarchical schema, I am not sure how to modify the code to achieve this. Could you share your thoughts on that? Do you happen to have any past Q&A posts that address a similar issue? Thanks again!
Have you thought about how/when you will combine your different textcat components into one spaCy pipeline (model)?
This is important because a correct recipe needs an existing model. Was your expectation that you would have two different correct recipes -- one for level one and another for level two?
This is likely the easiest approach, and it is what I would suggest. You would continue annotating and training each level separately until you're done, and then assemble the two textcat models into one pipeline.
correct recipes: one for each level
To create a correct recipe for each level, check out the Prodigy recipe repo, specifically the code for the textcat_correct recipe.
You can add the get_stream function you currently have, which adds the options for the appropriate labels.
The add_suggestions function provides your existing model's predictions. I think you may not need to modify it. You can keep or remove the additional update function, which is for "incremental learning".
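As a rough illustration of how the two pieces could fit together for the second level, here is a sketch in plain Python. It is not the code from the recipe repo: `score_fn` is a hypothetical stand-in for the trained second-level model (it returns a label-to-score dict, like doc.cats), the shortened `HIERARCHY` is only for the example, and the 0.5 threshold is an assumption.

```python
HIERARCHY = {
    "RESEARCH": ["Citations", "Data"],
    "NON-RESEARCH": ["Printer/Scanner/Technical"],
}


def add_suggestions(examples, score_fn, threshold=0.5):
    """Attach second-level options and pre-select the ones the model favors."""
    for eg in examples:
        scores = score_fn(eg["text"])  # hypothetical: {label: score}
        for top_label in eg["accept"]:  # top-level labels chosen earlier
            sub_labels = HIERARCHY[top_label]
            yield {
                "text": eg["text"],
                "meta": {"top_label": top_label},  # keep the top-level label visible
                "options": [{"id": s, "text": s} for s in sub_labels],
                # pre-select sub-labels whose score reaches the threshold
                "accept": [s for s in sub_labels if scores.get(s, 0.0) >= threshold],
            }


def fake_scores(text):
    # Stand-in scores so the example runs without a trained model.
    return {"Citations": 0.9, "Data": 0.2, "Printer/Scanner/Technical": 0.1}


top_level = [{"text": "How do I cite a website?", "accept": ["RESEARCH"]}]
tasks = list(add_suggestions(top_level, fake_scores))
```

In a real recipe, `score_fn` would be replaced by running the second-level spaCy model over the text and reading doc.cats.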
Combining two textcat models into one spaCy pipeline
This is a bit tricky as mentioned here:
The key is to train (e.g., with prodigy train) two different textcat models independently. This means you will have two different models, each in its own folder. Let's call one textcat_1 and the other textcat_2. Each of those folders will have two sub-folders: model-best and model-last.
Once you're happy with these two models' performance and want to combine them, you'll need to follow these instructions to source and assemble the two models using one new config.cfg.
Following those instructions, and starting in the folder that contains the two model folders textcat_1 and textcat_2, save this file as your combined model's config file:
This is like the default config.cfg output for textcat when using prodigy train. However, it includes two textcat components: textcat1 and textcat2.
The source setting specifies where each original component is loaded from. In this example, I take the model-best from each of the two models. You can change this to model-last if you prefer the most recent model run instead of the best one.
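For orientation, the component-sourcing part of such a config might look like the fragment below. This is a sketch of the relevant sections only, not the full file. It assumes that the pipeline language is English and that the component inside each trained model is named "textcat" (it could be "textcat_multilabel", depending on how the models were trained).

```ini
[nlp]
lang = "en"
pipeline = ["textcat1","textcat2"]

[components]

[components.textcat1]
# load the top-level classifier from the first trained model
source = "textcat_1/model-best"
component = "textcat"

[components.textcat2]
# load the second-level classifier from the second trained model
source = "textcat_2/model-best"
component = "textcat"
```

The `component` key names the component inside the sourced pipeline, which is what lets both be loaded under different names (textcat1 and textcat2) in the combined pipeline.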
With this new file, you'll then need to run spacy assemble using that config.cfg file:
This will create a new folder called combined_model. You can then load and run that model as you would any normal spaCy model:
import spacy
nlp = spacy.load("combined_model")
doc = nlp("This is a test sentence to score.")
doc.cats
{'LABEL1': 0.16814707219600677, 'LABEL2': 0.7714870572090149, 'LABEL3': 0.06036587804555893, 'LABEL1_A': 0.028656000271439552, 'LABEL1_B': 0.10963789373636246, 'LABEL2_A': 0.6855509877204895, 'LABEL2_B': 0.049830008298158646, 'LABEL3_A': 0.08758322149515152, 'LABEL3_B': 0.0387418232858181}
Sorry for getting back late, and thanks for the detailed instructions! A quick follow-up question: I am curious how the hierarchy is realized in this customized script. Should I place my top-level textcat model as textcat1 and my second-level model as textcat2 to achieve that? Thank you!
In this case I don't think it matters as they are independent of one another. Order does matter in spaCy pipelines if one component is used as an input in another. Here's a good FAQ in spaCy that explains it.
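Because the two components score independently, doc.cats holds the top-level and second-level scores side by side, and nothing forces the best second-level label to belong to the best top-level label. One possible way to impose the hierarchy after prediction is sketched below. This is an illustration, not something the combined pipeline does for you: `resolve_hierarchy` is a hypothetical helper, the parent/child mapping is inferred from the label names in the example output above, and the scores are rounded from that output.

```python
def resolve_hierarchy(cats, hierarchy):
    """Pick the best top-level label, then the best sub-label under it."""
    top = max(hierarchy, key=lambda label: cats.get(label, 0.0))
    children = hierarchy[top]
    sub = max(children, key=lambda label: cats.get(label, 0.0)) if children else None
    return top, sub


# Assumed mapping: LABEL1_A and LABEL1_B sit under LABEL1, and so on.
hierarchy = {
    "LABEL1": ["LABEL1_A", "LABEL1_B"],
    "LABEL2": ["LABEL2_A", "LABEL2_B"],
    "LABEL3": ["LABEL3_A", "LABEL3_B"],
}

# Scores rounded from the doc.cats output shown above.
cats = {
    "LABEL1": 0.17, "LABEL2": 0.77, "LABEL3": 0.06,
    "LABEL1_A": 0.03, "LABEL1_B": 0.11,
    "LABEL2_A": 0.69, "LABEL2_B": 0.05,
    "LABEL3_A": 0.09, "LABEL3_B": 0.04,
}

result = resolve_hierarchy(cats, hierarchy)  # ("LABEL2", "LABEL2_A")
```

Restricting the second-level choice to the children of the chosen top-level label keeps the final pair consistent with the schema even when a sub-label under a different parent scores higher.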