Hello,
I am using spaCy, and my dataset is a list of emails. As part of pre-processing, I cleaned up the data by removing stop words, disclaimers, and greetings. Each email belongs to a category, and in total I have about 30 categories.
Now my question is: how can I add my list of categories to the pipeline?
Would it be like this?
> text_cat = nlp.create_pipe("textcat", config={"exclusive_classes": True, "architecture": "simple_cnn"})
> nlp.add_pipe(text_cat, last=True)
>
> text_cat.add_label("Cat1")
> text_cat.add_label("Cat2")
> .
> .
> .
And for the load_data function, what would my cats be? I don't know what format the cats should be in.
In the spaCy docs/examples:
def load_data(limit=0, split=0.8):
    """Load data from the IMDB dataset."""
    # Partition off part of the train data for evaluation
    train_data, _ = thinc.extra.datasets.imdb()
    random.shuffle(train_data)
    train_data = train_data[-limit:]
    texts, labels = zip(*train_data)
    cats = [{"POSITIVE": bool(y), "NEGATIVE": not bool(y)} for y in labels]
    split = int(len(train_data) * split)
    return (texts[:split], cats[:split]), (texts[split:], cats[split:])
In my case, my categories are strings, so I don't know what the right value for cats would be in this line:
cats = [{"POSITIVE": bool(y), "NEGATIVE": not bool(y)} for y in labels]
I tried adding my list of categories to cats in the load_data function, but when the code reaches the training step, at this line:
nlp.update(texts, annotations, sgd=optimizer, drop=0.2, losses=losses)
I get this error:
ValueError: could not convert string to float: 'Cat1'
I could only find examples that use POSITIVE and NEGATIVE, with 0 and 1 as the categories.
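For reference, the IMDB line above builds one dict per text in which every label is a key and the value is a boolean. The error suggests the category strings were passed as values instead of keys. Here is a minimal sketch of how that pattern might extend to string categories. The names CATEGORIES, labels, and make_cats are hypothetical, and the three-label list only stands in for the real 30.

```python
# Hypothetical label set and per-email category strings (illustrative only).
CATEGORIES = ["Cat1", "Cat2", "Cat3"]
labels = ["Cat2", "Cat1", "Cat3"]


def make_cats(label, categories=CATEGORIES):
    """Build one cats dict: every category is a key, only the matching one is True."""
    if label not in categories:
        raise ValueError(f"Unknown category: {label!r}")
    return {cat: cat == label for cat in categories}


# Same shape as the IMDB example: one dict per text, with boolean values.
cats = [make_cats(y) for y in labels]
print(cats[0])
```

With exclusive_classes set to True, each dict here has exactly one True value. The same category names would also need to be registered with text_cat.add_label so that the keys match the labels in the pipeline.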