Feature Engineering

yuval · July 3, 2019, 10:55am

Hi, how does Prodigy extract features from the text? Is it possible for me to add custom features (like length, word count, custom embeddings, etc.)?

honnibal · July 8, 2019, 4:44pm

You can always use a non-spaCy model in Prodigy, which would allow you to customise the features however you like. If you’re using spaCy though, unfortunately we don’t have an easy solution for adding features currently. I’ll be implementing this soon, as it came up a lot over the weekend while we were doing the spaCy trainings.

If you know you need these features, the best solution would be to implement a custom model, where you can add them. If you know length is really important for your domain, another option would be to have multiple models, segmented by text length. Custom embeddings can be added quite easily: you can use the spacy init-model command to convert vectors from the text-based format used by word2vec or FastText into a spaCy package, and then use that package in prodigy textcat.batch-train.

jsnleong · July 9, 2019, 1:07am

Hi @honnibal, I have a few questions regarding your response.

What do you mean by having multiple models segmented by text length? Is the idea to filter the texts into "length" groups and build a separate model for each group?

On spacy init-model: does this mean that I'm creating a blank model from scratch? Would I then lose the existing 300-dimensional word embeddings in en_vectors_web_lg?

What I would like to do instead is add a custom feature to the existing vectorization.

Thanks!

honnibal · July 9, 2019, 6:46pm

That was one idea, yes. It wouldn't necessarily work on all problems, but if short texts behave very differently from long texts in your data, I could see it being a good approach.
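To make the segmentation idea concrete, here is a minimal sketch of routing each text to a model for its length group. The `LengthRouter` class, the whitespace token count, and the thresholds are all hypothetical choices for illustration, and the "models" are just callables standing in for whatever classifiers you train per group.

```python
from bisect import bisect_right


class LengthRouter:
    """Send each text to the model trained on texts of similar length.

    Hypothetical helper, not part of Prodigy or spaCy.
    """

    def __init__(self, boundaries, models):
        # boundaries: sorted token-count thresholds, e.g. [20, 100]
        # models: one callable per bucket, so len(boundaries) + 1 of them
        if len(models) != len(boundaries) + 1:
            raise ValueError("need exactly one model per length bucket")
        self.boundaries = list(boundaries)
        self.models = list(models)

    def bucket(self, text):
        # Assumption: whitespace splitting is a good enough length measure.
        n_tokens = len(text.split())
        return bisect_right(self.boundaries, n_tokens)

    def predict(self, text):
        return self.models[self.bucket(text)](text)


# Placeholder models that only report which bucket handled the text.
router = LengthRouter(
    boundaries=[20, 100],
    models=[lambda t: "short", lambda t: "medium", lambda t: "long"],
)
print(router.predict("a very short text"))
```

The same bucketing function would be used to split the annotated examples before training, so that each model only ever sees texts from its own length group.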

Yes, using spacy init-model would replace the pretrained vectors. I've never tried to add more dimensions to the static vectors for extra features, although I suppose it could work. You can find the underlying numpy array at nlp.vocab.vectors.data. You should be able to extend it with new dimensions, for instance by using numpy.hstack. You would just need to construct an ndarray with your additional features that had the rows aligned with the original vectors. The nlp.vocab.vectors.key2row dictionary should help with that.
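A rough sketch of the numpy.hstack approach, using toy stand-ins so it runs without a spaCy model: `data` plays the role of nlp.vocab.vectors.data and `key2row` the role of nlp.vocab.vectors.key2row. The `extend_vectors` helper and the word-length feature are hypothetical, and whether a trained pipeline accepts the wider table afterwards is untested here.

```python
import numpy as np


def extend_vectors(data, key2row, extra_features, n_extra):
    """Return a copy of `data` with `n_extra` feature columns appended.

    data: (n_rows, width) array, stand-in for nlp.vocab.vectors.data
    key2row: dict of key -> row index, like nlp.vocab.vectors.key2row
    extra_features: dict of key -> n_extra floats (hypothetical input)
    """
    # Rows with no entry in extra_features keep zeros in the new columns.
    extra = np.zeros((data.shape[0], n_extra), dtype=data.dtype)
    for key, row in key2row.items():
        if key in extra_features:
            # If several keys share a row, the last one written wins.
            extra[row] = extra_features[key]
    return np.hstack([data, extra])


# Toy table: 4 rows of 3-dimensional vectors, row 1 unused by any key.
data = np.arange(12, dtype="float32").reshape(4, 3)
# In spaCy the keys are hashes, and nlp.vocab.strings can map a hash
# back to its text; plain strings are used here for readability.
key2row = {"apple": 0, "pear": 2, "fig": 3}
extra_features = {key: [len(key)] for key in key2row}

extended = extend_vectors(data, key2row, extra_features, n_extra=1)
print(extended.shape)
```

Iterating over key2row, rather than assuming a row order, is what keeps the new columns aligned with the original vectors.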
