Feature Engineering

Hi, how does Prodigy extract features from the text? Is it possible for me to add custom features (like length, word count, custom embeddings, etc.)?

You can always use a non-spaCy model in Prodigy, which would allow you to customise the features however you like. If you’re using spaCy though, unfortunately we don’t have an easy solution for adding features currently. I’ll be implementing this soon, as it came up a lot over the weekend while we were doing the spaCy trainings.

If you know you need these features, the best solution would be to implement a custom model, where you can add them. If you know length is really important for your domain, the best solution would be to have multiple models, segmented by text length. Custom embeddings can be added quite easily: you can use the spacy init-model command to convert vectors from the text-based format used by word2vec or FastText into a spaCy package, and then use that package in Prodigy's textcat.batch-train recipe.

Hi @honnibal, I have a few questions with regards to your response.

What do you mean by having multiple models segmented by text length? Do you mean grouping the texts by length and building a separate model for each group?

Regarding spacy init-model: does this mean that I'm creating a blank model from scratch? Would I then lose the existing 300-dimensional word embeddings in en_vectors_web_lg?

Instead, I would like to add a custom feature on top of the existing vectors.

Thanks!

That was one idea, yes. It wouldn't necessarily work on all problems, but if short texts behave very differently from long texts in your data, I could see it being a good approach.
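To make the length-segmented idea concrete, here is a minimal sketch in plain Python. The bucket boundaries, the whitespace token count, and the names bucket_for, split_by_length and predict are all assumptions made up for illustration, and the per-bucket "models" are just callables standing in for whatever classifier you train on each group.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical length buckets: (name, inclusive lower bound, exclusive upper bound),
# measured in whitespace-separated tokens. Tune the boundaries to your own data.
BUCKETS = [("short", 0, 20), ("medium", 20, 200), ("long", 200, float("inf"))]


def bucket_for(text: str) -> str:
    """Return the name of the length bucket that a text falls into."""
    n_tokens = len(text.split())
    for name, low, high in BUCKETS:
        if low <= n_tokens < high:
            return name
    raise ValueError("no bucket covers a text of %d tokens" % n_tokens)


def split_by_length(
    examples: List[Tuple[str, str]]
) -> Dict[str, List[Tuple[str, str]]]:
    """Group (text, label) pairs by bucket so each group can train its own model."""
    groups: Dict[str, List[Tuple[str, str]]] = {name: [] for name, _, _ in BUCKETS}
    for text, label in examples:
        groups[bucket_for(text)].append((text, label))
    return groups


def predict(text: str, models: Dict[str, Callable[[str], str]]) -> str:
    """Route a text to the model that was trained on its length bucket."""
    return models[bucket_for(text)](text)
```

At training time you would call split_by_length once and train one classifier per group; at prediction time, predict picks the matching model. Whether this helps depends on how differently short and long texts behave in your data.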

Yes, using spacy init-model would replace the pretrained vectors. I've never tried to add more dimensions to the static vectors for extra features, although I suppose it could work. You can find the underlying numpy array at nlp.vocab.vectors.data. You should be able to extend it with new dimensions, for instance by using numpy.hstack. You would just need to construct an ndarray of your additional features whose rows are aligned with the original vectors. The nlp.vocab.vectors.key2row dictionary should help with that.
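A rough sketch of the row alignment described above, using only numpy. The small array and dictionary here are toy stand-ins for nlp.vocab.vectors.data and nlp.vocab.vectors.key2row, and extend_vectors is a hypothetical helper, not a spaCy function. As noted, this approach is untried, so treat it as a starting point rather than a tested recipe.

```python
import numpy as np


def extend_vectors(data, key2row, extra_features, n_extra):
    """Return a new array with n_extra feature columns appended to each row.

    data:           (n_rows, width) array of existing vectors
    key2row:        dict mapping a key to its row index in data
    extra_features: dict mapping a key to a sequence of n_extra floats
    Rows whose key has no entry in extra_features keep zeros in the new columns.
    """
    extra = np.zeros((data.shape[0], n_extra), dtype=data.dtype)
    for key, row in key2row.items():
        if key in extra_features:
            # If several keys share one row, the last one written wins.
            extra[row] = extra_features[key]
    # hstack requires both arrays to have the same number of rows.
    return np.hstack([data, extra])


# Toy stand-ins: three 2-dimensional vectors and their key-to-row mapping.
data = np.arange(6, dtype="float32").reshape(3, 2)
key2row = {101: 0, 202: 1, 303: 2}
extra_features = {101: [1.0], 303: [0.5]}  # hypothetical per-word feature

extended = extend_vectors(data, key2row, extra_features, n_extra=1)
print(extended.shape)  # (3, 3)
```

With a real pipeline, the keys in key2row are hash values rather than strings, so your extra features would need to be keyed the same way. Any model that consumes the vectors would also have to accept the new width.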