Textcat with Floret Models?

I have been experimenting with Floret and I really appreciate how fast the vectorization runs and how it handles spelling variation.

Ideally, I would use these word vectors in two ways:

  1. Classifying sentences with a set of labels.

  2. Loading the sentence vectors into a vector database (pgvector or Weaviate) to run KNN queries.

I would like to use Prodigy to create a text classification model built on these Floret word vectors to do the labeling. Is this possible?

The train command in Prodigy wraps the train command in spaCy and adds some helpful settings. In particular, the --base-model setting lets you point to a spaCy pipeline on disk, which can be one that contains the Floret vectors.

You can create a local pipeline with Floret vectors using the init vectors CLI command in spaCy, passing --mode floret. Once that pipeline exists locally, you should be able to point to it from Prodigy train.
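As a rough sketch of those two steps, the script below only assembles and prints the commands so you can check them before running anything. The file names, the dataset name and the `en` language code are placeholders I made up, and `--textcat` assumes mutually exclusive labels (Prodigy also has `--textcat-multilabel` if yours are not).

```python
import shlex
import subprocess
import sys

# Hypothetical names: replace all of these with your own.
FLORET_VECTORS = "vectors.floret"   # vectors table exported by floret
BASE_MODEL_DIR = "floret_model"     # where spaCy writes the pipeline with vectors
OUTPUT_DIR = "textcat_model"        # where Prodigy writes the trained pipeline
DATASET = "my_textcat_dataset"      # Prodigy dataset holding your annotations

# Step 1: build a spaCy pipeline that contains the floret vectors.
init_vectors_cmd = [
    sys.executable, "-m", "spacy", "init", "vectors",
    "en", FLORET_VECTORS, BASE_MODEL_DIR,
    "--mode", "floret",
]

# Step 2: train a text classifier in Prodigy on top of that pipeline.
train_cmd = [
    sys.executable, "-m", "prodigy", "train", OUTPUT_DIR,
    "--textcat", DATASET,
    "--base-model", BASE_MODEL_DIR,
]


def run_all(dry_run=True):
    """Print each command; execute it as well when dry_run is False."""
    for cmd in (init_vectors_cmd, train_cmd):
        print(shlex.join(cmd))
        if not dry_run:
            subprocess.run(cmd, check=True)


run_all(dry_run=True)
```

Calling `run_all(dry_run=False)` would execute both commands in order and stop at the first failure.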

As for the vector database, you might get away with a more lightweight library instead. You may enjoy using annoy or PyNNDescent. Should you choose annoy, I do recommend tuning the number of trees to the size of your dataset: more trees give more accurate results but a larger index.

Let me know if you come across any issues though!
