Do the seed values dynamically update while annotating, or does the seed set remain the same all along?
Yes, the target vector is updated while you’re annotating. When you start the recipe, the seed terms are used to generate the initial target vector – e.g. the average of “apple”, “pear” and “banana”. Prodigy then updates that with the examples you accept, and also keeps a vector average for the terms you reject.
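A minimal sketch of that averaging, assuming any spaCy model with word vectors (this is just the idea, not the actual Prodigy internals):

```python
import numpy as np
import spacy

nlp = spacy.load("en_core_web_lg")  # assumption: any model with word vectors

seeds = ["apple", "pear", "banana"]
target = np.mean([nlp.vocab[w].vector for w in seeds], axis=0)

# When a suggestion is accepted, fold its vector into the running average.
accepted = ["cherry"]
target = np.mean([nlp.vocab[w].vector for w in seeds + accepted], axis=0)
```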
Is this functionality implemented in `terms.py:teach`?
Yes, it's all in the source. The `terms.teach` recipe is actually pretty standalone and doesn't depend on any Prodigy internals or models. It only uses the built-in sorters and, of course, spaCy. The interesting parts are the `accept_score` and `reject_score` functions, and the `accept_doc` and `reject_doc`.
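Paraphrasing the idea (this is a sketch of the logic, not the actual recipe source):

```python
# Paraphrase, not the actual terms.teach source: keep one Doc of accepted
# terms and one of rejected terms, and prefer candidates that are closer
# to the "accept" centroid than to the "reject" one.
def accept_score(word, accept_doc):
    return word.similarity(accept_doc)

def reject_score(word, reject_doc):
    return word.similarity(reject_doc)

def score(word, accept_doc, reject_doc):
    return accept_score(word, accept_doc) - reject_score(word, reject_doc)
```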
I’m thinking about how to implement similar functionality for sentences. Given a set of seed sentences that are similar in some way, I want to pick out sentences from a corpus that are the most similar to the seeds’ meaning. This is a “more like this” task.
It may be the case that none of the seed sentences appear in the corpus, so they cannot be used to initialize matching patterns for a `textcat.teach` annotation task. Instead it seems like the way to do it is to follow the example of `terms.teach`: find the centroid of the vector representations of the seeds, sort the sentences in the corpus by cosine similarity to this average, and have the annotator mark them as same or not the same, adjusting the centroid by the vectors of the accepted sentences. The final “more like this” answer would be the top n most similar sentences in the corpus to the shifted centroid.
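Concretely, something like this is what I have in mind (just a sketch; the model, seed sentences and corpus are placeholders):

```python
import numpy as np
import spacy

nlp = spacy.load("en_core_web_md")  # placeholder: any model with word vectors

seed_sents = ["I want to book a flight.", "Can you reserve me a plane ticket?"]
corpus = ["Where is the train station?", "Please book me a flight to Oslo."]

def cosine(v1, v2):
    norm = np.linalg.norm(v1) * np.linalg.norm(v2)
    return v1.dot(v2) / norm if norm else 0.0

# Centroid of the seed sentences' vectors.
centroid = np.mean([nlp(s).vector for s in seed_sents], axis=0)

# Rank the corpus by similarity to the centroid and annotate from the top.
docs = list(nlp.pipe(corpus))
ranked = sorted(docs, key=lambda d: cosine(centroid, d.vector), reverse=True)

# When a sentence is accepted, shift the centroid towards it.
accepted = [ranked[0]]
centroid = np.mean([nlp(s).vector for s in seed_sents]
                   + [d.vector for d in accepted], axis=0)
```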
- Does this seem reasonable, or should I instead be finding a way to train a text classification model?
- If I do want to do this, is `terms.teach` a good algorithm to base it on?
To help you search the literature, the academic term for this sort of thing is “one-shot learning” (also zero-shot learning, few-shot learning, etc.).
Put glibly: the nice thing about one-shot learning is that you don’t need many examples to train a class-specific model. The downside is it doesn’t work very well.
To answer your questions:
- Yes, it’s very reasonable to try the sentence similarity.
- Yes, `terms.teach` is a good template.
The trick is going to be in the similarity method, or more specifically the vectorization method. The default `doc.similarity()` method is enough to get started, but it’s just averaging the word vectors within the documents. So, you’re going to end up with a measure that’s equivalent to dumping all words in your “accept” document into one set, and then comparing against that set.
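For example, the averaged-vector baseline is just this (the model name is only an example of a pipeline that ships word vectors):

```python
import spacy

nlp = spacy.load("en_core_web_md")  # any model that ships word vectors

doc1 = nlp("Book me a flight to Berlin.")
doc2 = nlp("I need a plane ticket to Germany.")

# Compares the average of the word vectors in each doc via cosine similarity.
print(doc1.similarity(doc2))
```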
I’m sure you’ll get better results by at least including some stop-list sort of logic, so that you’re only considering the vectors for relevant content words. You can then go one step fancier and only look at a few syntactic roles, e.g. just look at the verbs and nouns, or even the root verb and its subject and object if you’re doing intent detection. Even fancier is to try to learn weights by syntactic roles, so that the model can learn verbs are more important, iff verbs actually are more important.
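A filtered vector along those lines might look something like this (a sketch; which parts of speech to keep is up to you, and `pos_` assumes a tagger in the pipeline):

```python
import numpy as np

def content_vector(doc, keep_pos=("NOUN", "PROPN", "VERB", "ADJ")):
    # Average only the vectors of content words, skipping stop words;
    # fall back to the full doc vector if nothing qualifies.
    tokens = [t for t in doc
              if t.has_vector and not t.is_stop and t.pos_ in keep_pos]
    if not tokens:
        return doc.vector
    return np.mean([t.vector for t in tokens], axis=0)
```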
Another family of ideas is to build a matrix of pairwise comparisons over the words. This allows you to add a low-pass filter on the similarities. The intuition is that a non-match of some word against some other word means very little; what matters is how many strong matches you hit. The best-performing version of this idea is the Word Mover’s Distance, which is implemented within Gensim. Another famous application of the same insight is LexRank, which thinks in terms of a weighted graph, and uses PageRank to find the centroid.
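If you want to try the Word Mover’s Distance route, Gensim exposes it on its keyed vectors. A rough sketch (the vector set is just an example, and depending on your Gensim version WMD may need an extra dependency):

```python
import gensim.downloader as api

# Any pretrained word vectors will do; this name is just an example.
vectors = api.load("glove-wiki-gigaword-100")

s1 = "book me a flight to berlin".split()
s2 = "i need a plane ticket to germany".split()

# Word Mover's Distance: lower means more similar (it's a distance).
print(vectors.wmdistance(s1, s2))
```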
Another very good line of attack is to compute a word alignment over the sentences, using various features. This has demonstrated great results in the Semeval Semantic Textual Similarity challenges. For a long time, I’ve wanted an implementation of Sultan’s 2015 rule-based system for spaCy. This repo looks promising: https://github.com/FerreroJeremy/monolingual-word-aligner
Finally, there are end-to-end neural network approaches. This is the hot tactic in the literature, but after doing a lot of work on these, I’m not entirely convinced. If you’re using the NN to look at the sentence in isolation and boil it down into a vector, you’re making your model understand the text in general – this is a much harder subproblem than the one you’re really trying to solve. So doing things this way takes a lot of data.
My recommendation would be to start by just customising the `.vector` method, by assigning a function to `doc.user_hooks['vector']`. Then you could also try a custom similarity method. This pairwise function could be a good place to start:
```python
import numpy as np


def cosine_similarity(vec1, vec2):
    # The vectors are pre-normalised below, so the dot product is the cosine.
    return vec1.dot(vec2)


def row_max_similarity(doc1, doc2, sim_metric=None):
    """Score two docs by summing, for each token, its best match in the
    other doc. Assumes both docs have at least one token with a vector."""
    if sim_metric is None:
        sim_metric = cosine_similarity
    doc1 = [t for t in doc1 if t.has_vector]
    doc2 = [t for t in doc2 if t.has_vector]
    N1 = len(doc1)
    N2 = len(doc2)
    similarities = np.zeros((N1, N2))
    for i in range(N1):
        vec1 = doc1[i].vector / doc1[i].vector_norm
        for j in range(N2):
            vec2 = doc2[j].vector / doc2[j].vector_norm
            similarities[i, j] = sim_metric(vec1, vec2)
    # For each token, the index of its best match in the other doc.
    flows1 = similarities.argmax(axis=0)
    flows2 = similarities.argmax(axis=1)
    sim = (similarities.max(axis=0).sum() + similarities.max(axis=1).sum()) / (N1 + N2)
    return sim, flows1, flows2
```
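To try it out, something like this should work (the model and sentences are placeholders, and the hook wiring assumes spaCy passes the doc and the other object to the similarity hook):

```python
import spacy

nlp = spacy.load("en_core_web_md")  # any model with word vectors

doc1 = nlp("Book me a flight to Berlin.")
doc2 = nlp("I need a plane ticket to Germany.")

score, flows1, flows2 = row_max_similarity(doc1, doc2)
print(score)

# Optionally wire it in so doc1.similarity(doc2) uses it instead of the
# default averaged-vector comparison.
doc1.user_hooks["similarity"] = lambda doc, other: row_max_similarity(doc, other)[0]
print(doc1.similarity(doc2))
```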
Finally, if you want to invest more effort in this, have a look at the word alignment approach. I think it’s really a good way to go.
I’d like to have a recipe for this in Prodigy, if we can figure out an approach that works. So, please keep us updated!