I'm working on single-label multi-class text classification. Since I don't want to annotate the very same labels again and again (and I'm a newbie, so I can't write a stream filtering function that checks both the task hashes and the input keys either), I annotate each of my labels separately into separate datasets (as recommended), which I plan to merge later.
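For context, here's roughly the kind of combined filter I mean but couldn't get working — a minimal sketch, not tested against my setup. The `_input_hash` / `_task_hash` keys are the ones Prodigy sets on each example; the function name `filter_seen` and the idea of pre-filling the seen sets from existing datasets are my own assumptions:

```python
def filter_seen(stream, seen_inputs=None, seen_tasks=None):
    """Skip examples whose input OR task hash was already seen.

    seen_inputs / seen_tasks would normally be pre-filled with the
    hashes of examples already stored in the existing datasets
    (hypothetical arguments, not a Prodigy API).
    """
    seen_inputs = set(seen_inputs or ())
    seen_tasks = set(seen_tasks or ())
    for eg in stream:
        if eg["_input_hash"] in seen_inputs or eg["_task_hash"] in seen_tasks:
            continue  # duplicate input text or duplicate question: skip
        seen_inputs.add(eg["_input_hash"])
        seen_tasks.add(eg["_task_hash"])
        yield eg

# Toy usage: the second example shares the first one's input hash,
# so only two examples survive the filter.
examples = [
    {"text": "a", "_input_hash": 1, "_task_hash": 10},
    {"text": "a", "_input_hash": 1, "_task_hash": 11},
    {"text": "b", "_input_hash": 2, "_task_hash": 20},
]
kept = list(filter_seen(examples))
```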
Since I want to gather the relevant positive examples as fast as possible, I use `prefer_high_scores(predict(stream))`. The expected behaviour: the first examples to annotate would be a series of high-scoring examples and/or examples scored 0.5 because of the pre-defined patterns I used for bootstrapping. Instead, the high-scoring (0.7) predicted examples do come first, but the pattern-matched ones only trickle in slowly over each session.
Did I somehow manage to break my stream?
I use these two overriding functions:
```python
# highlight spans passed as _spans with the texts
def overwrite_spans(stream):
    for eg in stream:
        eg["spans"] = eg["_spans"]
        yield eg

# show each text only once
def filter_stream(stream):
    seen = set()
    for eg in stream:
        # Get the hash identifying the original input, e.g. the text
        input_hash = eg["_input_hash"]
        if input_hash not in seen:
            yield eg
            seen.add(input_hash)

stream = prefer_high_scores(predict(stream))
stream = filter_stream(stream)
stream = overwrite_spans(stream)
```