textcat.teach presents same annotation task if text snippet contains multiple patterns

ines · February 12, 2019, 12:05pm

Hi! This is currently expected behaviour, because the pattern matcher just yields out every result, so a text that contains several patterns comes up once per match. But you’re right that it’s not very practical and we probably want to change this and make “one match per example” the default behaviour.

If you’re annotating for text classification, you’re giving feedback on the text plus label. The patterns are mostly a means to an end, so if you do see an example with the correct label, you should accept it.

That said, if you do get a lot of duplicate matches, you could also write a function that keeps track of the original example texts you’ve already seen and only yields out an example once:

def filter_stream(stream):
    # Input hashes of the examples that have already been sent out
    seen = set()
    for eg in stream:
        # Get the hash identifying the original input, e.g. the text
        input_hash = eg["_input_hash"]
        if input_hash not in seen:
            seen.add(input_hash)
            yield eg

stream = filter_stream(stream)
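
To see the effect outside of a recipe, here is a minimal, self-contained sketch of the same idea running on a toy stream of plain dicts. The function name dedupe_by_input, the example texts and the hash values are made up for illustration, and the fallback to the raw text when no "_input_hash" is present is an assumption on my part, not something the built-in recipe does:

```python
def dedupe_by_input(stream):
    """Yield only the first task seen for each distinct input."""
    seen = set()
    for eg in stream:
        # Prefer the input hash if the task has one. Falling back to the
        # raw text is an assumption for streams that were never hashed.
        key = eg.get("_input_hash", eg.get("text"))
        if key in seen:
            continue
        seen.add(key)
        yield eg


# Toy stream (made-up data): the first text matched two patterns,
# so the matcher produced it twice with the same input hash.
tasks = [
    {"text": "I love pizza and pasta", "label": "FOOD", "_input_hash": 1},
    {"text": "I love pizza and pasta", "label": "FOOD", "_input_hash": 1},
    {"text": "The weather is nice", "label": "FOOD", "_input_hash": 2},
]

unique = list(dedupe_by_input(tasks))
print([eg["text"] for eg in unique])
```

Note that this keeps whichever task for a given text arrives first, so if the duplicates differ in which pattern they highlight, you will only see the first highlight. For text classification that should be fine, since you're judging the text plus label anyway.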
