Text classification `accept` showing both matcher spans and annotator labels


I currently have a text classification recipe that uses the spaCy rule-based Matcher to identify and highlight spans of text. The point of the highlighting is to flag keywords that help annotators assign labels to sentences. An example of the UI is here:

After exporting the data with db-out, I see that the accept field in the JSONL contains both the matcher name (e.g. maturity_date) and the annotator's labels (e.g. Maturity Date). This means that the accept field will often contain near-duplicates such as 'Maturity Date' and 'maturity_date'.

Is this expected behaviour? And if so, is there a recommended way to isolate only the annotator's label without the matcher spans?

Thanks a lot :pray:

Hi! Just to make sure I understand your recipe correctly: does your matcher also pre-select labels by adding them to the "accept" key of the outgoing tasks? And if so, could you share the code? Maybe the ID it's adding doesn't map to the IDs of the "options", or you have the text and ID swapped in the options you provide?

Under the hood, selecting a choice option in the UI will always add the "id" value of the respective option in "options" to the "accept" list. For example, if your option looks like this:

{"id": "MATURITY_DATE", "text": "Maturity Date"}

... the annotator will see "Maturity Date" and if they select it, Prodigy will add "accept": ["MATURITY_DATE"]. If you want to pre-populate the selected options based on matches, that's also what you would stream in. If the data that's loaded in contains an unknown ID, e.g. "accept": ["m_date"], it will be ignored – so maybe that's what's happening.
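To make the relationship between "options" and "accept" concrete, here is a minimal sketch in plain Python (no Prodigy calls). The helper name `split_accept` and the sample task are made up for illustration; it only separates the values in "accept" that correspond to an option "id" from those that don't:

```python
def split_accept(task):
    """Split a task's "accept" list into known option IDs and unknown values."""
    # IDs of the options the annotator can actually select
    option_ids = {opt["id"] for opt in task.get("options", [])}
    accepted = task.get("accept", [])
    known = [value for value in accepted if value in option_ids]
    unknown = [value for value in accepted if value not in option_ids]
    return known, unknown


# Hypothetical task: one valid option ID and one ID that matches no option
task = {
    "text": "The maturity date is 1 June 2030.",
    "options": [{"id": "MATURITY_DATE", "text": "Maturity Date"}],
    "accept": ["MATURITY_DATE", "m_date"],
}
known, unknown = split_accept(task)
```

Here `known` holds the IDs that map to an option and `unknown` holds anything else that was streamed in, which can help spot a mismatch between matcher labels and option IDs.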

Hello Ines,

Thanks for the explanation of how "options" are added to the "accept" list. I realised that what was happening was as you mentioned: the matcher was pre-selecting labels by adding them to the accept key of outgoing tasks. This was not what we wanted, so I commented out that line, which removed the pre-selection from the accept key. For reference, the code is below:

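import srsly
from collections import defaultdict
from spacy.matcher import Matcher
from spacy.tokens import Span

def get_stream_with_matches(stream, patterns, nlp):
    """Load a patterns file, match each text against the patterns and yield tasks with the matches highlighted."""
    # Load the patterns file and group the token patterns by label
    # (assumes Prodigy-style entries with "label" and "pattern" keys)
    patterns = srsly.read_jsonl(patterns)
    patterns_by_label = defaultdict(list)
    for pattern in patterns:
        patterns_by_label[pattern["label"]].append(pattern["pattern"])
    matcher = Matcher(nlp.vocab)
    for label, rules in patterns_by_label.items():
        matcher.add(label, None, *rules)  # spaCy v2 signature; in v3 use matcher.add(label, rules)
    data_tuples = ((eg["text"], eg) for eg in stream)
    for doc, eg in nlp.pipe(data_tuples, as_tuples=True):
        spans = []  # matched spans
        matched_labels = set()  # all labels that were matched
        for match_id, start, end in matcher(doc):
            span = Span(doc, start, end, label=match_id)
            matched_labels.add(span.label_)
            spans.append({"start": span.start_char, "end": span.end_char, "label": span.label_})
        eg["spans"] = spans
        # Pre-selecting the matched labels is what put the matcher names into "accept":
        # eg["accept"] = list(matched_labels)
        yield eg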
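For annotations that were already exported with the pre-selected values, a clean-up along these lines might work. It is only a sketch: the function name `strip_matcher_labels` is made up, and it assumes each exported record keeps its "spans" and that the matcher labels (e.g. maturity_date) never coincide with an option ID the annotator can select:

```python
import json


def strip_matcher_labels(jsonl_lines):
    """Yield records whose "accept" no longer contains any span (matcher) label."""
    for line in jsonl_lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        eg = json.loads(line)
        # Labels the matcher attached to the highlighted spans
        matcher_labels = {span.get("label") for span in eg.get("spans", [])}
        # Keep only the values that did not come from the matcher
        eg["accept"] = [a for a in eg.get("accept", []) if a not in matcher_labels]
        yield eg


# Hypothetical exported record, in the shape described earlier in the thread
exported = [
    '{"text": "Maturity date: 1 June 2030", '
    '"spans": [{"start": 0, "end": 13, "label": "maturity_date"}], '
    '"accept": ["maturity_date", "Maturity Date"]}',
]
cleaned = list(strip_matcher_labels(exported))
```

In practice the lines would come from the file written by db-out rather than from an in-memory list.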

Thank you for your help!
