Text classification `accept` showing both matcher spans and annotator labels

JeanneC · March 22, 2021, 3:05am

Hello,

I currently have a text classification recipe that uses spaCy's rule-based Matcher to identify and highlight spans of text. The point of the highlighting is to flag keywords to help annotators assign labels to sentences. An example of the UI is here:

After exporting the data with db-out, I see that the accept field in the JSONL seems to contain both the matcher name (e.g. maturity_date) and the annotator's labels (e.g. Maturity Date). This means that the accept field will often contain near-duplicates such as 'Maturity Date' and 'maturity_date'.

Is this expected behaviour? And if so, is there a recommended way to isolate only the annotator's label without the matcher spans?

Thanks a lot

ines · March 23, 2021, 12:28am

Hi! Just to make sure I understand your recipe correctly: does your matcher also pre-select labels by adding them to the "accept" key of the outgoing tasks? And if so, could you share the code? Maybe the ID it's adding doesn't map to the IDs of the "options", or you have the text/ID swapped in the options you provide?

Under the hood, selecting a choice option in the UI will always add the "id" value of the respective option in "options" to the "accept" list. For example, if your option looks like this:

{"id": "MATURITY_DATE", "text": "Maturity Date"}

... the annotator will see "Maturity Date" and if they select it, Prodigy will add "accept": ["MATURITY_DATE"]. If you want to pre-populate the selected options based on matches, that's also what you would stream in. If the data that's loaded in contains an unknown ID, e.g. "accept": ["m_date"], it will be ignored – so maybe that's what's happening.
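A minimal sketch of the mechanics described above, in plain Python with no Prodigy import. The helper `add_options`, the `LABEL_TO_OPTION_ID` mapping and the example text are hypothetical names made up for illustration; only the task keys ("options", "accept", "spans") and the MATURITY_DATE option come from this thread.

```python
# Options shown to the annotator: "text" is displayed, "id" is what lands in "accept".
OPTIONS = [{"id": "MATURITY_DATE", "text": "Maturity Date"}]

# Assumed mapping from matcher pattern labels to option IDs (hypothetical).
LABEL_TO_OPTION_ID = {"maturity_date": "MATURITY_DATE"}


def add_options(task, preselect=False):
    """Return a copy of the task with options attached and, optionally, pre-selected."""
    task = dict(task)
    task["options"] = OPTIONS
    if preselect:
        matched = {span["label"] for span in task.get("spans", [])}
        # Only stream in IDs that exist in "options"; other values will not map to a choice.
        task["accept"] = sorted(
            {LABEL_TO_OPTION_ID[label] for label in matched if label in LABEL_TO_OPTION_ID}
        )
    return task


example = {
    "text": "The notes mature on 1 June 2030.",  # made-up sentence
    "spans": [{"start": 10, "end": 16, "label": "maturity_date"}],
}
print(add_options(example, preselect=True)["accept"])  # ['MATURITY_DATE']
```

Calling `add_options(example)` without `preselect=True` leaves "accept" unset, so the exported list would then contain only what the annotator clicked.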

JeanneC · March 24, 2021, 12:19pm

Hello Ines,

Thanks for the explanation about how "options" are added to the "accept" list. I realised what was happening was like you mentioned - the matcher was pre-selecting labels by adding them to the accept key of outgoing tasks. This was not what we wanted so commenting out that line removed the preselection from the accept key. For reference, the code is below:

import srsly
from collections import defaultdict
from spacy.matcher import Matcher
from spacy.tokens import Span


def get_stream_with_matches(stream, patterns, nlp):
    """Load a patterns file, match each text against it and yield tasks with the matches highlighted."""
    # Load the patterns file and group the token patterns by label
    patterns = srsly.read_jsonl(patterns)
    patterns_by_label = defaultdict(list)
    for pattern in patterns:
        patterns_by_label[pattern["label"]].append(pattern["pattern"])
    matcher = Matcher(nlp.vocab)
    for label, rules in patterns_by_label.items():
        # spaCy v2 signature; in spaCy v3 this would be matcher.add(label, rules)
        matcher.add(label, None, *rules)
    data_tuples = ((eg["text"], eg) for eg in stream)
    for doc, eg in nlp.pipe(data_tuples, as_tuples=True):
        spans = []  # matched spans
        matched_labels = set()  # all labels that were matched
        for match_id, start, end in matcher(doc):
            span = Span(doc, start, end, label=match_id)
            matched_labels.add(span.label_)
            spans.append({"start": span.start_char, "end": span.end_char, "label": span.label_})
        # Attach the spans and yield each example once, after all matches are collected
        eg["spans"] = spans
        # Pre-selection disabled: this line put the matcher labels into "accept"
        # eg["accept"] = list(matched_labels)
        yield eg

Thank you for your help!
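For records that were already exported while the pre-selection was still active, one possible clean-up is sketched below. It assumes that the unwanted entries in "accept" are exactly the labels that appear in the record's "spans", which matches the duplicates described in the question but would not hold if an option ID were identical to a pattern label. The function name `strip_matcher_labels` and the sample record are hypothetical.

```python
import json


def strip_matcher_labels(record):
    """Return a copy of an exported record whose "accept" list omits span labels."""
    span_labels = {span["label"] for span in record.get("spans", [])}
    cleaned = dict(record)
    cleaned["accept"] = [a for a in record.get("accept", []) if a not in span_labels]
    return cleaned


# Made-up JSONL line shaped like the duplicates described in the question.
line = json.dumps({
    "text": "The notes mature on 1 June 2030.",
    "spans": [{"start": 10, "end": 16, "label": "maturity_date"}],
    "accept": ["maturity_date", "Maturity Date"],
    "answer": "accept",
})

record = strip_matcher_labels(json.loads(line))
print(record["accept"])  # ['Maturity Date']
```

The same function can be applied to each line of a db-out export before the data is used for training.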
