Building a more complex suggester function for spancat

thalish · November 30, 2021, 7:55pm

Hi Team,

I am trying to build a more complex suggester function for use in spancat. Basically, my suggester function will define which spans (character or token spans) in the text are eligible to be annotated by spancat manual.

Are there any resources on how I might go about achieving this? The example in the official docs shows how to define the range of n-gram sizes, and the docs allude to how more complex suggester functions can be built.

From the docs:

"You can also implement more sophisticated suggesters, for instance to consider all noun phrases in Doc.noun_chunks, or only certain spans defined by match patterns"

I'm unclear on what format the suggester function's return value should have to achieve what I want.

Thanks!

adriane · December 1, 2021, 7:18am

You'll probably want to wait for some bug fixes in spaCy v3.2.1 before trying this out, especially if you have a suggester that may suggest 0 spans for some docs.

I kind of hesitate to link this (it's not very polished and at the very least it's not efficient), but I always find more examples helpful personally, so here's an example of a noun chunk-based suggester (it's noun chunks +/- two tokens on each end):

github.com/adrianeboyd/projects/blob/b5a75f7dc512da4d9dd639983f7d519f622464c8/experimental/ner_confidence/scripts/code.py

from typing import Optional, Iterable, cast
from thinc.api import get_current_ops, Ops
from thinc.types import Ragged, Ints1d

from spacy.pipeline.spancat import Suggester
from spacy.tokens import Doc
from spacy.util import registry


@registry.misc("noun_chunk_suggester.v1")
def build_noun_chunk_suggester() -> Suggester:
    def noun_chunk_suggester(
        docs: Iterable[Doc], *, ops: Optional[Ops] = None
    ) -> Ragged:
        if ops is None:
            ops = get_current_ops()
        spans = []
        lengths = []
        for doc in docs:
            if doc.has_annotation("DEP"):

(The file is truncated here; see the link above for the full source.)

You want to return a Ragged object that contains a flattened list of the span offsets for all the docs in the batch, where each length in lengths is the number of spans for each doc in that batch, so it's possible to split up the flat list by doc again. If there are no spans in a particular doc, the length needs to be 0. You can try passing this method a list of docs that were already annotated with en_core_web_sm or similar to get a feel for what the returned Ragged object will look like.
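To make the layout concrete, here is a minimal sketch in plain Python, with no spaCy or thinc involved. The helper names flatten_spans and split_spans are made up for illustration, and the sketch assumes each span is a (start, end) pair of token offsets with an exclusive end. Plain lists stand in for the two arrays that a real suggester would wrap in a Ragged object.

```python
from typing import List, Tuple

# Assumption for this sketch: a span is (start_token, end_token), end exclusive.
Span = Tuple[int, int]


def flatten_spans(spans_per_doc: List[List[Span]]) -> Tuple[List[Span], List[int]]:
    """Flatten per-doc span lists into one list plus a per-doc count."""
    flat: List[Span] = []
    lengths: List[int] = []
    for doc_spans in spans_per_doc:
        flat.extend(doc_spans)
        # One entry per doc, even when the doc has no spans (length 0).
        lengths.append(len(doc_spans))
    return flat, lengths


def split_spans(flat: List[Span], lengths: List[int]) -> List[List[Span]]:
    """Invert flatten_spans: use the lengths to cut the flat list up by doc."""
    result: List[List[Span]] = []
    offset = 0
    for n in lengths:
        result.append(flat[offset:offset + n])
        offset += n
    return result


# Hypothetical batch of three docs; the second doc has no candidate spans.
batch = [
    [(0, 2), (3, 5)],
    [],
    [(1, 4)],
]
flat, lengths = flatten_spans(batch)
print(flat)     # [(0, 2), (3, 5), (1, 4)]
print(lengths)  # [2, 0, 1]
```

The flat list corresponds to the data of the Ragged and the lengths list to its lengths, so the sum of the lengths always equals the number of rows in the flat data, and a doc without spans still contributes a 0.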

(Note that actually training with this suggester requires the config to source quite a few components (tok2vec, tagger, parser, attribute ruler) from a pipeline like en_core_web_sm and add them to both frozen and annotating components.)

thalish · December 5, 2021, 6:06pm

Do you have a tentative release date for spaCy v3.2.1? I'm looking forward to building a new spancat pipeline and would appreciate any information you can provide on this.

Thanks!

adriane · December 6, 2021, 12:47pm

I can't make any promises, but hopefully v3.2.1 will be published this week.
