I am trying to build a more complex suggester function for use in spancat. Basically, my suggester function is going to define the spans (character or token spans) in the text that are eligible to be annotated by spancat manual.
Are there any resources on how I might go about achieving this? The example in the official docs shows how to define the range of ngrams, and it alludes to how more complex suggester functions can be built.
From the docs:
"You can also implement more sophisticated suggesters, for instance to consider all noun phrases in Doc.noun_chunks, or only certain spans defined by match patterns"
I'm unclear what the format of the return type of the suggester function should be to achieve what I want.
You'll probably want to wait for some bug fixes in spaCy v3.2.1 before trying this out, especially if you have a suggester that may suggest 0 spans for some docs.
I kind of hesitate to link this (it's not very polished and at the very least it's not efficient), but I always find more examples helpful personally, so here's an example of a noun chunk-based suggester (it's noun chunks +/- two tokens on each end):
You want to return a Ragged object that contains a flattened array of (start, end) token offsets for all the spans in all the docs in the batch, where each entry in lengths is the number of spans suggested for the corresponding doc, so it's possible to split the flat array back up by doc. If there are no spans in a particular doc, its length needs to be 0. You can try passing this method a list of docs that were already annotated with en_core_web_sm or similar to get a feel for what the returned Ragged object will look like.
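To make the return format concrete, here's a minimal sketch of a bare noun-chunk suggester (no +/- two token padding, and the registered name `noun_chunk_suggester.v1` is just made up for the example):

```python
from typing import List, Optional

from spacy.tokens import Doc
from spacy.util import registry
from thinc.api import Ops, get_current_ops
from thinc.types import Ragged


@registry.misc("noun_chunk_suggester.v1")  # hypothetical name for this example
def build_noun_chunk_suggester():
    def noun_chunk_suggester(docs: List[Doc], *, ops: Optional[Ops] = None) -> Ragged:
        if ops is None:
            ops = get_current_ops()
        spans = []    # flat list of (start, end) token offsets across the whole batch
        lengths = []  # number of suggested spans per doc, in order
        for doc in docs:
            count = 0
            for chunk in doc.noun_chunks:
                spans.append([chunk.start, chunk.end])
                count += 1
            lengths.append(count)
        lengths_array = ops.asarray1i(lengths)
        if spans:
            return Ragged(ops.asarray(spans, dtype="i"), lengths_array)
        # no spans anywhere in the batch: still return a well-formed Ragged
        return Ragged(ops.xp.zeros((0, 0), dtype="i"), lengths_array)

    return noun_chunk_suggester
```

The docs you pass in need a dependency parse (and POS tags) for doc.noun_chunks to work, which is why the annotating components below matter.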
(Note that actually training with this suggester requires the config to source quite a few components (tok2vec, tagger, parser, attribute ruler) from a pipeline like en_core_web_sm and add them to both frozen and annotating components.)
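For reference, the relevant pieces of the training config might look roughly like this (a partial fragment, assuming the hypothetical registered suggester name from above and the usual component names; adjust to your pipeline):

```ini
[nlp]
pipeline = ["tok2vec","tagger","parser","attribute_ruler","spancat"]

[components.tok2vec]
source = "en_core_web_sm"

[components.tagger]
source = "en_core_web_sm"

[components.parser]
source = "en_core_web_sm"

[components.attribute_ruler]
source = "en_core_web_sm"

[components.spancat.suggester]
@misc = "noun_chunk_suggester.v1"

[training]
frozen_components = ["tok2vec","tagger","parser","attribute_ruler"]
annotating_components = ["tok2vec","tagger","parser","attribute_ruler"]
```

Freezing keeps the sourced components from being updated, while listing them as annotating components makes sure their predictions (the parse the suggester needs) are set on the docs during training.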