Simple way of getting tagged/marked words and phrases after span categorization task

joebuckle · January 12, 2023, 2:23am

Good day,

Is there a simple way to get all the words and phrases that we have tagged in a span categorization task? We are using this task to collect common keywords/phrases in the text.

Regards,
Joe

koaning · January 12, 2023, 3:15pm

Hi Joe,

Could you clarify what you mean by "get all the words and phrases that we have tagged"? You could use the db-out command, but I'm not 100% sure that's what you mean.
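For reference, here is a rough sketch of what the db-out route could look like. It assumes the dataset was exported to JSONL with db-out, and that each line is a task with a "text" key and a "spans" list holding "start"/"end" character offsets. The helper name span_texts_from_jsonl and the sample lines are made up for illustration; they are not part of Prodigy.

```python
import json
from typing import Iterable, List


def span_texts_from_jsonl(lines: Iterable[str]) -> List[str]:
    """Return the text of every span found in accepted tasks.

    Hypothetical helper: `lines` is any iterable of JSONL strings,
    for example an open file handle on a db-out export.
    """
    found = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        task = json.loads(line)
        # Assumption: tasks without an "answer" key are treated as accepted.
        if task.get("answer", "accept") != "accept":
            continue
        text = task["text"]
        # "spans" can be absent when nothing was highlighted.
        for span in task.get("spans", []):
            found.append(text[span["start"]:span["end"]])
    return found


# Made-up sample lines in the assumed export format.
sample = [
    '{"text": "hi my name is Johnny Bravo", '
    '"spans": [{"start": 14, "end": 26, "label": "name"}], "answer": "accept"}',
    '{"text": "hi my name is Vincent D. Warmerdam", "spans": [], "answer": "reject"}',
]
print(span_texts_from_jsonl(sample))
```

With a real export you would pass an open file instead of the sample list, e.g. `with open("annotations.jsonl") as f: span_texts_from_jsonl(f)`, where the file name is whatever you redirected db-out to.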

joebuckle · January 12, 2023, 10:43pm

Sorry, I meant getting the actual words and phrases from within a custom recipe (Python code). Looking at the example on 'accept', it contains only the start and end numbers of the tokens, not the text itself.

Thanks.

koaning · January 13, 2023, 10:02am

You could fetch those via the update callback. Here's a custom recipe with a demo.

from pathlib import Path
from typing import Any, Dict

import prodigy
import spacy
from prodigy import get_stream
from prodigy.components.preprocess import add_tokens


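def update(answers):
    # Called with each batch of answers as they are submitted.
    for answer in answers:
        # "spans" may be missing when nothing was highlighted.
        for span in answer.get("spans", []):
            # "start" and "end" are character offsets into the text.
            start = span["start"]
            end = span["end"]
            print(f"I found this span: {answer['text'][start:end]}")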

@prodigy.recipe(
    "spancat.special",
    dataset=("Dataset to save annotations to", "positional", None, str),
    lang=("language for the tokeniser", "positional", None, str),
    source=("Data to annotate", "positional", None, str),
    labels=("comma separated sequence of labels", "option", "l", str),
)
def special(dataset: str, lang: str, source: Path, labels: str) -> Dict[str, Any]:
    nlp = spacy.blank(lang)
    labels = labels.split(",")
    stream = get_stream(source, rehash=True, input_key="text", dedup=True)
    stream = add_tokens(nlp, stream, skip=True)

    return {
        "dataset": dataset,
        "stream": stream,
        "view_id": "spans_manual",
        "update": update,
        "config": {
            "lang": lang,
            "labels": labels,
            "batch_size": 1
        },
    }

This recipe is in a folder with an examples.jsonl file that contains:

{"text": "hi my name is Vincent D. Warmerdam"}
{"text": "hi my name is Johnny Bravo"}

When I run this command:

python -m prodigy spancat.special issue-6230 en examples.jsonl --labels name -F recipe.py

I get this annotation interface.

It's very much just the spancat interface, but try to annotate a single example and hit "save". When you do, you should see a print appear with the selected span. Does this suffice?
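If the goal is to find common keywords/phrases, the same slicing idea can feed a counter instead of a print. The following is only a sketch under assumptions: phrase_counts is a hypothetical module-level Counter, the answers are plain dicts with "text", "spans" and "answer" keys, and the batch at the bottom is made-up data that stands in for what the update callback would receive.

```python
from collections import Counter

# Hypothetical module-level tally of (label, phrase) pairs.
phrase_counts = Counter()


def update(answers):
    """Tally the text of every span in accepted answers."""
    for answer in answers:
        if answer.get("answer") != "accept":
            continue  # ignore rejected or skipped examples
        text = answer["text"]
        for span in answer.get("spans", []):
            phrase = text[span["start"]:span["end"]]
            phrase_counts[(span.get("label"), phrase)] += 1


# Made-up batch simulating what the callback might receive.
batch = [
    {"text": "hi my name is Johnny Bravo", "answer": "accept",
     "spans": [{"start": 14, "end": 26, "label": "name"}]},
    {"text": "hi my name is Johnny Bravo", "answer": "accept",
     "spans": [{"start": 14, "end": 26, "label": "name"}]},
    {"text": "hi my name is Vincent D. Warmerdam", "answer": "accept",
     "spans": [{"start": 14, "end": 34, "label": "name"}]},
]
update(batch)
print(phrase_counts.most_common())
```

Note that the tally lives in memory only, so it resets whenever the server restarts; for a persistent count, computing it from the saved dataset is probably the safer choice.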

joebuckle · January 14, 2023, 3:19am

Yes, thank you very much!
