Simple way of getting tagged/marked words and phrases after a span categorization task

Good day,

Is there a simple way to get all the words and phrases that we have tagged in a span categorization task? We are using this task to be able to get common keywords/phrases in the text.


Hi Joe,

Could you clarify what you mean by "get all the words and phrases that we have tagged"? You could use the db-out command, but I'm not 100% sure that's what you mean.
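In case the db-out route is enough: here is a rough sketch, assuming the export is the usual JSONL with "text", "spans" and "answer" keys per line, and that "start"/"end" on each span are character offsets into "text". The function name count_span_texts and the sample lines are made up for illustration; they are not part of Prodigy.

```python
import json
from collections import Counter
from typing import Iterable


def count_span_texts(lines: Iterable[str]) -> Counter:
    """Count the text of every span found in accepted examples of a JSONL export."""
    counts: Counter = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        example = json.loads(line)
        if example.get("answer") != "accept":
            continue  # only keep examples that were accepted
        for span in example.get("spans", []):
            # Assumption: start/end are character offsets into the text
            counts[example["text"][span["start"]:span["end"]]] += 1
    return counts


# Hand-written stand-ins for lines of an export (not real annotations)
sample_lines = [
    json.dumps({"text": "hi my name is Vincent D. Warmerdam",
                "spans": [{"start": 14, "end": 34, "label": "name"}],
                "answer": "accept"}),
    json.dumps({"text": "hi my name is Johnny Bravo",
                "spans": [{"start": 14, "end": 26, "label": "name"}],
                "answer": "accept"}),
    json.dumps({"text": "hi my name is Johnny Bravo",
                "spans": [{"start": 14, "end": 26, "label": "name"}],
                "answer": "reject"}),
]

counts = count_span_texts(sample_lines)
print(counts.most_common())
```

With a real export you would pass an open file handle instead of the sample list, for example a file written by redirecting db-out to a hypothetical annotations.jsonl.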

Sorry, I meant getting the actual words and phrases from within a custom recipe (Python code). Looking at the example on 'accept', it seems to contain only the start and end numbers of the tokens.


You could fetch those via the update callback. Here's a custom recipe with a demo.

from pathlib import Path
from typing import Any, Dict

import prodigy
import spacy
from prodigy import get_stream
from prodigy.components.preprocess import add_tokens

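def update(answers):
    # Called with each batch of annotated examples sent back from the web app
    for answer in answers:
        # An example may come back without any highlighted spans
        for span in answer.get('spans', []):
            # start/end are character offsets into the example text
            start = span['start']
            end = span['end']
            print(f"I found this span: {answer['text'][start:end]}")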

@prodigy.recipe(
    "spancat.special",
    dataset=("Dataset to save annotations to", "positional", None, str),
    lang=("Language for the tokenizer", "positional", None, str),
    source=("Data to annotate", "positional", None, str),
    labels=("Comma-separated sequence of labels", "option", "l", str),
)
def special(dataset: str, lang: str, source: Path, labels: str) -> Dict[str, Any]:
    nlp = spacy.blank(lang)
    labels = labels.split(",")
    stream = get_stream(source, rehash=True, input_key="text", dedup=True)
    stream = add_tokens(nlp, stream, skip=True)

    return {
        "dataset": dataset,
        "stream": stream,
        "view_id": "spans_manual",
        "update": update,
        "config": {
            "lang": lang,
            "labels": labels,
            "batch_size": 1,
        },
    }
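
The slicing in the update callback relies on "start" and "end" being character offsets into the example's "text" (the token indices live in separate "token_start" and "token_end" keys). If you want to try that logic outside of Prodigy first, here is a small sketch on a hand-written task dict. The helper name span_texts and the sample values are made up for illustration.

```python
from typing import Any, Dict, List, Tuple


def span_texts(task: Dict[str, Any]) -> List[Tuple[str, str]]:
    """Return (label, text) pairs for every span in one annotated task."""
    text = task["text"]
    return [
        # Assumption: start/end are character offsets into the text
        (span.get("label", ""), text[span["start"]:span["end"]])
        for span in task.get("spans", [])
    ]


# Hand-written example that mimics the shape of an annotated task
task = {
    "text": "hi my name is Johnny Bravo",
    "spans": [{"start": 14, "end": 26, "label": "name"}],
    "answer": "accept",
}

print(span_texts(task))
```

The same helper could be called from inside update if you would rather collect the phrases in a list than print them.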

This recipe is in a folder with an examples.jsonl file that contains:

{"text": "hi my name is Vincent D. Warmerdam"}
{"text": "hi my name is Johnny Bravo"}

When I run this command:

python -m prodigy spancat.special issue-6230 en examples.jsonl --labels name -F

I get this annotation interface.

It's very much just the standard spancat interface, but try annotating a single example and hitting "save". When you do, you should see a line printed in the terminal with the selected span. Does this suffice?


Yes, thank you very much!
