preannotated spans in input json not showing up

Hi,

I wanted to combine the annotations an NER model already produces for me with patterns, in order to create a dataset for spancat.
So I thought I'd put spans directly into the JSON I'm using as input for spans.manual, but they're not showing up.

I also tried ents instead of spans, but that didn't work either.
I put in tokens as well, and I'm setting token_start and token_end to 0 for all spans/ents.

Any ideas what might be going wrong?

/edit: Found this post Datasets and using pre-annotated data - #4 by ines

I rewrote my code to use the JSONL util:

doc = nlp(text)
spans = [{"start": ent.start, "end": ent.end, "label": ent.label_} for ent in list(doc.ents)]

if random.random() < 1 - test_likelihood:
    train_samples.append({"text": text, "spans": spans})
else:
    test_samples.append({"text": text, "spans": spans})

...

write_jsonl("data/train_samples_ents.jsonl", train_samples)
write_jsonl("data/test_samples_ents.jsonl", test_samples)

Sadly, I still don't see the annotations in Prodigy :confused:

/edit 2

I noticed that when I don't provide patterns, I get this error:

ValueError: Mismatched tokenization. Can't resolve span to token index 4. This can happen if your data contains pre-set
spans. Make sure that the spans match spaCy's tokenization or add a 'tokens' property to your task.

Even when I add the tokens property, the error remains. Besides, the tokenization should already match, since I'm creating the input data with the same model that I'm also providing to Prodigy for tokenization in the spans.manual recipe.

Hi! I think one problem here is that you're using the token indices instead of the start and end character offsets:

The start and end of the "spans" refer to the character offsets, so you want to do something like this:

spans = [{"start": ent.start_char, "end": ent.end_char, "label": ent.label_} for ent in list(doc.ents)]

Thank you, that solved the issue! It's a shame one can't combine patterns and pre-annotated data, but we've had a similar discussion about combining model predictions with data :slight_smile:

As a side note: The documentation for spans (https://spacy.io/api/span) states that start and end are token indices, not char indices. Is that incorrect then?

Yes, spaCy's Span.start and Span.end return the token indices and Span.start_char and Span.end_char the character offsets.
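To illustrate the difference, here's a minimal sketch using a blank English pipeline and a hand-made span (the text and label are made up for the example):

```python
import spacy
from spacy.tokens import Span

nlp = spacy.blank("en")
doc = nlp("Apple is based in Cupertino")
# Manually create a span over the first token, "Apple"
span = Span(doc, 0, 1, label="ORG")

print(span.start, span.end)            # token indices: 0 1
print(span.start_char, span.end_char)  # character offsets: 0 5
```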

Prodigy's "spans" use the character offsets as "start" and "end" and use an optional "token_start" and "token_end" for token indices (which is added automatically by Prodigy's built-in NER recipes). This is slightly inconsistent with spaCy, but the reason we did it this way initially is that the static ner interface can also work from character offsets only, so the tokens aren't always 100% necessary. Also, now that we decided on the JSON format, we can't easily go back and change it :sweat_smile:
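So a pre-annotated Prodigy task would look something like this (a hand-made sketch; the text and label are purely illustrative):

```python
# "start"/"end" are character offsets into "text"; "token_start"/"token_end"
# are token indices and optional (built-in NER recipes add them automatically).
task = {
    "text": "Apple is based in Cupertino",
    "spans": [
        {
            "start": 0,        # character offset where "Apple" begins
            "end": 5,          # character offset where it ends
            "token_start": 0,  # token index (optional)
            "token_end": 0,
            "label": "ORG",
        }
    ],
}
# The character offsets slice the original text back out of the task
print(task["text"][task["spans"][0]["start"]:task["spans"][0]["end"]])  # Apple
```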

You should be able to do this in a custom recipe relatively easily, though – you'd just need to decide on the strategy to use for overlaps!

Right, I didn't even think about writing my own recipe. I may look into that, and maybe I'll even submit a PR, if that's at all interesting?

Sure! If you find an approach that works well, we'd definitely appreciate a PR to prodigy-recipes or alternatively, you can also just share it here or as a GitHub gist (which others can then download and use via -F in Prodigy).

If we can find a solution that generalises well to different strategies, we'd also be interested in shipping this in the core library, either via ner.correct or as an entirely different workflow for comparing annotations from different sources.


Since this is probably not a super common use case I'll just share it here.
This is a recipe that uses both patterns and a model to highlight spans (note: that's different from what this topic was originally about, but it suits my needs best :wink: ). There is no strategy to prioritize matches over predictions; it simply uses both (my use case has no overlaps).
The model can be either spancat or ner.

from typing import List, Optional, Union, Iterable
import spacy
from spacy.language import Language
import copy

from prodigy.models.matcher import PatternMatcher
from prodigy.components.preprocess import add_tokens
from prodigy.components.loaders import get_stream
from prodigy.core import recipe
from prodigy.util import log, split_string, get_labels, msg, set_hashes, INPUT_HASH_ATTR
from prodigy.types import RecipeSettingsType, StreamType
from prodigy.recipes.spans import validate_with_suggester, remove_tokens

@recipe(
    "spans.manual_model_x_patterns",
    # fmt: off
    dataset=("Dataset to save annotations to", "positional", None, str),
    spacy_model=("Loadable spaCy pipeline for tokenization or blank:lang (e.g. blank:en)", "positional", None, str),
    source=("Data to annotate (file path or '-' to read from standard input)", "positional", None, str),
    loader=("Loader (guessed from file extension if not set)", "option", "lo", str),
    label=("Comma-separated label(s) to annotate or text file with one label per line", "option", "l", get_labels),
    patterns=("Path to match patterns file", "option", "pt", str),
    exclude=("Comma-separated list of dataset IDs whose annotations to exclude", "option", "e", split_string),
    component=("Name of spancat component in the pipeline", "option", "c", str),
    highlight_chars=("Allow highlighting individual characters instead of tokens", "flag", "C", bool),
    suggester=("Name of suggester function registered in spaCy's 'misc' registry. Will be used to validate annotations as they're submitted. Use the -F option to provide a custom Python file", "option", "sg", str),
    # fmt: on
)
def manual_model_x_patterns(
    dataset: str,
    spacy_model: str,
    source: Union[str, Iterable[dict]],
    loader: Optional[str] = None,
    label: Optional[List[str]] = None,
    patterns: Optional[str] = None,
    exclude: Optional[List[str]] = None,
    component: str = "spancat",
    highlight_chars: bool = False,
    suggester: Optional[str] = None,
) -> RecipeSettingsType:
    """
    Annotate potentially overlapping and nested spans in the data. If
    patterns are provided, their matches are highlighted in the example, if
    available. If a model is provided, its predictions are highlighted as
    well. The tokenizer is used to tokenize the incoming texts so the
    selection can snap to token boundaries. You can also set --highlight-chars
    for character-based highlighting.
    """
    log("RECIPE: Starting recipe spans.manual_model_x_patterns", locals())
    nlp = spacy.load(spacy_model)
    if component not in nlp.pipe_names:
        msg.fail(
            f"Can't find component '{component}' in pipeline. Make sure that the "
            f"pipeline you're using includes a trained span categorizer that you "
            f"can correct. If your component has a different name, you can use "
            f"the --component option to specify it.",
            exits=1,
        )
    
    labels = label  # comma-separated list or path to text file
    model_labels = nlp.pipe_labels.get(component, [])
    if not labels:
        labels = model_labels
        if not labels:
            msg.fail("No --label argument set and no labels found in model", exits=1)
        msg.text(f"Using {len(labels)} labels from model: {', '.join(labels)}")
    log(f"RECIPE: Annotating with {len(labels)} labels", labels)
    
    if component == "spancat":
        key = nlp.get_pipe(component).key
        msg.text(f"""Reading spans from key '{key}': doc.spans["{key}"]""")
    elif component == "ner":
        msg.text("Reading ents from model")
    
    stream = get_stream(
        source, loader=loader, rehash=True, dedup=True, input_key="text"
    )
    
    if patterns is not None:
        pattern_matcher = PatternMatcher(
            nlp, combine_matches=True, all_examples=True, allow_overlap=True
        )
        pattern_matcher = pattern_matcher.from_disk(patterns)
        stream = (eg for _, eg in pattern_matcher(stream))
    # Add "tokens" key to the tasks, either with words or characters
    stream = add_tokens(nlp, stream, use_chars=highlight_chars)
    
    def make_tasks(nlp: Language, stream: StreamType) -> StreamType:
        """Add a 'spans' key to each example, with predicted spans."""
        texts = ((eg["text"], eg) for eg in stream)
        for doc, eg in nlp.pipe(texts, as_tuples=True, batch_size=10):
            task = copy.deepcopy(eg)
            spans = task.get("spans", [])
            
            predicted_spans = []
            if component == "spancat":
                predicted_spans = doc.spans[key]                 
            elif component == "ner":
                predicted_spans = list(doc.ents)
            for span in predicted_spans:
                if labels and span.label_ not in labels:
                    continue
                spans.append(
                    {
                        "token_start": span.start,
                        "token_end": span.end - 1,
                        "start": span.start_char,
                        "end": span.end_char,
                        "text": span.text,
                        "label": span.label_,
                        "source": spacy_model,
                        "input_hash": eg[INPUT_HASH_ATTR],
                    }
                )
            task["spans"] = spans
            task = set_hashes(task)
            yield task
    
    validate_func = validate_with_suggester(nlp, suggester) if suggester else None
    
    stream = make_tasks(nlp, stream)
    
    return {
        "view_id": "spans_manual",
        "dataset": dataset,
        "stream": stream,
        "exclude": exclude,
        "before_db": remove_tokens if highlight_chars else None,
        "validate_answer": validate_func,
        "config": {
            "lang": nlp.lang,
            "labels": labels,
            "exclude_by": "input",
            "ner_manual_highlight_chars": highlight_chars,
            "auto_count_stream": True,
        },
    }
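For anyone who wants to try it: assuming the recipe above is saved as spans_manual_model_x_patterns.py (the file name, dataset name, model and data paths here are all placeholders), it can be loaded with the -F option, e.g.:

```shell
prodigy spans.manual_model_x_patterns my_dataset ./my_spancat_model data.jsonl \
  --label LABEL_A,LABEL_B --patterns patterns.jsonl -F spans_manual_model_x_patterns.py
```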
