Custom recipe - stream loops through all examples.

Hi!

I have written a custom recipe to load data from an existing dataset in my postgres db. The problem I'm having is that the stream seems to be looped over completely before serving:

import prodigy
import spacy
from prodigy.components.loaders import get_stream
from prodigy.components.preprocess import add_tokens
from prodigy.types import StreamType

# make the config work for rel_component
from custom_training.rel_component.scripts.rel_model import (
    create_classification_layer, create_instances, create_relation_model)
from custom_training.rel_component.scripts.rel_pipe import (
    make_relation_extractor, score_relations)

color_map = {
    "SIBLING": "#ffd882",
    "PARENT": "#c5bdf4",
    "SAME_AS": "#d9fbad"
}


nlp = spacy.load("ref_ner/training/model-best")
rel_model = spacy.load("custom_training/rel_component/training/model-best")
nlp.add_pipe("relation_extractor", source=rel_model, name="relation_extractor", after="ner")


def add_relations_to_stream(stream) -> StreamType:
    for eg in stream:
        doc = nlp(eg["text"])
        eg["relations"] = []
        eg["spans"] = []
        ent_map = {ent.start: ent for ent in doc.ents}
        for ent in doc.ents:
            span = dict(start=ent.start_char, end=ent.end_char,
                        token_start=ent.start, token_end=ent.end, label=ent.label_)
            eg["spans"].append(span)
        for (head, child), rel in doc._.rel.items():
            rev_rel = {v: k for k, v in rel.items()}
            val = max(rev_rel.keys())
            if val < 0.5:
                continue
            label = rev_rel[val].upper()
            head_ent = ent_map[head]
            head_span = dict(start=head_ent.start_char, end=head_ent.end_char,
                             token_start=head_ent.start, token_end=head_ent.end, label=head_ent.label_)
            child_ent = ent_map[child]
            child_span = dict(start=child_ent.start_char, end=child_ent.end_char,
                              token_start=child_ent.start, token_end=child_ent.end, label=child_ent.label_)
            eg["relations"].append({"head": head, "head_span": head_span,
                                    "child": child, "child_span": child_span,
                                    "label": label, "color": color_map[label]})
        yield eg


@prodigy.recipe(
    "ref-rel",
    dataset=("Dataset to save answers to", "positional", None, str),
    source=("Source texts", "positional", None, str)
)
def custom_dep_recipe(dataset, source):
    stream = get_stream(
        source, None, None, rehash=True, dedup=True, input_key="text", is_binary=False
    )
    # stream = add_tokens(spacy.blank("it"), stream)  # data comes from an existing dataset with tokens
    stream = add_relations_to_stream(stream) # add custom relations

    return {
        "dataset": dataset,      # dataset to save annotations to
        "stream": stream,        # the incoming stream of examples
        "view_id": "relations",  # annotation interface to use
        "config": {
            "labels": ["PARENT", "SIBLING", "SAME_AS"],  # labels to annotate
            "span-labels": ["J-REF", "L-REF"]
        }
        
    }

If I comment out stream = add_relations_to_stream(stream), everything works fine (except... I don't get my relations); otherwise it seems to loop through the whole existing dataset instead of yielding one example at a time.

As a matter of fact, if I place a print(eg) right below the yield eg in add_relations_to_stream, it starts printing every example.

Is it possible that somewhere the stream attribute that is returned gets converted to a list?

Or am I missing something very obvious?

I am running this with:

prodigy ref-rel mydataset_ref_rel dataset:mydataset_ref -F ref-rel.py

Thanks!

Specs:

macOS Monterey
Python==3.9.5
prodigy==1.11.7
spacy==3.2.2

I don't have access to your custom code, so it's hard for me to know for sure what's happening, but I was able to run this variant of your code just now:

import prodigy
import spacy
from prodigy.types import StreamType
from prodigy.components.loaders import get_stream
from prodigy.components.preprocess import add_tokens

color_map = {
    "SIBLING": "#ffd882",
    "PARENT": "#c5bdf4",
    "SAME_AS": "#d9fbad"
}


nlp = spacy.load("en_core_web_md")


def add_relations_to_stream(stream) -> StreamType:
    for i, eg in enumerate(stream):
        doc = nlp(eg["text"])
        eg["relations"] = []
        eg["spans"] = []
        for ent in doc.ents:
            span = dict(start=ent.start_char, end=ent.end_char,
                        token_start=ent.start, token_end=ent.end, label=ent.label_)
            eg["spans"].append(span)
        print(i)
        yield eg


@prodigy.recipe(
    "ref-rel",
    dataset=("Dataset to save answers to", "positional", None, str),
    source=("Source texts", "positional", None, str)
)
def custom_dep_recipe(dataset, source):
    stream = get_stream(
        source, None, None, rehash=True, dedup=True, input_key="text", is_binary=False
    )
    stream = add_relations_to_stream(stream) # add custom relations
    stream = add_tokens(nlp, stream, skip=True)

    return {
        "dataset": dataset,      # dataset to save annotations to
        "stream": stream,        # the incoming stream of examples
        "view_id": "relations",  # annotation interface to use
        "config": {
            "labels": ["PARENT", "SIBLING", "SAME_AS"],  # labels to annotate
            "span-labels": ["J-REF", "L-REF"]
        }
        
    }

When I run this, this is the output I see:

> python -m prodigy ref-rel ref-rel-demo examples.jsonl -F demorecipe.py
0

✨  Starting the web server at http://localhost:8080 ...
Open the app in your browser and start annotating!

# After opening the browser this appears:
1
2
3
4
5
6
7
8
9

My examples.jsonl file has 56 examples, but I only see it enumerate up to #9. That's because my configuration has a batch size of 10. Before diving in further, can you confirm that your batch size is not the culprit here?
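To illustrate with plain Python (this is a simplified sketch, not Prodigy's actual feed code): a generator is only consumed up to the batch that has been requested, so a counter that stops at 9 is exactly what you'd expect with a batch size of 10:

```python
from itertools import islice

def stream():
    for i in range(56):            # 56 examples, like my examples.jsonl
        print(i)                   # mirrors the print(i) in the recipe above
        yield {"text": f"example {i}"}

gen = stream()
batch = list(islice(gen, 10))      # the server requests one batch of 10
# only 0..9 are printed; the remaining 46 examples stay untouched in the generator
```

The rest of the stream is only pulled when the next batch is requested.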

The get_stream function call will internally use a list of items, not a generator, because you're using the dataset:<name> syntax. But that's only internal: because of the way we filter, the final output should become a StreamType again. Also, your add_relations_to_stream is a generator, so I'm a little surprised that it would loop over all the items. Could you verify with enumerate that it's really looping over all of them?

Thanks!

Well...

Related gif

:sweat_smile:

What I pasted above is all of my custom recipe.

This is my .prodigy.json

{
  "theme": "basic",
  "custom_theme": {
    "cardMaxWidth": "95%",
    "smallText": 16,
    "relationHeightWrap": 40
  },
  "db": "postgresql",
  "db_settings": {
    "postgresql": {

    }
  },
  "buttons": ["accept", "reject", "ignore", "undo"],
  "batch_size": 10,
  "history_size": 10,
  "port": 8080,
  "host": "localhost",
  "cors": true,
  "validate": true,
  "auto_exclude_current": true,
  "instant_submit": false,
  "feed_overlap": false,
  "auto_count_stream": false,
  "total_examples_target": 0,
  "ui_lang": "en",
  "project_info": ["dataset", "session", "lang", "recipe_name", "view_id", "label"],
  "show_stats": false,
  "hide_meta": false,
  "show_flag": false,
  "instructions": false,
  "swipe": false,
  "swipe_gestures": { "left": "accept", "right": "reject" },
  "split_sents_threshold": false,
  "html_template": false,
  "global_css": null,
  "javascript": null,
  "writing_dir": "ltr",
  "show_whitespace": false
}

So yes, batch_size=10

I hope I am not missing something obvious again!

I must stress: don't worry about "obvious cases" :slight_smile: . I'm here to help and, in fact, I'm a bit unsure what's happening at the moment, so this certainly falls into the "non-obvious" category. I might also argue that your issue with the Brave browser wasn't obvious either :sweat_smile:

I am now wondering about something else. Do you already have labelled examples in your dataset? Prodigy has a mechanism that hashes the input and the task in order to prevent you from labelling duplicates. Is it possible that you already labelled the first "n" examples and that it's skipping ahead to find an example it hasn't seen yet?
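As a rough sketch of that mechanism (using hashlib purely for illustration; Prodigy's actual implementation uses its own hashing functions and rules):

```python
import hashlib
import json

def task_hash(eg):
    # hash the input text together with its annotations
    payload = json.dumps(
        {"text": eg["text"], "spans": eg.get("spans", [])}, sort_keys=True
    )
    return hashlib.md5(payload.encode("utf8")).hexdigest()

# hashes of tasks that were already answered in the dataset
seen = {task_hash({"text": "a sentence", "spans": []})}

def filter_seen(stream):
    for eg in stream:
        if task_hash(eg) not in seen:   # skip tasks already annotated
            yield eg

remaining = list(filter_seen([
    {"text": "a sentence", "spans": []},   # already labelled -> filtered out
    {"text": "another sentence"},          # new -> served
]))
```

If many of the first examples hash to something already in the dataset, the feed has to churn through them before it can serve anything new.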

Also, can you confirm what happens when you remove your add_relations_to_stream function? It seems like a normal generator, but I can imagine that the next() call is going to be much faster when you remove it, which might allow us to see where Prodigy stops iterating.

Thank you!

So, if I remove add_relations_to_stream everything works properly.

I tried a little something and it seems like you found the culprit.

  • The dataset I want to stream is already NER-annotated;
  • The model I load performs both NER and relation extraction;

If I disable NER, the stream works properly. That said, it would be nice if there were a workaround for this. Any ideas?

So I went a step further and loaded a dataset without annotations. Now everything works as expected, except for one thing: I cannot annotate spans!

Maybe I'm missing something in the config/return?

return {
        "dataset": dataset,      # dataset to save annotations to
        "stream": stream,        # the incoming stream of examples
        "view_id": "relations",  # annotation interface to use
        "config": {
            "labels": ["PARENT", "SIBLING", "SAME_AS"],  # labels to annotate
            "span-labels": ["J-REF", "L-REF"]
        }  
}

Thanks again!

Could you try changing the span-labels key to relations_span_labels in your config there?
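For reference, that would be the same return block with just the key renamed:

```python
return {
    "dataset": dataset,      # dataset to save annotations to
    "stream": stream,        # the incoming stream of examples
    "view_id": "relations",  # annotation interface to use
    "config": {
        "labels": ["PARENT", "SIBLING", "SAME_AS"],       # relation labels
        "relations_span_labels": ["J-REF", "L-REF"],      # enables span annotation in the relations UI
    },
}
```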

Brilliant, thank you!