Labelling dataset for extractive text summarization

@salman1993 thanks for the detailed write-up, it helped me reproduce your problem!

The trouble seems to be that you're using an overlapping feed, which excludes items based on your session name. That setting exists to support multiple named annotators, but you're opening the browser without a session name. Consider the two URLs:

  • http://localhost:8080 opens the browser using the default session, whose name is generated from the date and time at which the server starts. With named sessions enabled, you effectively get a new session each time you restart the server. This is why you keep seeing the same questions.
  • http://localhost:8080/?session=my_name opens the browser with a fixed session name, so restarting the server doesn't cause the annotations to start from the beginning.
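
If you go the named-session route, a minimal sketch like the following could generate one URL per annotator. The helper `session_url` and the annotator names are hypothetical and not part of Prodigy; the only thing taken from the URLs above is the `?session=` query parameter.

```python
from urllib.parse import urlencode

BASE_URL = "http://localhost:8080"  # address used in the URLs above


def session_url(name, base_url=BASE_URL):
    """Return the URL that opens the app under a fixed session name.

    Hypothetical helper for illustration only; not a Prodigy function.
    """
    # urlencode escapes characters that are not safe in a query string
    return f"{base_url}/?{urlencode({'session': name})}"


# Hypothetical annotator names
for annotator in ["alice", "bob"]:
    print(session_url(annotator))
```

Each annotator would then bookmark their own URL, so their session name stays the same across server restarts.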

So you can either use a named session for your annotations, or disable the overlapping feed and get the behavior you want by visiting the first URL. To disable it, set feed_overlap to False in your recipe's config:

import prodigy
from prodigy.components.loaders import JSONL
from prodigy.util import set_hashes


@prodigy.recipe(
    "extsumm",
    dataset=("The dataset to save to", "positional", None, str),
    file_path=("Path to texts", "positional", None, str),
)
def extsumm(dataset, file_path):
    """Annotate sentences of a document to be included in extractive summary
    or not."""

    def get_stream():
        stream = JSONL(file_path)  # load in the JSONL file
        for eg in stream:
            eg["text"] = "Tick messages to be included in summary"
            # CHANGE: hash on the "id" key; the trailing comma makes this a tuple
            eg = set_hashes(eg, input_keys=("id",))
            yield eg

    return {
        "dataset": dataset,  # save annotations in this dataset
        "view_id": "choice",  # use the choice interface
        "stream": get_stream(),
        "config": {
            "choice_style": "multiple",
            "feed_overlap": False,
        },
    }