Labelling dataset for extractive text summarization

Hi,

I would like to label a dataset for extractive summarization, where the annotation jsonl file would look like this:

{"document": ["line 1", "summary line 2", "line 3"], "meta": {...}}
{"document": ["summary x 1", "x 2", "x 3", "summary x 4"], "meta": {...}}

I am imagining an interface with a checkbox beside each sentence, where a ticked checkbox indicates a positive label and an unticked one indicates a negative label. Ultimately, I would like the annotated output to look like this:

{"document": ["line 1", "summary line 2", "line 3"], "labels": [0, 1, 0], "meta": {...}}
{"document": ["summary x 1", "x 2", "x 3", "summary x 4"], "labels": [1, 0, 0, 1], "meta": {...}}

Q1) What would be the easiest way to do this in Prodigy (v1.10.4)?

Q2) Would it be possible to do active learning and have an underlying model make predictions for each sentence to speed up labelling?

Q3) Would it be possible to do uncertainty sampling to pick out good samples for labelling?

Thank you! :slight_smile:

I was actually able to figure out Q1, but I'm still unsure about Q2 and Q3.

Here is the recipe.py file:

import prodigy
from prodigy.components.loaders import JSONL

@prodigy.recipe(
    "extsumm",
    dataset=("The dataset to save to", "positional", None, str),
    file_path=("Path to texts", "positional", None, str),
)
def extsumm(dataset, file_path):
    """Annotate sentences of a document to be included in extractive summary or not."""
    stream = JSONL(file_path)     # load in the JSONL file

    return {
        "dataset": dataset,   # save annotations in this dataset
        "view_id": "choice",  # use the choice interface
        "stream": stream,
        'config': {'choice_style': 'multiple'},
    }

The annotation file - ext_summ_annotation.jsonl - looks like this:

{"text": "Tick imp sentences", "options": [{"id": 0, "text": "line 1"}, {"id": 1, "text": "summary line 2"}, {"id": 2, "text": "line 3"}]}
{"text": "Tick imp sentences", "options": [{"id": 0, "text": "summary x 1"}, {"id": 1, "text": "x 2"}, {"id": 2, "text": "summary x 3"}, {"id": 3, "text": "x 4"}]}

You can run it using: prodigy extsumm extsumm_dataset ext_summ_annotation.jsonl -F recipe.py
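In case it helps anyone starting from the original {"document": [...], "meta": {...}} format, here is a minimal sketch of how that file could be converted into the choice format above. The function names and the default prompt text are just illustrative choices, not anything Prodigy provides.

```python
import json


def to_choice_task(record, prompt="Tick imp sentences"):
    """Turn {"document": [...], "meta": {...}} into a task for the choice interface."""
    # Each sentence becomes one option; its position in the document is its ID.
    options = [{"id": i, "text": line} for i, line in enumerate(record["document"])]
    task = {"text": prompt, "options": options}
    if "meta" in record:
        task["meta"] = record["meta"]
    return task


def convert_file(src_path, dst_path):
    """Read one JSON record per line and write one choice task per line."""
    with open(src_path, encoding="utf8") as src, open(dst_path, "w", encoding="utf8") as dst:
        for line in src:
            if line.strip():  # skip blank lines
                dst.write(json.dumps(to_choice_task(json.loads(line))) + "\n")


example = to_choice_task({"document": ["line 1", "summary line 2", "line 3"], "meta": {}})
print(json.dumps(example))
```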

(I wrote this comment before I saw your second post – looks like you're definitely on the right track then :+1:)

Hi! The most straightforward and out-of-the-box solution would probably be to use the choice interface and make each of your lines an option. So the data you load in could look like this:

{
    "options": [{"id": 0, "text": "line 1"}, {"id": 1, "text": "summary line 2"}, {"id": 2, "text": "line 3"}],
    "meta": {}
}

When you annotate the data, the IDs of the accepted options will be stored in a key "accept", e.g. "accept": [0, 2]. Everything in "meta" will be displayed in the bottom right corner of the annotation task, but you can also add other custom properties to attach meta info that won't be visible during annotation and will just be passed through with the data.
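To get from the stored "accept" list to the "labels" array asked for in the original question, something along these lines should work on the exported examples. This is only a sketch: to_labels is a made-up helper name, and it assumes each exported example carries the "options" and "accept" keys described above.

```python
def to_labels(annotated):
    """Convert one annotated choice task into the document/labels format."""
    accepted = set(annotated.get("accept", []))
    # Keep the sentences in their original order and mark accepted ones with 1.
    document = [opt["text"] for opt in annotated["options"]]
    labels = [1 if opt["id"] in accepted else 0 for opt in annotated["options"]]
    return {"document": document, "labels": labels, "meta": annotated.get("meta", {})}


annotated = {
    "options": [
        {"id": 0, "text": "line 1"},
        {"id": 1, "text": "summary line 2"},
        {"id": 2, "text": "line 3"},
    ],
    "accept": [1],
    "answer": "accept",
}
result = to_labels(annotated)
print(result["labels"])  # [0, 1, 0]
```

You would probably want to skip examples whose "answer" is not "accept" (e.g. ignored tasks) before converting them, since their selections may not be meaningful.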

In the data you stream in, you can pre-define annotations – for example, in this case, you can stream in data with a list of pre-selected options in "accept", predicted by your model. Here's a dummy example (the specifics obviously depend on your model etc.):

def get_stream():
    for document in your_documents:
        options = [{"id": i, "text": line} for i, line in enumerate(document)]
        # Model goes here and generates list of selected IDs
        selected = predict_selected_lines(document)
        eg = {"options": options, "accept": selected}
        yield eg

If you also want to update the model in the loop, your recipe can implement an update callback that receives batches of annotated examples as they come in from the app and lets you use them to update the model (see https://prodi.gy/docs/custom-recipes#update). This step might take some more experimentation, though, because you typically need your model to be sensitive enough to small batches of updates but not too sensitive either (which is a slightly unusual design).
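As a rough sketch of what such an update callback might look like: ToySentenceModel and its train_on_batch method are hypothetical stand-ins for whatever sentence classifier you use, and the only assumption about the incoming data is that each answer carries "options", "accept" and "answer" as described above.

```python
class ToySentenceModel:
    """Hypothetical stand-in for a real sentence classifier."""

    def __init__(self):
        self.seen = 0

    def train_on_batch(self, sentences, labels):
        # A real model would take a gradient step here; this one only counts.
        self.seen += len(sentences)


def make_update(model):
    """Build an update callback that trains the model on incoming answers."""

    def update(answers):
        sentences, labels = [], []
        for eg in answers:
            if eg.get("answer") != "accept":
                continue  # skip rejected or ignored tasks
            accepted = set(eg.get("accept", []))
            for opt in eg["options"]:
                sentences.append(opt["text"])
                labels.append(1 if opt["id"] in accepted else 0)
        if sentences:
            model.train_on_batch(sentences, labels)

    return update


model = ToySentenceModel()
update = make_update(model)
update([
    {
        "options": [{"id": 0, "text": "line 1"}, {"id": 1, "text": "summary line 2"}, {"id": 2, "text": "line 3"}],
        "accept": [1],
        "answer": "accept",
    }
])
```

In the recipe, the callback would then be returned under the "update" key, next to "dataset", "stream" and "view_id".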

Yes, uncertainty sampling is also something you could easily implement in the logic that generates the stream and yields out examples for annotation. You'd just need to decide on the heuristic and what you'd consider uncertain in your specific case. So assuming you've calculated a score for each document, you could do something like this and only send it out if the score falls within a given range around 0.5:

if 0.35 <= score <= 0.65:
    yield eg

Prodigy also ships with built-in helpers that take a stream of (score, example) tuples and yield selected examples, e.g. prefer_uncertain for uncertainty sampling (see https://prodi.gy/docs/api-components#sorters). Instead of just a threshold, they use an exponential moving average to prevent the stream from getting stuck in a suboptimal state. This is especially useful if you're also updating the model in the loop, so you don't run out of suggestions when the predictions change and the model starts producing higher or lower scores overall.
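To make the (score, example) idea a bit more concrete, here is a small sketch of one possible document-level heuristic: score each document by its most uncertain sentence, then apply the simple range filter. The scoring rule, the function names and the fake_predict model are all assumptions for illustration, and this is plain Python rather than Prodigy's own sorter.

```python
def most_uncertain(probs):
    """Return the probability closest to 0.5, i.e. the least confident prediction."""
    return min(probs, key=lambda p: abs(p - 0.5))


def scored_stream(examples, predict_probs):
    """Yield (score, example) tuples; predict_probs is a hypothetical model call."""
    for eg in examples:
        probs = predict_probs([opt["text"] for opt in eg["options"]])
        if probs:  # skip documents without sentences
            yield most_uncertain(probs), eg


def filter_uncertain(scored, low=0.35, high=0.65):
    """Only yield examples whose score falls inside the uncertain range."""
    for score, eg in scored:
        if low <= score <= high:
            yield eg


# Fake per-sentence probabilities standing in for a real model.
FAKE_PROBS = {"line 1": 0.05, "summary line 2": 0.95, "line 3": 0.5, "summary x 1": 0.9, "x 2": 0.1}


def fake_predict(sentences):
    return [FAKE_PROBS.get(s, 0.0) for s in sentences]


examples = [
    {"options": [{"id": 0, "text": "line 1"}, {"id": 1, "text": "summary line 2"}, {"id": 2, "text": "line 3"}]},
    {"options": [{"id": 0, "text": "summary x 1"}, {"id": 1, "text": "x 2"}]},
]
selected = list(filter_uncertain(scored_stream(examples, fake_predict)))
print(len(selected))  # 1
```

Because scored_stream already yields (score, example) tuples, its output could be handed to a sorter like prefer_uncertain instead of filter_uncertain inside a recipe.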


Hi Ines,

I am not able to resume annotation from the samples that were not yet annotated after I close the session. I have read through the forum and used exclude, but it's not working for me.

This is what I tried. I modified the input file to have an id field for each example:

{
    "id": 0, # CHANGE
    "options": [{"id": 0, "text": "line 1"}, {"id": 1, "text": "summary line 2"}, {"id": 2, "text": "line 3"}],
    "meta": {}
}

Then I modified the recipe:

import prodigy
from prodigy.components.loaders import JSONL
from prodigy import set_hashes # CHANGE

@prodigy.recipe(
    "extsumm",
    dataset=("The dataset to save to", "positional", None, str),
    file_path=("Path to texts", "positional", None, str),
)
def extsumm(dataset, file_path):
    """Annotate sentences of a document to be included in extractive summary or not."""

    def get_stream():
        stream = JSONL(file_path)  # load in the JSONL file
        for eg in stream:
            eg['text'] = "Tick messages to be included in summary"
            eg = set_hashes(eg, input_keys=("id")) # CHANGE
            yield eg

    return {
        "dataset": dataset,   # save annotations in this dataset
        "view_id": "choice",  # use the choice interface
        "stream": get_stream(),
        'config': {'choice_style': 'multiple'},
        'exclude': [dataset] # datasets to exclude # CHANGE
    }

I checked the input and task hash and it is the same for the same sample but still, the annotation interface starts off from the very beginning instead of skipping the samples already annotated.

What else I tried:

  • just having the exclude without setting hash
  • not adding in the id field for each sample

None of these worked. Please let me know.

Okay, so you've definitely verified that the _task_hash that's generated for an incoming example is the same task hash that's already present in your dataset?
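One quick way to check this without the app is to compare the hashes directly. The sketch below assumes you have the rows from db-out and the examples produced by your stream (after set_hashes) as plain lists of dicts; already_annotated is just an illustrative helper name.

```python
def already_annotated(stream_examples, dataset_examples):
    """Return the incoming examples whose _task_hash is already in the dataset."""
    seen = {eg["_task_hash"] for eg in dataset_examples if "_task_hash" in eg}
    return [eg for eg in stream_examples if eg.get("_task_hash") in seen]


# Illustrative values only; real hashes come from set_hashes and db-out.
dataset_rows = [{"id": 0, "_input_hash": -54856242, "_task_hash": 599202434}]
incoming = [
    {"id": 0, "_input_hash": -54856242, "_task_hash": 599202434},
    {"id": 1, "_input_hash": 111, "_task_hash": 222},
]
done = already_annotated(incoming, dataset_rows)
print([eg["id"] for eg in done])  # [0]
```

If an example you have already annotated does not show up in the result, the hashes differ and the exclusion logic has nothing to match on.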

Setting 'exclude': [dataset] in your recipe shouldn't be necessary, since this is the default behaviour.

Also, another thing I noticed in your code: input_keys=("id") should be input_keys=("id",) or input_keys=["id"] – otherwise, the argument will be interpreted as a string instead of a list of keys.
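This is a general Python gotcha rather than anything specific to Prodigy: parentheses alone don't create a tuple, the trailing comma does. A tiny illustration:

```python
not_a_tuple = ("id")   # just the string "id" in parentheses
a_tuple = ("id",)      # the trailing comma makes it a one-element tuple

print(type(not_a_tuple).__name__)  # str
print(type(a_tuple).__name__)      # tuple

# Iterating over the string yields its characters, not the key name.
print(list(not_a_tuple))  # ['i', 'd']
print(list(a_tuple))      # ['id']
```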


I changed this line in the recipe I posted:

eg = set_hashes(eg, input_keys=["id"])

I still have the issue where the server starts from the first sample.

Here are the minimum steps to reproduce the problem:

This is my test.jsonl file:

{"id": 0, "text": "Tick imp sentences", "options": [{"id": 0, "text": "line 1"}, {"id": 1, "text": "summary line 2"}, {"id": 2, "text": "line 3"}]}
{"id": 1, "text": "Tick imp sentences", "options": [{"id": 0, "text": "summary x 1"}, {"id": 1, "text": "x 2"}, {"id": 2, "text": "summary x 3"}, {"id": 3, "text": "x 4"}]}

Then I run prodigy extsumm extsumm_dataset test.jsonl -F recipe.py, label 1 sample, close the server and restart it. However, it still starts off from the first sample.

Here is the output from prodigy db-out extsumm_dataset

{"id":0,"text":"Tick messages to be included in summary","options":[{"id":0,"text":"line 1"},{"id":1,"text":"summary line 2"},{"id":2,"text":"line 3"}],"_input_hash":-54856242,"_task_hash":599202434,"_session_id":null,"_view_id":"choice","config":{"choice_style":"multiple"},"accept":[1],"answer":"accept"}
{"id":0,"text":"Tick messages to be included in summary","options":[{"id":0,"text":"line 1"},{"id":1,"text":"summary line 2"},{"id":2,"text":"line 3"}],"_input_hash":-54856242,"_task_hash":599202434,"_session_id":null,"_view_id":"choice","config":{"choice_style":"multiple"},"accept":[1],"answer":"accept"}

Some screenshots with Prodigy basic logging:

1st time I start server & annotate 1 sample:

Closed server and 2nd time when I start & annotate (starts off from first sample again):

@salman1993 thanks for the detailed write-up, it helped me reproduce your problem!

The trouble seems to be that you're using an overlapping feed, which excludes items based on your session name. Although this setting is meant to support multiple named annotators, you're opening the browser without a session. Consider the two URLs:

  • http://localhost:8080 opens the browser using the default session, which is generated from the date and time at which the server starts. When using this with named sessions, you effectively get a new session each time you restart the server. This is why you keep seeing the same questions.
  • http://localhost:8080/?session=my_name opens the browser with a fixed session name, so that restarting the server doesn't cause the annotations to start from the beginning.

So you can either use a named session for your annotations, or disable the overlapping feed and get the behavior you want with the first URL. To do that, set feed_overlap to False in your recipe:

import prodigy
from prodigy.components.loaders import JSONL
from prodigy.util import set_hashes


@prodigy.recipe(
    "extsumm",
    dataset=("The dataset to save to", "positional", None, str),
    file_path=("Path to texts", "positional", None, str),
)
def extsumm(dataset, file_path):
    """Annotate sentences of a document to be included in extractive summary
    or not."""

    def get_stream():
        stream = JSONL(file_path)  # load in the JSONL file
        for eg in stream:
            eg["text"] = "Tick messages to be included in summary"
            eg = set_hashes(eg, input_keys=("id",))  # hash on the "id" key only
            yield eg

    return {
        "dataset": dataset,  # save annotations in this dataset
        "view_id": "choice",  # use the choice interface
        "stream": get_stream(),
        "config": {
            "choice_style": "multiple",
            "feed_overlap": False,
        },
    }