Annotating an extractive QA dataset à la SQuAD

Hi all,

I'm trying to use Prodigy to create a dataset to train an extractive Question Answering system. As in SQuAD, each sample contains a question, a context (i.e. a paragraph in natural language, related to the question and likely to contain a possible answer) and the specific answer, extracted directly from the context. Notice that the answer is a substring of the context.

I have already collected a set of pairs of questions and contexts. I need the human annotators to select the span of the context that corresponds with the answer, if any.

So, my initial idea was to combine the functionality of ner.manual to tokenize the context and select the span of the answer along with some kind of custom HTML view to show for each sample both the question and the context. Is there any better approach? Any hints to tackle this? Thanks in advance,

EDIT: I've found this custom question answering recipe but it doesn't fit my needs.

Hi! This is probably quite similar to what I would have suggested :slightly_smiling_face: So if I understand the data correctly, the only additional piece of information you need to display is the question, right? And the context and answer are already covered by the text plus pre-highlighted span you'd use in the ner_manual interface?

Assuming your incoming tasks look like this:

{"question": "...", "text": "...", "tokens": [...], "spans": [...]}
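A task in this shape can be built from a question, a context and an optional known answer. This is only a minimal sketch: `make_task` and the default label `answer_span` are illustrative names, only the first occurrence of the answer is highlighted, and the `"tokens"` are left for `add_tokens` to fill in inside the recipe.

```python
import json


def make_task(question, context, answer=None, label="answer_span"):
    """Build one task dict; a span is added only if the answer occurs in the context."""
    task = {"question": question, "text": context, "spans": []}
    if answer:
        start = context.find(answer)  # character offset of the first occurrence
        if start != -1:
            task["spans"].append(
                {"start": start, "end": start + len(answer), "label": label}
            )
    return task


task = make_task("What color is the sky?", "On a clear day the sky is blue.", "blue")
print(json.dumps(task))  # one line of the JSONL source file
```

If no answer is known yet, calling `make_task(question, context)` simply leaves `"spans"` empty and the annotator selects the span by hand.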

You could then use some custom JavaScript to add the "question" (available as window.prodigy.content.question in this case) before the main content (the highlightable text with the pre-highlighted answer).

document.addEventListener('prodigyupdate', event => {
    const container = document.querySelector('.prodigy-content');
    // e.g. insert window.prodigy.content.question before the main content
});

On a related note, this upcoming feature will make these combined interfaces a lot easier. You could then just have two blocks: HTML and the manual NER interface.
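A rough sketch of what the two-block setup might look like in a recipe's return value. It assumes the feature is exposed as a `"blocks"` view with a `"blocks"` list in the config, and that the HTML block accepts an `"html_template"` with `{{question}}` as a placeholder. Treat these names as assumptions and check them against the docs for your Prodigy version. `blocks_components` is a hypothetical helper.

```python
def blocks_components(dataset, stream, labels):
    """Return recipe components combining an HTML block with the manual NER block."""
    return {
        "view_id": "blocks",  # assumed name of the combined interface
        "dataset": dataset,
        "stream": stream,
        "config": {
            "labels": labels,
            "blocks": [
                # first block: render the question from each task
                {"view_id": "html", "html_template": "<h3>{{question}}</h3>"},
                # second block: highlightable context with the answer span
                {"view_id": "ner_manual"},
            ],
        },
    }


components = blocks_components("qa_dataset", [], ["answer_span"])
```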

Thanks @ines. As you suggest, I'll try to modify the Javascript to display the question :slight_smile:

BTW, I saw your tweet and the new feature looks cool. Looking forward to using it.

Hi, again!

I found a solution to show both the question and the context, and to be able to select the answer span.



Just add this custom Javascript to your ~/.prodigy/prodigy.json file:

"javascript": "document.addEventListener('prodigyupdate', event => {const container = document.querySelector('.prodigy-title'); container.innerHTML=window.prodigy.content.question; });document.addEventListener('prodigymount', event => {const container = document.querySelector('.prodigy-title'); container.innerHTML=window.prodigy.content.question; })"

And my custom recipe looks like:

import prodigy
import spacy
from prodigy.components.loaders import JSONL
from prodigy.components.preprocess import add_tokens
from prodigy.util import split_string


@prodigy.recipe(
    "qa",
    dataset=("The dataset to use", "positional", None, str),
    spacy_model=("The base model", "positional", None, str),
    source=("The source data as a JSONL file", "positional", None, str),
    label=("One or more comma-separated labels", "option", "l", split_string),
)
def qa(dataset, spacy_model, source, label="answer_span"):
    """Custom recipe to annotate a dataset to train and evaluate an extractive Question Answering system"""

    # load the source dataset, made of samples containing question and text pairs
    stream = JSONL(source)
    # load the spaCy model
    nlp = spacy.load(spacy_model)
    # and tokenize the text
    stream = add_tokens(nlp, stream)

    return {
        "view_id": "ner_manual",
        "dataset": dataset,
        "stream": stream,
        "config": {"lang": nlp.lang, "label": label, "labels": label},
    }

Thanks a lot!
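P.S. For anyone who wants to go from the saved annotations back to something SQuAD-like: a minimal sketch, assuming each saved example keeps the `"question"` and `"text"` from the source and carries an `"answer"` of `"accept"` plus character-offset `"spans"`. The flat output layout (`question` / `context` / `answers`) is my own choice modeled on common SQuAD loaders, not something Prodigy produces, and `to_squad` is a hypothetical helper.

```python
def to_squad(example):
    """Convert one annotated example into a flat SQuAD-like record, or None if not accepted."""
    if example.get("answer") != "accept":
        return None
    text = example["text"]
    spans = example.get("spans", [])
    return {
        "question": example["question"],
        "context": text,
        "answers": {
            # the answer text is recovered from the context by character offsets
            "text": [text[s["start"]:s["end"]] for s in spans],
            "answer_start": [s["start"] for s in spans],
        },
    }


annotated = {
    "question": "What color is the sky?",
    "text": "The sky is blue.",
    "spans": [{"start": 11, "end": 15, "label": "answer_span"}],
    "answer": "accept",
}
record = to_squad(annotated)
```

An accepted example with no spans comes out with empty answer lists, which can be used to mark questions that have no answer in the context.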

@ines Please let me know if I can send you a PR to add this recipe

@vitojph Thanks and yes, that'd be cool! You can add the "javascript" to the "config" returned by the recipe – this way, everything can fit into one file.
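Roughly like this. It is a sketch that reuses the selector and events from the snippet above; `SHOW_QUESTION_JS` and `build_config` are just illustrative names:

```python
# Same logic as the prodigy.json snippet, kept as a string next to the recipe.
SHOW_QUESTION_JS = """
function showQuestion() {
    const container = document.querySelector('.prodigy-title');
    container.innerHTML = window.prodigy.content.question;
}
document.addEventListener('prodigymount', showQuestion);
document.addEventListener('prodigyupdate', showQuestion);
"""


def build_config(lang, labels):
    """Build the "config" dict returned by the recipe, including the custom JavaScript."""
    return {"lang": lang, "labels": labels, "javascript": SHOW_QUESTION_JS}


config = build_config("en", ["answer_span"])
```

The recipe would then return `"config": build_config(nlp.lang, label)`, and nothing needs to be added to prodigy.json.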

@ines I don't seem to have permissions to push new branches to the repo. Can you please give me access? My GitHub account is vitojph.

@vitojph Just fork the repo and submit a PR from your fork :slightly_smiling_face:

Hi @vitojph and @ines

My name is Adrian, and I'm working on a SQuAD Q&A project at Stanford.

I have seen that this recipe is available in the prodigy-recipes repo.
Could you confirm the JSON schema that this recipe supports?

The function documentation states:

Annotate question/answer pairs with a custom HTML interface. Expects an
input file with records that look like this:
{"question": "What color is the sky?", "question_answer": "blue"}

Where is the context?
Should I use the key "text"?
Shouldn't it be like:
{"question": "What color is the sky?", "text": <paragraph>, "question_answer": "blue"}

Thanks in advance.

p.s. why are the "others" recipes not part of the basic installation?


It would be great to have the recipe instructions in the official documentation website too :slight_smile:


The example recipe here is intended to do a binary review of question/answer pairs without context, but you could easily include the context as well by modifying the code and including the context in the HTML template: prodigy-recipes/ at master · explosion/prodigy-recipes · GitHub

Alternatively, the example recipe shared above takes a slightly different approach.

The prodigy-recipes repo is a collection of templates for developing custom recipes with Prodigy. It includes simplified versions of some of the built-in recipes that can be adapted, as well as examples of different use cases and contributions from the community. None of these recipes are included verbatim in the default installation – they're templates for developing your own, or specific examples that are not general-purpose enough to be useful as a built-in recipe, but still interesting.

P.S.: It's not really helpful to bump a thread multiple times within a short time frame, especially not with just single emojis. This makes it a lot harder for us (and other readers) to follow new posts.

thanks! but one more emoji