Annotating an extractive QA dataset à la SQuAD

Hi all,

I'm trying to use Prodigy to create a dataset for training an extractive Question Answering system. As in SQuAD, each sample contains a question, a context (i.e. a paragraph in natural language that is related to the question and likely to contain an answer) and the specific answer, extracted directly from the context. Note that the answer is a substring of the context.

I have already collected a set of question and context pairs. I need the human annotators to select the span of the context that corresponds to the answer, if any.

So, my initial idea was to combine the functionality of ner.manual, to tokenize the context and select the span of the answer, with some kind of custom HTML view that shows both the question and the context for each sample. Is there any better approach? Any hints on how to tackle this? Thanks in advance.

EDIT: I've found this custom question answering recipe but it doesn't fit my needs.

Hi! This is probably quite similar to what I would have suggested :slightly_smiling_face: So if I understand the data correctly, the only additional piece of information you need to display is the question, right? And the context and answer are already covered by the text plus pre-highlighted span you'd use in the ner_manual interface?

Assuming your incoming tasks look like this:

{"question": "...", "text": "...", "tokens": [...], "spans": [...]}

You could then use some custom JavaScript to add the "question" (available as window.prodigy.content.question in this case) before the main content (the highlightable text with the pre-highlighted answer).

document.addEventListener('prodigyupdate', event => {
    const container = document.querySelector('.prodigy-content');
    container.prepend(window.prodigy.content.question)
})
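For anyone preparing the input file, here is a minimal sketch of how tasks in the shape described above could be built from question/context pairs. The helper name `make_task` and the label `answer_span` are assumptions for illustration; the `"tokens"` key is left out on the assumption that the recipe tokenizes the text later.

```python
import json


def make_task(question, context, answer=None, label="answer_span"):
    """Build one task dict; pre-highlight the answer if it occurs in the context."""
    task = {"question": question, "text": context, "spans": []}
    if answer:
        # Only the first occurrence is highlighted; the annotator can correct it.
        start = context.find(answer)
        if start != -1:
            task["spans"].append(
                {"start": start, "end": start + len(answer), "label": label}
            )
    return task


task = make_task(
    "Where is the Eiffel Tower?",
    "The Eiffel Tower is a landmark in Paris, France.",
    answer="Paris",
)
# One JSON object per line is the JSONL format the source file would use.
print(json.dumps(task, ensure_ascii=False))
```

If no answer is known yet, `"spans"` simply stays empty and the annotator selects the span from scratch.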

On a related note, this upcoming feature will make these combined interfaces a lot easier. You could then just have two blocks: HTML and the manual NER interface.

Thanks @ines. As you suggest, I'll try to modify the JavaScript to display the question :slight_smile:

BTW, I saw your tweet and the new feature looks cool. Looking forward to using it.

Hi, again!

I found a solution that shows both the question and the context and lets the annotator select the answer span.

Just add this custom JavaScript to your ~/.prodigy/prodigy.json file:

{
"javascript": "document.addEventListener('prodigyupdate', event => {const container = document.querySelector('.prodigy-title'); container.innerHTML=window.prodigy.content.question; });document.addEventListener('prodigymount', event => {const container = document.querySelector('.prodigy-title'); container.innerHTML=window.prodigy.content.question; })"
}

And my custom recipe looks like:

import prodigy
import spacy
from prodigy.components.loaders import JSONL
from prodigy.components.preprocess import add_tokens
from prodigy.util import split_string


@prodigy.recipe(
    "qa",
    dataset=("The dataset to use", "positional", None, str),
    spacy_model=("The base model", "positional", None, str),
    source=("The source data as a JSONL file", "positional", None, str),
    label=("One or more comma-separated labels", "option", "l", split_string),
)
def qa(dataset, spacy_model, source, label="answer_span"):
    """Custom recipe to annotate a dataset to train and evaluate an extractive Question Answering system"""

    # load the source dataset, made of samples containing question and text pairs
    stream = JSONL(source)
    # load the spaCy model
    nlp = spacy.load(spacy_model)
    # and tokenize the text
    stream = add_tokens(nlp, stream)

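    # "labels" should be a list; wrap a bare string in case the default was not split
    labels = [label] if isinstance(label, str) else list(label)

    return {
        "view_id": "ner_manual",
        "dataset": dataset,
        "stream": stream,
        "config": {"lang": nlp.lang, "label": label, "labels": labels},
    }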
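Once the annotations are exported, they still need to be turned into question/context/answer records for training. The following is only a sketch: it assumes each saved annotation keeps the `"question"` and `"text"` keys, carries an `"answer"` of `"accept"` when accepted, and stores spans with character offsets in `"start"` and `"end"`. The function name `to_qa_example` and the flat output layout (simpler than the nested SQuAD JSON) are assumptions.

```python
def to_qa_example(annotation):
    """Turn one accepted annotation into a question/context/answers record."""
    if annotation.get("answer") != "accept":
        return None  # skip rejected or ignored tasks
    text = annotation["text"]
    answers = [
        # The answer text is recovered as a substring of the context.
        {"text": text[span["start"]:span["end"]], "answer_start": span["start"]}
        for span in annotation.get("spans", [])
    ]
    return {"question": annotation["question"], "context": text, "answers": answers}


annotation = {
    "question": "Where is the Eiffel Tower?",
    "text": "The Eiffel Tower is a landmark in Paris, France.",
    "spans": [{"start": 34, "end": 39, "label": "answer_span"}],
    "answer": "accept",
}
print(to_qa_example(annotation))
```

An accepted task with no spans yields an empty `"answers"` list, which could be treated as an unanswerable question.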

Thanks a lot!

@ines Please let me know if I can send you a PR to https://github.com/explosion/prodigy-recipes to add this recipe

@vitojph Thanks and yes, that'd be cool! You can add the "javascript" to the "config" returned by the recipe – this way, everything can fit into one file.
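A rough sketch of what that could look like: the script from prodigy.json is kept as a Python string and placed under the `"javascript"` key of the recipe's `"config"`. The names `SHOW_QUESTION_JS` and `build_config` are hypothetical, and the script is the one from this thread with the duplicated handler factored into a function.

```python
# The same logic as the prodigy.json snippet, stored alongside the recipe.
SHOW_QUESTION_JS = """
function showQuestion() {
  const container = document.querySelector('.prodigy-title');
  if (container) { container.innerHTML = window.prodigy.content.question; }
}
document.addEventListener('prodigymount', showQuestion);
document.addEventListener('prodigyupdate', showQuestion);
"""


def build_config(lang, labels):
    """Hypothetical helper: assemble the "config" dict returned by the recipe."""
    return {"lang": lang, "labels": list(labels), "javascript": SHOW_QUESTION_JS}


config = build_config("en", ["answer_span"])
print(sorted(config))
```

In the recipe, the returned dictionary would then use this as its `"config"` value, so no separate edit to prodigy.json is needed.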

@ines I don't seem to have permissions to push new branches to the repo. Can you please give me access? My GitHub account is vitojph.

@vitojph Just fork the repo and submit a PR from your fork :slightly_smiling_face:
