Annotating an extractive QA dataset à la SQuAD

vitojph · October 21, 2019, 2:18pm

Hi all,

I'm trying to use Prodigy to create a dataset to train an extractive Question Answering system. As in SQuAD, each sample contains a question, a context (i.e. a paragraph in natural language, related to the question and likely to contain a possible answer) and the specific answer, extracted directly from the context. Note that the answer is a substring of the context.

I have already collected a set of pairs of questions and contexts. I need the human annotators to select the span of the context that corresponds with the answer, if any.

So, my initial idea was to combine the functionality of ner.manual (to tokenize the context and select the answer span) with some kind of custom HTML view that shows both the question and the context for each sample. Is there a better approach? Any hints on how to tackle this? Thanks in advance,

EDIT: I've found this custom question answering recipe but it doesn't fit my needs.

ines · October 22, 2019, 10:10am

Hi! This is probably quite similar to what I would have suggested. So if I understand the data correctly, the only additional piece of information you need to display is the question, right? And the context and answer are already covered by the text plus pre-highlighted span you'd use in the ner_manual interface?

Assuming your incoming tasks look like this:

{"question": "...", "text": "...", "tokens": [...], "spans": [...]}
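
For reference, a task in that shape could be built from a question/context pair roughly as follows. This is only a minimal sketch: the helper name `make_task`, its `answer` argument and the `ANSWER` label are illustrative assumptions, and the "tokens" key is left out on the assumption that a later tokenization step adds it.

```python
def make_task(question, context, answer=None, label="ANSWER"):
    """Build a task dict with an optional pre-highlighted answer span.

    Hypothetical helper: offsets are character-based, and the answer is
    assumed to be an exact substring of the context (as in SQuAD).
    """
    task = {"question": question, "text": context, "spans": []}
    if answer:
        start = context.find(answer)  # uses the first occurrence only
        if start != -1:
            task["spans"].append(
                {"start": start, "end": start + len(answer), "label": label}
            )
    return task


example = make_task(
    "What color is the sky?",
    "On a clear day the sky is blue.",
    answer="blue",
)
```

If the answer is not found in the context, the task simply has no pre-highlighted span and the annotator selects one from scratch.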

You could then use some custom JavaScript to add the "question" (available as window.prodigy.content.question in this case) before the main content (the highlightable text with the pre-highlighted answer).

document.addEventListener('prodigyupdate', event => {
    const container = document.querySelector('.prodigy-content');
    container.prepend(window.prodigy.content.question)
})

On a related note, this upcoming feature will make these combined interfaces a lot easier. You could then just have two blocks: HTML and the manual NER interface.

https://twitter.com/_inesmontani/status/1186275862527635456
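
With such a blocks interface, the components returned by a recipe might look roughly like the sketch below. It assumes a "blocks" view that combines an "html" block with an "ner_manual" block, and a Mustache-style `{{question}}` variable in the template. The function name `qa_blocks_components` and the default label are hypothetical, so check the exact keys against the Prodigy documentation for your version.

```python
def qa_blocks_components(dataset, stream, labels=("answer_span",)):
    """Return recipe components that show the question above the context."""
    return {
        "view_id": "blocks",
        "dataset": dataset,
        "stream": stream,
        "config": {
            "labels": list(labels),
            "blocks": [
                # Block 1: the question, rendered from the task's "question" key
                {"view_id": "html", "html_template": "<strong>{{question}}</strong>"},
                # Block 2: the tokenized context with manual span highlighting
                {"view_id": "ner_manual"},
            ],
        },
    }


components = qa_blocks_components("qa_dataset", iter([]))
```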

vitojph · October 23, 2019, 7:50am

Thanks @ines. As you suggest, I'll try to modify the JavaScript to display the question.

BTW, I saw your tweet and the new feature looks cool. Looking forward to using it.

vitojph · October 24, 2019, 9:25am

Hi, again!

I found a solution to show both the question and the context, and to be able to select the answer span.

[Screenshot, 2019-10-24 11-11-09]

[Screenshot, 2019-10-24 11-10-28]

Just add this custom JavaScript to your ~/.prodigy/prodigy.json file:

{
"javascript": "document.addEventListener('prodigyupdate', event => {const container = document.querySelector('.prodigy-title'); container.innerHTML=window.prodigy.content.question; });document.addEventListener('prodigymount', event => {const container = document.querySelector('.prodigy-title'); container.innerHTML=window.prodigy.content.question; })"
}

And my custom recipe looks like:

import prodigy
import spacy
from prodigy.components.loaders import JSONL
from prodigy.components.preprocess import add_tokens
from prodigy.util import split_string


@prodigy.recipe(
    "qa",
    dataset=("The dataset to use", "positional", None, str),
    spacy_model=("The base model", "positional", None, str),
    source=("The source data as a JSONL file", "positional", None, str),
    label=("One or more comma-separated labels", "option", "l", split_string),
)
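def qa(dataset, spacy_model, source, label="answer_span"):
    """Custom recipe to annotate a dataset to train and evaluate an extractive Question Answering system"""
    # split_string yields a list when --label is passed on the command line;
    # wrap the plain-string default so that "labels" is always a list
    if isinstance(label, str):
        label = [label]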

    # load the source dataset, made of samples containing question and text pairs
    stream = JSONL(source)
    # load the spaCy model
    nlp = spacy.load(spacy_model)
    # and tokenize the text
    stream = add_tokens(nlp, stream)

    return {
        "view_id": "ner_manual",
        "dataset": dataset,
        "stream": stream,
        "config": {"lang": nlp.lang, "label": label, "labels": label},
    }
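
For completeness, the source file this recipe reads is assumed to hold one JSON object per line with a "question" key (read by the JavaScript) and a "text" key (tokenized by add_tokens). A minimal sketch of writing such a file, where the file name and the sample records are made up for illustration:

```python
import json
import os
import tempfile

# Made-up sample records: each pairs a question with its context paragraph
samples = [
    {"question": "What color is the sky?", "text": "On a clear day the sky is blue."},
    {"question": "Who wrote Don Quixote?", "text": "Don Quixote was written by Miguel de Cervantes."},
]


def write_jsonl(path, records):
    """Write one JSON object per line (the JSONL format)."""
    with open(path, "w", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record, ensure_ascii=False) + "\n")


source_path = os.path.join(tempfile.gettempdir(), "qa_source.jsonl")
write_jsonl(source_path, samples)
```

The recipe could then presumably be started with something like `prodigy qa my_dataset en_core_web_sm qa_source.jsonl -F recipe.py`, where the dataset, model and file names are placeholders.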

Thanks a lot!

vitojph · October 24, 2019, 9:43am

@ines Please let me know if I can send you a PR to https://github.com/explosion/prodigy-recipes to add this recipe

ines · October 25, 2019, 9:02am

@vitojph Thanks and yes, that'd be cool! You can add the "javascript" to the "config" returned by the recipe – this way, everything can fit into one file.
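
A sketch of what that could look like, using a lightly refactored version of the JavaScript from the prodigy.json snippet earlier in the thread. The constant name `SHOW_QUESTION_JS` and the helper `build_config` are illustrative and not part of Prodigy:

```python
# JavaScript from the prodigy.json example, kept as a Python string
SHOW_QUESTION_JS = """
function showQuestion() {
  const container = document.querySelector('.prodigy-title');
  container.innerHTML = window.prodigy.content.question;
}
document.addEventListener('prodigymount', showQuestion);
document.addEventListener('prodigyupdate', showQuestion);
"""


def build_config(lang, labels):
    """Build the "config" dict returned by the recipe, JavaScript included."""
    return {"lang": lang, "labels": list(labels), "javascript": SHOW_QUESTION_JS}


config = build_config("en", ["answer_span"])
```

The recipe would return this dict under its "config" key, so the separate prodigy.json entry is no longer needed.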

vitojph · October 25, 2019, 6:29pm

@ines I don't seem to have permissions to push new branches to the repo. Can you please give me access? My GitHub account is vitojph.

ines · October 26, 2019, 11:41am

@vitojph Just fork the repo and submit a PR from your fork

sanchez-castro · February 24, 2022, 6:54pm

Hi @vitojph and @ines

My name is Adrian, and I'm working on a SQuAD Q&A project at Stanford.

I see that this recipe is available in the prodigy-recipes repo.
Could you confirm the JSON schema that this recipe supports?

The function documentation states:

Annotate question/answer pairs with a custom HTML interface. Expects an
input file with records that look like this:
{"question": "What color is the sky?", "question_answer": "blue"}

Where is the context?
Should I use the key "text"?
Shouldn't it be like:
{"question": "What color is the sky?", "text": <paragraph>, "question_answer": "blue"}

Thanks in advance.

P.S. Why are the "other" recipes not part of the basic installation?

Adrian

sanchez-castro · February 24, 2022, 8:23pm

It would be great to have the recipe instructions on the official documentation website too.

sanchez-castro · February 25, 2022, 3:22am

ines · February 25, 2022, 11:38am

The example recipe here is intended to do a binary review of question/answer pairs without context, but you could easily include the context as well by modifying the code and including the context in the HTML template: https://github.com/explosion/prodigy-recipes/blob/master/other/question_answering.py
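
Including the context could be as simple as adding one more variable to the template. The sketch below assumes the records carry the context under a "text" key and that the template uses Mustache-style `{{...}}` variables. The markup, the constant name `QA_TEMPLATE` and the function `review_components` are illustrative and not taken from the linked recipe.

```python
# Template showing context, question and proposed answer for a binary review
QA_TEMPLATE = (
    "<p><strong>Context:</strong> {{text}}</p>"
    "<p><strong>Question:</strong> {{question}}</p>"
    "<p><strong>Answer:</strong> {{question_answer}}</p>"
)


def review_components(dataset, stream):
    """Return components for an accept/reject review that shows the context."""
    return {
        "view_id": "html",
        "dataset": dataset,
        "stream": stream,
        "config": {"html_template": QA_TEMPLATE},
    }


components = review_components("qa_review", iter([]))
```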

Alternatively, the example recipe shared above takes a slightly different approach.

The prodigy-recipes repo is a collection of templates for developing custom recipes with Prodigy. It includes simplified versions of some of the built-in recipes that can be adapted, as well as examples of different use cases and contributions from the community. None of these recipes are included verbatim in the default installation – they're templates for developing your own, or specific examples that are not general-purpose enough to be useful as a built-in recipe, but still interesting.

P.S.: It's not really helpful to bump a thread multiple times within a short time frame, especially not with just single emojis. This makes it a lot harder for us (and other readers) to follow new posts.

sanchez-castro · March 6, 2022, 1:34am

thanks! but one more emoji
