How to extend an already labeled corpus?

We have a labeled corpus and its corresponding NER model.
We would like to extend this already labeled corpus with a new entity, in order to create a NER model that can identify the new entity in sentences (along with the existing labels).
How can this be done using Prodigy?

I believe this is done using the “manual named entity interface”, but I am not sure what the already labeled sentences would look like in the interface. Would I see the sentence with its labels and the ability to label the new entity? Or would I see a naked sentence?

Another question:
how do I import the already labeled data without losing the current labels?

Hi! Your idea definitely sounds good: I’d recommend converting your existing data to Prodigy’s JSONL format, and then loading that into the ner.manual recipe. The recipe will respect pre-defined entity spans, so you’ll see what’s already in your data and you’ll be able to correct the annotations and add more annotations for additional labels.

Here’s an example of an entry in the data:

{
    "text": "Hello Apple",
    "tokens": [
        { "text": "Hello", "start": 0, "end": 5, "id": 0 },
        { "text": "Apple", "start": 6, "end": 11, "id": 1 }
    ],
    "spans": [{ "start": 6, "end": 11, "label": "ORG", "token_start": 1, "token_end": 1 }]
}

You can also find more details on the format in the “Annotation task formats” section in your PRODIGY_README.html. For NER, you’ll need the original text and a list of "spans" describing the character offsets into the text, plus the label. If the annotated entity spans in your corpus aren’t consistent with spaCy’s tokenization, you should probably also provide a list of "tokens" to tell Prodigy how to split the text.
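To illustrate the shape of the "tokens" entries, here’s a minimal sketch that fills them in with a naive whitespace tokenizer (`add_tokens` is a hypothetical helper name; in practice you’d want the tokens to match spaCy’s tokenization, not a plain split):

```python
def add_tokens(task):
    """Add a "tokens" list to a task dict using naive whitespace splitting.

    Illustrative only: each token records its text, character offsets into
    the original text, and a sequential id, as in the example above.
    """
    tokens, offset = [], 0
    for i, word in enumerate(task["text"].split(" ")):
        tokens.append({"text": word, "start": offset,
                       "end": offset + len(word), "id": i})
        offset += len(word) + 1  # account for the single space between words

    task["tokens"] = tokens
    return task


task = add_tokens({"text": "Hello Apple"})
# task["tokens"][1] == {"text": "Apple", "start": 6, "end": 11, "id": 1}
```

The "token_start" and "token_end" values in each span then refer to the "id" values of the first and last token the entity covers.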

Once you’ve converted your data, you can load it into the ner.manual recipe:

python -m prodigy ner.manual your_dataset /path/to/data.jsonl --label LABEL_ONE,LABEL_TWO

As you annotate the examples, they’ll be saved to the dataset and you’ll end up with a complete corpus consisting of the reviewed existing annotations plus your modifications / new labels.

Just verifying I understand:

this means that I need to convert from our standard format (CoNLL-2003) to Prodigy’s JSONL format, and at the end convert back from Prodigy JSONL to CoNLL-2003 format?

If the answer is yes,
do you have a standard converter already written?
Code snippets? Anything else?

If your source and desired output format is CoNLL, then yes 🙂 For Prodigy, we tried to come up with a data format and representation that was easy to read, fully JSON-serializable and relatively easy to process and convert to other formats. That’s how we ended up with Prodigy’s JSONL format.
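For the forward direction, a minimal converter could look like the sketch below. It assumes CoNLL-2003-style input with whitespace-joined tokens, IOB2 tags in the last column, and blank lines between sentences; `conll_to_prodigy` and `make_task` are hypothetical names, not part of Prodigy:

```python
import json


def conll_to_prodigy(lines):
    """Convert CoNLL-2003-style lines into a list of Prodigy task dicts."""
    examples, tokens, tags = [], [], []
    for line in list(lines) + [""]:  # sentinel blank line flushes last sentence
        line = line.strip()
        if line and not line.startswith("-DOCSTART-"):
            parts = line.split()
            tokens.append(parts[0])
            tags.append(parts[-1])  # NER tag is the last column
            continue
        if tokens:  # blank line or -DOCSTART-: flush the current sentence
            examples.append(make_task(tokens, tags))
            tokens, tags = [], []
    return examples


def make_task(words, tags):
    """Build one task dict with "text", "tokens" and "spans"."""
    toks, spans, offset = [], [], 0
    for i, (word, tag) in enumerate(zip(words, tags)):
        start, end = offset, offset + len(word)
        toks.append({"text": word, "start": start, "end": end, "id": i})
        if tag.startswith("B-"):
            spans.append({"start": start, "end": end, "label": tag[2:],
                          "token_start": i, "token_end": i})
        elif tag.startswith("I-") and spans:
            # Extend the previous span; assumes contiguous IOB2 tags.
            spans[-1]["end"] = end
            spans[-1]["token_end"] = i
        offset = end + 1  # one space between tokens
    return {"text": " ".join(words), "tokens": toks, "spans": spans}


# Write one task per line to produce Prodigy-ready JSONL:
# with open("data.jsonl", "w", encoding="utf8") as f:
#     for task in conll_to_prodigy(open("data.conll", encoding="utf8")):
#         f.write(json.dumps(task) + "\n")
```

Converting back is essentially this in reverse: walk each task’s "tokens", and emit a `B-`/`I-`/`O` tag per token depending on whether it falls inside a span.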

Yes, check out this thread:


I tried to use the example and command above, but it failed with ValueError: JSON file needs to contain a list of tasks. So what is the minimal number of tasks I need to import?

@coco What does your JSON file look like inside? I think the problem here is that the outer object isn’t a list but something else. For example, the file could look like this:

[
    {"text": "..."},
    {"text": "..."}
]

Whereas the following would be invalid and cause this error:

{"text": "..."}
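A JSON list with a single task is already valid, so the minimal number of tasks is one. To make the distinction concrete, here’s a sketch of the kind of check involved (`load_json_tasks` is a hypothetical helper, not Prodigy’s actual loader):

```python
import json


def load_json_tasks(path):
    """Load tasks from a .json file, requiring a top-level list.

    A sketch of the check behind the error above: a file whose top-level
    value is a single object (or anything else) is rejected.
    """
    with open(path, encoding="utf8") as f:
        data = json.load(f)
    if not isinstance(data, list):
        raise ValueError("JSON file needs to contain a list of tasks")
    return data
```

So wrapping your single task in square brackets, `[{"text": "..."}]`, is enough to make the file loadable.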