How to extend an already labeled corpus?

We have a labeled corpus and a corresponding NER model.
We would like to extend this corpus with a new entity type, in order to train an NER model that can identify the new entity in sentences (along with the existing labels).
How can this be done using Prodigy?

I believe this is done using the “manual named entity interface”, but I am not sure what the already labeled sentences would look like in the interface. Would I see the sentence with its existing labels and be able to label the new entity, or would I see a bare sentence?

Another question:
How do I import the already labeled data without losing the current labels?

Hi! Your idea definitely sounds good: I’d recommend converting your existing data to Prodigy’s JSONL format, and then loading that into the ner.manual recipe. The recipe will respect pre-defined entity spans, so you’ll see what’s already in your data and you’ll be able to correct the annotations and add more annotations for additional labels.

Here’s an example of an entry in the data:

{
    "text": "Hello Apple",
    "tokens": [
        { "text": "Hello", "start": 0, "end": 5, "id": 0 },
        { "text": "Apple", "start": 6, "end": 11, "id": 1 }
    ],
    "spans": [{ "start": 6, "end": 11, "label": "ORG", "token_start": 1, "token_end": 1 }]
}

You can also find more details on the format in the “Annotation task formats” sections in your PRODIGY_README.html. For NER, you’ll need the original text and a list of "spans" describing the character offsets into the text, and the label. If the annotated entity spans in your corpus aren’t consistent with spaCy’s tokenization, you should probably also provide a list of "tokens" to tell Prodigy how to split the text.
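
As a rough illustration of the format described above, here is a minimal sketch that builds one task from a sentence given as (token, BIO tag) pairs. The function name `conll_to_task` is hypothetical and not part of Prodigy, and the sketch assumes tokens are separated by single spaces, since a token-per-line corpus usually does not keep the original whitespace.

```python
import json


def conll_to_task(rows):
    """Build a Prodigy-style task dict from (token, BIO tag) pairs.

    Hypothetical helper. Assumes tokens are joined by single spaces.
    """
    tokens, spans = [], []
    offset = 0
    current = None  # the entity span that is currently open, if any
    for i, (word, tag) in enumerate(rows):
        start, end = offset, offset + len(word)
        tokens.append({"text": word, "start": start, "end": end, "id": i})
        offset = end + 1  # step over the assumed single space
        prefix, _, label = tag.partition("-")
        if prefix == "I" and current is not None and current["label"] == label:
            # Token continues the open span: extend its end offsets.
            current["end"] = end
            current["token_end"] = i
        elif prefix in ("B", "I"):
            # Start a new span (an "I-" tag without an open span also starts one).
            current = {"start": start, "end": end, "label": label,
                       "token_start": i, "token_end": i}
            spans.append(current)
        else:
            current = None  # "O" tag closes any open span
    text = " ".join(word for word, _ in rows)
    return {"text": text, "tokens": tokens, "spans": spans}


task = conll_to_task([("Hello", "O"), ("Apple", "B-ORG")])
print(json.dumps(task))  # one line of the JSONL file
```

Writing one such JSON object per line gives a JSONL file. Because the "tokens" list is included, the spans stay aligned with the corpus tokenization rather than depending on spaCy's.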

Once you’ve converted your data, you can load it into the ner.manual recipe:

python -m prodigy ner.manual your_dataset /path/to/data.jsonl --label LABEL_ONE,LABEL_TWO

As you annotate the examples, they’ll be saved to the dataset and you’ll end up with a complete corpus consisting of the reviewed existing annotations plus your modifications / new labels.

Just to verify that I understand:

This means that I need to convert from our standard format (CoNLL-2003) to Prodigy's JSONL format, and at the end convert back from Prodigy JSONL to CoNLL-2003?

If the answer is yes,
do you have a standard converter already written,
or any code snippets?

If your source and desired output format is CoNLL, then yes :) For Prodigy, we tried to come up with a data format and representation that was easy to read, fully JSON-serializable and relatively easy to process and convert to other formats. That's how we ended up with Prodigy's JSONL format.
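
To give an idea of the way back, here is a minimal sketch that turns one annotated task into (token, tag) pairs. The name `task_to_conll` is hypothetical, the tags are written in the B-/I-/O scheme, and the sketch assumes each span carries "token_start" and "token_end" as in the example entry. It only produces the token and entity columns; the other CoNLL-2003 columns (POS and chunk tags) would have to come from your original data.

```python
def task_to_conll(task):
    """Return (token text, BIO tag) pairs for one Prodigy-style task.

    Hypothetical helper. Assumes every span has token_start/token_end.
    """
    tags = ["O"] * len(task["tokens"])
    for span in task.get("spans", []):
        first, last = span["token_start"], span["token_end"]
        tags[first] = "B-" + span["label"]
        # token_end is inclusive, so the range runs up to last + 1.
        for i in range(first + 1, last + 1):
            tags[i] = "I-" + span["label"]
    return [(tok["text"], tag) for tok, tag in zip(task["tokens"], tags)]


example = {
    "text": "Hello Apple",
    "tokens": [
        {"text": "Hello", "start": 0, "end": 5, "id": 0},
        {"text": "Apple", "start": 6, "end": 11, "id": 1},
    ],
    "spans": [{"start": 6, "end": 11, "label": "ORG",
               "token_start": 1, "token_end": 1}],
}
for word, tag in task_to_conll(example):
    print(word, tag)
```

When exporting from a dataset you would probably also want to skip examples you rejected or ignored while annotating.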

Yes, check out this thread:


I tried the example and command above, but it failed with: ValueError: JSON file needs to contain a list of tasks. What is the minimum number of tasks I need to import?

@coco What does your JSON file look like inside? I think the problem here is that the outer object isn’t a list but something else. For example, the file could look like this:

[
    {"text": "..."},
    {"text": "..."}
]

Whereas the following would be invalid and cause this error:

{"text": "..."}
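
If you want to check or repair a file before loading it, a small sketch like the following may help. Both function names are hypothetical and not part of Prodigy: the first wraps a single top-level object in a list so the content matches the valid shape shown above, and the second writes the tasks out as JSONL, one object per line.

```python
import json


def as_task_list(raw):
    """Parse JSON text and make sure the result is a list of tasks."""
    data = json.loads(raw)
    # A single top-level object is wrapped so the outer value is a list.
    return data if isinstance(data, list) else [data]


def to_jsonl(tasks):
    """Serialize tasks as JSONL: one JSON object per line."""
    return "\n".join(json.dumps(task) for task in tasks)


tasks = as_task_list('{"text": "..."}')
print(json.dumps(tasks))  # now a list with one task
print(to_jsonl(tasks))
```

So there is no minimum number of tasks beyond one; what matters is that the outer value in a JSON file is a list.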