We have a labeled corpus and a corresponding NER model.
We would like to extend this already-labeled corpus with a new entity type, so that we can train a NER model that identifies the new entity in sentences (along with the existing labels).
How can this be done using Prodigy?
I believe this is done with the “manual named entity interface”, but I am not sure what the already-labeled sentences would look like in the interface. Would I see each sentence with its existing labels and be able to label the new entity, or would I see a bare sentence?
How do I import the already-labeled data without losing the current labels?
Hi! Your idea definitely sounds good: I’d recommend converting your existing data to Prodigy’s JSONL format, and then loading that into the ner.manual recipe. The recipe will respect pre-defined entity spans, so you’ll see what’s already in your data and you’ll be able to correct the annotations and add more annotations for additional labels.
You can also find more details on the format in the “Annotation task formats” sections in your PRODIGY_README.html. For NER, you’ll need the original text and a list of "spans" describing the character offsets into the text, and the label. If the annotated entity spans in your corpus aren’t consistent with spaCy’s tokenization, you should probably also provide a list of "tokens" to tell Prodigy how to split the text.
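As a rough illustration of the conversion described above, here is a minimal sketch that turns an existing corpus into JSONL tasks with "text" and "spans". It assumes your corpus can be read as pairs of text and `(start, end, label)` character offsets; the helper names `to_prodigy_task` and `to_jsonl` and the example sentence are made up for illustration, and the optional "tokens" list is not covered here.

```python
import json


def to_prodigy_task(text, entities):
    """Build one task dict from a text and (start, end, label) character offsets."""
    spans = []
    for start, end, label in entities:
        # Reject offsets that do not point inside the text.
        if not (0 <= start < end <= len(text)):
            raise ValueError(f"Bad span {(start, end, label)} for text of length {len(text)}")
        spans.append({"start": start, "end": end, "label": label})
    return {"text": text, "spans": spans}


def to_jsonl(examples):
    """Serialize the tasks as JSONL: one JSON object per line."""
    return "\n".join(
        json.dumps(to_prodigy_task(text, entities), ensure_ascii=False)
        for text, entities in examples
    )


# Hypothetical corpus entry; replace with a reader for your own format.
examples = [
    ("Apple opened an office in Berlin.", [(0, 5, "ORG"), (26, 32, "GPE")]),
]

jsonl = to_jsonl(examples)
print(jsonl)
```

Writing the resulting string to a `.jsonl` file gives you something you can pass to the recipe as the source. It is worth spot-checking a few lines by slicing the text with each span's offsets to confirm they still match the intended entity.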
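Once you’ve converted your data, you can load it into the ner.manual recipe, along the lines of `prodigy ner.manual your_dataset en_core_web_sm your_data.jsonl --label NEW_LABEL` (the dataset name, model, file name and label here are placeholders for your own).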
If your source and desired output formats are both CoNLL, then yes. For Prodigy, we tried to come up with a data format and representation that was easy to read, fully JSON-serializable and relatively easy to process and convert to other formats. That’s how we ended up with Prodigy’s JSONL format.