Data format for label correction task based on pre-labelled dataset

Hi,
Newbie here. I am trying to do an NER labelling task in which I want to correct an already labelled dataset.

I have a dataframe that I have already labelled (based on regexes, lists etc.), and I want to load this labelled data into Prodigy to start correcting some of those labels, since the initial labelling is of course not perfect. I have a huge "pre-labelled" set built this way.

To give an idea of what my dataframe looks like (some records have only O as label):

Now I know I have to convert it to JSONL format, but I am not sure what the content of that format should look like so that I can load it into Prodigy and start correcting the labels (with ner.manual). I understand I don't need ner.correct, since that is for when I already have a trained model (if I am correct on this).

As an analogy: I don't want to label based on a pattern file, but based on an already pre-labelled file (the analogy comes from projects/ner-food-ingredients in the explosion/projects repository on GitHub, which is a tutorial on NER for food ingredients).

Can someone point me in the right direction?
Best regards

It depends a bit on what recipe you'd like to use. If you're fine with using ner.manual then you can just feed it the un-tokenized text but you won't have any entities pre-highlighted.

The tricky thing to watch out for here is the tokenizer. Prodigy will assume a spaCy tokenizer unless you provide a custom one. From glancing at your tokens, it seems like they're perhaps generated in a compatible way, but it deserves double checking ... what tokeniser was used here? You can compare with spaCy by running:

import spacy

# Why a blank model? https://www.youtube.com/watch?v=foHTpmFPmwc
nlp = spacy.blank("nl")
# Tokenize some text
[t for t in nlp("Ik beveel nooit aan ... Dus dit is niet omdat je")]

# [Ik, beveel, nooit, aan, ..., Dus, dit, is, niet, omdat, je]

Hi Vincent,
Thanks for your reply. Yes, I actually used the spaCy tokenizer. And yes, I want to have the spans pre-highlighted so I can go through the labelling more quickly.

See it like this: I have thousands of records labelled using lists and regexes, which together form a big "silver" dataset. I want to take a part of this silver dataset (say 5K records) and go through it to check for errors in the labelling. Once I have inspected those, I have a gold set that I want to use for initial model building (and then maybe use this model to start correcting part of the remaining records of the silver dataset).

My main point is: I want to see the pre-labelled entities in Prodigy for the 5K records I want to correct, but I am unsure what the data format (contents) should look like so that I can load it into Prodigy.

I think the easiest way for you to pre-highlight the text is to use patterns. These also allow you to use regexes and word lists. You can use any patterns that spaCy can handle, and especially in the long term this would be the most maintainable solution.
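As a rough sketch of what that could look like: the snippet below turns word lists into token-based patterns and serializes them as JSONL. The labels, phrases and the regex are made up for illustration, and splitting phrases on whitespace is an assumption that only holds when the tokenizer splits those phrases the same way.

```python
import json

# Hypothetical word lists; replace with the lists used for the initial labelling.
word_lists = {
    "PER": ["Vincent", "Jan de Vries"],
    "ORG": ["Explosion"],
}

def build_patterns(word_lists):
    """Turn {label: [phrases]} into token-based match patterns."""
    patterns = []
    for label, phrases in word_lists.items():
        for phrase in phrases:
            # One {"lower": ...} entry per word, so matching ignores case.
            # Assumption: whitespace splitting agrees with the tokenizer here.
            token_pattern = [{"lower": word.lower()} for word in phrase.split()]
            patterns.append({"label": label, "pattern": token_pattern})
    return patterns

patterns = build_patterns(word_lists)

# A single-token regex pattern (hypothetical: an initial such as "J.").
patterns.append({"label": "PER", "pattern": [{"TEXT": {"REGEX": "^[A-Z]\\.$"}}]})

# One JSON object per line; save this text as e.g. patterns.jsonl.
jsonl_text = "\n".join(json.dumps(p, ensure_ascii=False) for p in patterns)
print(jsonl_text)
```

With the output saved to a file, a command along the lines of `python -m prodigy ner.manual your_dataset blank:nl your_data.jsonl --label PER,ORG --patterns patterns.jsonl` should pre-highlight the matches. The dataset and file names are placeholders, and it is worth checking the recipe's help output for the exact arguments in your Prodigy version.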

This workflow is what the ner.manual recipe was designed for. If you really want to avoid patterns, you could alternatively go the custom recipe route. You can use the ner_manual view_id, which expects data to be in this format:

{
  "text": "First look at the new MacBook Pro",
  "spans": [
    {"start": 22, "end": 33, "label": "PRODUCT", "token_start": 5, "token_end": 6}
  ],
  "tokens": [
    {"text": "First", "start": 0, "end": 5, "id": 0},
    {"text": "look", "start": 6, "end": 10, "id": 1},
    {"text": "at", "start": 11, "end": 13, "id": 2},
    {"text": "the", "start": 14, "end": 17, "id": 3},
    {"text": "new", "start": 18, "end": 21, "id": 4},
    {"text": "MacBook", "start": 22, "end": 29, "id": 5},
    {"text": "Pro", "start": 30, "end": 33, "id": 6}
  ]
}
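If the tokens and character offsets already exist in a dataframe, a record like the one above could be assembled roughly as follows. This is a minimal sketch: `align_tokens` and `make_task` are hypothetical helper names, and it assumes each record provides the raw text, the token strings in order, and the pre-labelled spans as (start, end, label) character offsets.

```python
import json

def align_tokens(text, token_texts):
    """Locate each token in the text and return token dicts with offsets."""
    tokens, cursor = [], 0
    for i, tok in enumerate(token_texts):
        start = text.find(tok, cursor)
        if start == -1:
            raise ValueError(f"Token {tok!r} not found after position {cursor}")
        end = start + len(tok)
        tokens.append({"text": tok, "start": start, "end": end, "id": i})
        cursor = end
    return tokens

def make_task(text, token_texts, char_spans):
    """char_spans: list of (start_char, end_char, label) from the pre-labelling."""
    tokens = align_tokens(text, token_texts)
    by_start = {t["start"]: t["id"] for t in tokens}
    by_end = {t["end"]: t["id"] for t in tokens}
    spans = []
    for start, end, label in char_spans:
        # Spans that do not line up with token boundaries are skipped.
        if start in by_start and end in by_end:
            spans.append({"start": start, "end": end, "label": label,
                          "token_start": by_start[start],
                          "token_end": by_end[end]})
    return {"text": text, "tokens": tokens, "spans": spans}

task = make_task(
    "First look at the new MacBook Pro",
    ["First", "look", "at", "the", "new", "MacBook", "Pro"],
    [(22, 33, "PRODUCT")],
)
# One such JSON object per line gives a JSONL file.
print(json.dumps(task))
```

Skipping misaligned spans silently is a design choice for the sketch; on a real silver dataset it may be more useful to log them, since they usually point at tokenization differences.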

I wrote a quick custom recipe that can read this (assuming it's saved and formatted in a jsonl file) and can be used as a starting point for custom logic on your end.

import prodigy
from prodigy.components.preprocess import add_tokens
import spacy

@prodigy.recipe(
    "ner.custom",
    dataset=("Dataset to save answers to", "positional", None, str),
    filepath=("Filepath with examples.", "positional", None, str),
)
def search(dataset, filepath):
    nlp = spacy.blank("en")
    # Load your own streams from anywhere you want
    stream = prodigy.get_stream(filepath, rehash=True, dedup=True)

    # You can also add some custom code if you like
    # stream = (add_spans_with_custom_logic(e) for e in add_tokens(nlp, stream, skip=True))

    return {
        "dataset": dataset,
        "view_id": "ner_manual",
        "stream": stream,
        "config": {
            # Should match the labels used in the pre-labelled spans
            "labels": ["PRODUCT"]
        }
    }

When I run this command on my machine:

python -m prodigy ner.custom throwawayz example.jsonl -F recipe.py

I get this interface:

Oh, and I almost forgot. Are you using BILOU? If so, you might appreciate these helper functions.

Great, Vincent, thank you for these suggested options. I will try to write my data to the JSONL input format, since I already have it in a fairly neat format along with the start and end of each token. We also used the spaCy tokenizer for this. See below an example of what one record looks like:

I am curious to see whether Prodigy will pick up those labels (B-PER, I-PER etc.) automatically when reading in the data. That would be cool.
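The span format shown earlier in this thread uses plain labels such as "PRODUCT" with character and token offsets, so per-token B-/I- tags most likely need to be merged into spans before loading. A minimal sketch of that conversion, assuming one BIO tag per token and token dicts in the format above (`bio_to_spans` is a hypothetical helper, not a Prodigy function):

```python
def bio_to_spans(tokens, tags):
    """tokens: dicts with "start", "end", "id"; tags: one BIO tag per token."""
    spans, current = [], None
    for tok, tag in zip(tokens, tags):
        prefix, _, label = tag.partition("-")
        continues = (prefix == "I" and current is not None
                     and current["label"] == label)
        if continues:
            # Extend the open span to cover this token.
            current["end"] = tok["end"]
            current["token_end"] = tok["id"]
            continue
        if current is not None:
            spans.append(current)
            current = None
        if prefix in ("B", "I"):
            # A stray I- tag without a matching B- is treated as a span start.
            current = {"start": tok["start"], "end": tok["end"], "label": label,
                       "token_start": tok["id"], "token_end": tok["id"]}
    if current is not None:
        spans.append(current)
    return spans

# Made-up example record: "Jan de Vries lives in Utrecht"
tokens = [
    {"text": "Jan", "start": 0, "end": 3, "id": 0},
    {"text": "de", "start": 4, "end": 6, "id": 1},
    {"text": "Vries", "start": 7, "end": 12, "id": 2},
    {"text": "lives", "start": 13, "end": 18, "id": 3},
    {"text": "in", "start": 19, "end": 21, "id": 4},
    {"text": "Utrecht", "start": 22, "end": 29, "id": 5},
]
tags = ["B-PER", "I-PER", "I-PER", "O", "O", "B-LOC"]
spans = bio_to_spans(tokens, tags)
print(spans)
```

The resulting list can go into the "spans" key of each record, with "PER" and "LOC" (or whatever labels the data uses) listed as the labels in the recipe config.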