How do I load the output of ner.gold-to-spacy into spacy?

Hi,

I looked at this link in the spaCy documentation about updating the named entity recognizer with custom training data, but the format of the training data referenced in the spaCy documentation

TRAIN_DATA = [
    ('Who is Shaka Khan?', {
        'entities': [(7, 17, 'PERSON')]
    }),
    ('I like London and Berlin.', {
        'entities': [(7, 13, 'LOC'), (18, 24, 'LOC')]
    })
]

is different than the jsonl format exported from ner.gold-to-spacy. Is there an easy way to load jsonl data for use with spacy?

Thanks!

Hat

Hi! The ner.gold-to-spacy format should give you data that looks like this:

 ["I like London", {"entities": [[7, 13, "LOC"]]}]

That’s pretty much the same format as the examples above (only with a list instead of tuples, since JSON doesn’t know tuples – but that shouldn’t matter). So you should be able to just read in the JSONL file and pass the result in as the training data.

Hi Ines,

Thanks very much for the reply! Should I be able to use the JSONL loader that comes with Prodigy to read the file? When I try to do so I get an error saying invalid JSON?

Apologies if these are novice questions.

No worries!

Ultimately, all you need to do is open the file, iterate over the lines and call json.loads(line) on each one (or, even better, json.loads(line.strip()) so that surrounding whitespace is trimmed first). You can also use the jsonlines Python library if that’s easier.
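Here is a minimal sketch of that line-by-line approach using only the standard library. The helper names load_jsonl and to_train_data and the file name train.jsonl are made up for illustration, and the code assumes every non-empty line looks like the ["text", {"entities": [[start, end, label]]}] example above. Note that a JSONL file holds one JSON value per line, so parsing the whole file in one go with json.load would typically fail once there is more than one record.

```python
import json
from pathlib import Path


def load_jsonl(path):
    """Read a JSONL file and return a list with one parsed record per line."""
    records = []
    with Path(path).open(encoding="utf-8") as f:
        for line in f:
            line = line.strip()  # trim the newline and any stray whitespace
            if not line:  # skip blank lines, e.g. at the end of the file
                continue
            records.append(json.loads(line))
    return records


def to_train_data(records):
    """Turn [text, {"entities": [[start, end, label], ...]}] records into
    (text, {"entities": [(start, end, label), ...]}) tuples.

    Assumption: each record has exactly this two-element shape.
    """
    return [
        (text, {"entities": [tuple(span) for span in annotations["entities"]]})
        for text, annotations in records
    ]


# Usage (hypothetical file name):
# TRAIN_DATA = to_train_data(load_jsonl("train.jsonl"))
```

The conversion to tuples is optional, since the lists work just as well, but it makes the result match the TRAIN_DATA example from the documentation exactly.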

Alternatively, you could also just copy-paste the data into a Python list – for example:

TRAIN_DATA = [
    ["I like London", {"entities": [[7, 13, "LOC"]]}]
]

Hooray Ines! I was totally backwards on this and you helped me out. I really appreciate it! Got things working now.
