How do I load the output of ner.gold-to-spacy into spacy?

ner
spacy
solved

#1

Hi,

I looked at this link in the spacy documentation around updating the named entity recognition using custom training data, but the format of the training data referenced in the spacy documentation

TRAIN_DATA = [
    ('Who is Shaka Khan?', {
        'entities': [(7, 17, 'PERSON')]
    }),
    ('I like London and Berlin.', {
        'entities': [(7, 13, 'LOC'), (18, 24, 'LOC')]
    })
]

is different than the jsonl format exported from ner.gold-to-spacy. Is there an easy way to load jsonl data for use with spacy?

Thanks!

Hat


(Ines Montani) #2

Hi! The ner.gold-to-spacy format should give you data that looks like this:

 ["I like London", {"entities": [[7, 13, "LOC"]]}]

That’s pretty much the same format as the examples above (only with a list instead of tuples, since JSON doesn’t know tuples – but that shouldn’t matter). So you should be able to just read in the JSONL file and pass the result in as the training data.


#3

Hi Ines,

Thanks very much for the reply! Should I be able to use the JSONL loader that comes with Prodigy to read the file? When I try to do so I get an error saying invalid JSON?

Apologies if these are novice questions.


(Ines Montani) #4

No worries!

Ultimately, all you need to do is open the file, iterate over the lines and call json.loads(line) on each one (or, even better, json.loads(line.strip()) to trim whitespace first). You can also use the jsonlines Python library if that’s easier.
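
As a rough sketch of the file-reading approach, assuming each line of the export looks like the example above: the helper name load_jsonl is just for illustration and isn't part of Prodigy or spaCy. Converting the entity lists back to tuples is optional and only there to mirror the format in the spaCy docs.

```python
import json
from pathlib import Path


def load_jsonl(path):
    """Read a JSONL file into a list of (text, annotations) pairs.

    Hypothetical helper; assumes each line is a JSON array of the form
    ["some text", {"entities": [[start, end, "LABEL"], ...]}].
    """
    train_data = []
    with Path(path).open(encoding="utf8") as f:
        for line in f:
            line = line.strip()  # trim whitespace and the trailing newline
            if not line:  # skip blank lines
                continue
            text, annotations = json.loads(line)
            # JSON has no tuples, so turn each [start, end, label] list into a tuple
            entities = [tuple(ent) for ent in annotations.get("entities", [])]
            train_data.append((text, {"entities": entities}))
    return train_data
```

You could then do something like TRAIN_DATA = load_jsonl("train.jsonl"), where the file name is whatever you exported to.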

Alternatively, you could also just copy-paste the data into a Python list – for example:

TRAIN_DATA = [
    ["I like London", {"entities": [[7, 13, "LOC"]]}]
]

#5

Hooray Ines! I was totally backwards on this and you helped me out. I really appreciate it! Got things working now.