How do I export annotated training and test sets as csv?

Hello,

Is there a way to export the results of training as csv? For example, with a column for sentences, another for predicted entities, and another for actual entities?

After using ner.manual, I exported the results as json and then converted that to csv; however, I find the structure a little confusing. I would be grateful if you could explain it to me, or if we could export it in the format I described. These are a few of the columns I see in my converted csv:

Sincerely,
Tahia

The issue with .csv files is that they typically do not handle nested data very well. Since a sentence can have more than one entity, you'd typically end up with a list in one of the columns.
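For context, a single exported annotation typically looks something like this (a made-up example; the exact keys depend on your recipe, but the nested "spans" list is what makes a flat .csv awkward):

example = {
    "text": "Apple opened a new office in Berlin.",
    "_timestamp": 1650000000,
    "spans": [
        {"start": 0, "end": 5, "label": "ORG"},
        {"start": 29, "end": 35, "label": "GPE"},
    ],
}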

Libraries like pandas can read the exported JSONL file via something like:

import pandas as pd

df = pd.read_json(path, orient="records", lines=True)

But the resulting df may be hard to deal with, because pandas is designed more for flat, tabular data structures.
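If you do want to explore it in pandas anyway, one way to flatten the nested spans is to explode that column. This is only a sketch, and it assumes the export has a "spans" column as in a typical ner.manual export:

import pandas as pd

# Placeholder path; point this at your Prodigy JSONL export.
df = pd.read_json("annotations.jsonl", orient="records", lines=True)

# One row per span instead of one row per example.
flat = df.explode("spans").reset_index(drop=True)
# Each "spans" cell is now a single dict (or NaN for examples without entities).
flat["label"] = flat["spans"].apply(lambda s: s["label"] if isinstance(s, dict) else None)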

Is there a specific reason why you need to have the data in .csv format? If I understand that better, I might be able to give better advice.

Hi Koaning,

Thank you for your reply. My project involves correctly identifying entities from a series of texts. After we have trained an NER model that can correctly identify the entities, we want to do time series forecasting using the number and type of entities detected from each text input.

I am looking for a simple output file with the input text in one column, the predicted entities in another column, and if possible, the true entities in a third column. Is there a way to export this directly from pdf, in a csv format or otherwise?

Thank you so much for your diligent effort in replying to our forum queries. I hope to hear from you soon.

Sincerely,
Tahia

Prodigy does not offer methods to get this out of a pdf.

What if you do something like this?

import spacy
import srsly 

nlp = spacy.load("your_trained_model")
examples = srsly.read_jsonl("path/to/prodigy-annotation-export.jsonl")

tuples = ((eg, eg['text']) for eg in examples)
for ex, doc in nlp.pipe(tuples, as_tuples):
    for ent in doc.ents:
        # prints the text, annotation timestamp, the predicted entity and the annotated spans
        print(ex['text'], ex['_timestamp'], ent, ex['spans'])

I usually write small Python scripts for this sort of thing; would something like this work?

Apologies for the typo, Koaning; I meant Prodigy here, not pdf.

I think it could work; however, I received this error:

Ah! Pardon, it should be as_tuples=True there. I checked the docs to make sure and I also spotted another mistake on my end.

To quote the setting for as_tuples:

If set to True, inputs should be a sequence of (text, context) tuples. Output will then be a sequence of (doc, context) tuples. Defaults to False.

So that means it should be more like this:


tuples = ((eg['text'], eg) for eg in examples)
for doc, ex in nlp.pipe(tuples, as_tuples=True):
    for ent in doc.ents:
        # prints the text, annotation timestamp, the predicted entity and the annotated spans
        print(ex['text'], ex['_timestamp'], ent, ex['spans'])
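If you then want the exact three-column .csv you described (text in one column, predicted entities in another, true entities in a third), a small script along these lines should do it. This is just a sketch: the output path, column names and the "text (LABEL)" formatting are my own choices, and it assumes the same trained model and Prodigy JSONL export as above:

import csv

import spacy
import srsly

nlp = spacy.load("your_trained_model")
examples = list(srsly.read_jsonl("path/to/prodigy-annotation-export.jsonl"))

tuples = ((eg['text'], eg) for eg in examples)

with open("ner_results.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["text", "predicted_entities", "true_entities"])
    for doc, eg in nlp.pipe(tuples, as_tuples=True):
        # The model's predictions, formatted as "text (LABEL)" strings.
        predicted = "; ".join(f"{ent.text} ({ent.label_})" for ent in doc.ents)
        # The annotated spans from the Prodigy export, formatted the same way.
        true = "; ".join(
            f"{eg['text'][s['start']:s['end']]} ({s['label']})"
            for s in eg.get("spans", [])
        )
        writer.writerow([eg["text"], predicted, true])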

Thank you so much! I set as_tuples=True and it worked exactly as I wanted.


tuples = ((eg['text'], eg) for eg in examples)
for doc, ex in nlp.pipe(tuples, as_tuples=True):
    for ent in doc.ents:
        # prints the text, annotation timestamp, the predicted entity and the annotated spans
        print(ex['text'], ex['_timestamp'], ent, ex['spans'])