Is there a way to export the results of training as csv? For example, with a column for sentences, another for predicted entities, and another for actual entities?
After using ner.manual, I exported the results as JSON and then converted that to CSV; however, I find the structure a little confusing. I would be grateful if you could explain it to me or if we could export it in the format I described. These are a few of the columns I see in my converted CSV:
The issue with .csv files is that they typically do not handle nested data very well. Since a sentence can have more than one entity, you'd typically end up with a list in one of the columns.
Libraries like pandas can read it, via something like:
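As a minimal sketch of the pandas route, assuming the export is a JSONL file whose records carry `text` and `spans` keys (the two sample records below are made up for illustration, and the `labels` column is a hypothetical addition), `pd.read_json` with `lines=True` loads one record per row and keeps the nested span list in a single cell:

```python
import io

import pandas as pd

# Stand-in for a Prodigy JSONL export; in practice you would pass the file path.
jsonl = io.StringIO(
    '{"text": "Apple hired Tim Cook.", "spans": ['
    '{"start": 0, "end": 5, "label": "ORG"}, '
    '{"start": 12, "end": 20, "label": "PERSON"}]}\n'
    '{"text": "No entities here.", "spans": []}\n'
)

# lines=True tells pandas to parse one JSON object per line.
df = pd.read_json(jsonl, lines=True)

# The "spans" column holds a list of dicts per row; pull out just the labels.
df["labels"] = df["spans"].apply(lambda spans: [s["label"] for s in spans])
```

Each row still holds a list in the `spans` and `labels` columns, which is exactly the nesting that a flat CSV struggles with.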
Thank you for your reply. My project involves correctly identifying entities in a series of texts. Once we have trained an NER model that can correctly identify the entities, we want to do time series forecasting using the number and type of entities detected in each text input.
I am looking for a simple output file with the input text in one column, the predicted entities in another column, and if possible, the true entities in a third column. Is there a way to export this directly from pdf, in a csv format or otherwise?
Thank you so much for your diligent effort in replying to our forum queries. I hope to hear from you soon.
Prodigy does not offer methods to get this out of a pdf.
What if you do something like this?
import spacy
import srsly
nlp = spacy.load("your_trained_model")
examples = srsly.read_jsonl("path/to/prodigy-annotation-export.jsonl")
tuples = (eg, eg['text'] for eg in examples)
for ex, doc in nlp.pipe(tuples, as_tuples):
    for ent in doc.ents:
        # prints text, annotation timestamp, the predicted entity and the annotated entities
        print(ex['text'], ex['_timestamp'], ent, ex['spans'])
I usually write small Python scripts for this sort of thing. Would something like this work?
Ah! Pardon, it should be as_tuples=True there. I checked the docs to make sure and I also spotted another mistake on my end.
To quote the setting for as_tuples:
If set to True, inputs should be a sequence of (text, context) tuples. Output will then be a sequence of (doc, context) tuples. Defaults to False.
So that means it should be more like this:
tuples = ((eg['text'], eg) for eg in examples)
for doc, ex in nlp.pipe(tuples, as_tuples=True):
    for ent in doc.ents:
        # prints text, annotation timestamp, the predicted entity and the annotated entities
        print(ex['text'], ex['_timestamp'], ent, ex['spans'])
Thank you so much! I set as_tuples=True and it worked exactly as I wanted.
tuples = ((eg['text'], eg) for eg in examples)
for doc, ex in nlp.pipe(tuples, as_tuples=True):
    for ent in doc.ents:
        # prints text, annotation timestamp, the predicted entity and the annotated entities
        print(ex['text'], ex['_timestamp'], ent, ex['spans'])
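For anyone who wants the three-column file asked for at the top of the thread rather than printed output, here is a minimal sketch using only the standard library. The helper names `build_row` and `write_rows`, the column names, and the sample record are all hypothetical; the sketch assumes each annotated span carries `start`, `end` and `label` keys, and it stores each entity list as a JSON string so that it fits in one CSV cell.

```python
import csv
import io
import json

FIELDNAMES = ["text", "predicted_entities", "true_entities"]


def build_row(text, predicted, spans):
    """Flatten one example into a CSV row (hypothetical helper).

    predicted: list of (entity_text, label) pairs from the model.
    spans: annotated spans, assumed to have "start", "end" and "label".
    """
    true_entities = [(text[s["start"]:s["end"]], s["label"]) for s in spans]
    return {
        "text": text,
        # Lists are serialised as JSON so each one fits in a single cell.
        "predicted_entities": json.dumps(predicted),
        "true_entities": json.dumps(true_entities),
    }


def write_rows(rows, handle):
    """Write the rows with a header line to an open text handle."""
    writer = csv.DictWriter(handle, fieldnames=FIELDNAMES)
    writer.writeheader()
    writer.writerows(rows)


# Made-up stand-ins for one annotated example and the model's predictions.
example = {
    "text": "Apple hired Tim Cook.",
    "spans": [
        {"start": 0, "end": 5, "label": "ORG"},
        {"start": 12, "end": 20, "label": "PERSON"},
    ],
}
predicted = [("Apple", "ORG")]

buffer = io.StringIO()  # swap for open("output.csv", "w", newline="") to write a file
write_rows([build_row(example["text"], predicted, example["spans"])], buffer)
csv_text = buffer.getvalue()
```

Inside the `nlp.pipe` loop, `predicted` could be built as `[(ent.text, ent.label_) for ent in doc.ents]`, and `ex.get("spans", [])` guards against examples that have no annotated spans.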