Hi @dave-espinosa!
Thanks for your question!
I think I found a solution for #1:
I ran your code (FYI, you're missing from spacy.tokens import Span) and ignored the last two lines, since you can skip the intermediate step of saving the binary to disk.
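For reference, here's a minimal sketch of how a docs list like yours might be constructed (the text, token indices, and label below are placeholders, not taken from your code):

import spacy
from spacy.tokens import Span

nlp = spacy.blank("en")
doc = nlp("Apple is looking at buying a U.K. startup.")
# Span(doc, start_token, end_token, label=...) marks "Apple" as an ORG
doc.ents = [Span(doc, 0, 1, label="ORG")]
docs = [doc]

With docs in hand, the conversion to Prodigy's format looks like this: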
import srsly

examples = []  # examples in Prodigy's JSONL task format
for doc in docs:  # loop through your annotated Doc objects
    # convert each entity Span to Prodigy's character-offset span format
    spans = [{"start": ent.start_char, "end": ent.end_char, "label": ent.label_} for ent in doc.ents]
    examples.append({"text": doc.text, "spans": spans, "answer": "accept"})
srsly.write_jsonl("data.jsonl", examples)
This is nearly identical to the link above, but with one addition: I added an "answer": "accept" key to each example, which accepts all of them (which may not be what you want). You can then load the file using db-in, as shown below.
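If you haven't used db-in before, it's a one-liner on the command line (the dataset name here is just a placeholder):

prodigy db-in your_dataset data.jsonl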
Can you confirm whether this solves your problem?
I think you can shrink this code even further by skipping the srsly.write_jsonl and db-in steps and loading the examples directly into Prodigy's database using the database.add_examples method.
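Something like this sketch should work (the dataset name "ner_data" is a placeholder, and I'm calling set_hashes first so each example gets the hashes Prodigy expects):

from prodigy import set_hashes
from prodigy.components.db import connect

examples = [set_hashes(eg) for eg in examples]  # add _input_hash / _task_hash
db = connect()  # connects to the database configured in your prodigy.json
db.add_dataset("ner_data")  # create the dataset if it doesn't exist yet
db.add_examples(examples, datasets=["ner_data"])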
Were you splitting your text into sentences? If so, have you seen the split_sentences function in Prodigy?
from prodigy.components.preprocess import split_sentences
import spacy

nlp = spacy.load("en_core_web_sm")
stream = [{"text": "spaCy is a library. It is written in Python."}]
# split each incoming example into one task per sentence
stream = split_sentences(nlp, stream, min_length=30)
This returns a generator (stream), so it's best used within a custom recipe, but you can call list() on it to convert it to a list, as in the quick check below. This may be helpful if you need to split sentences on your own.
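For example, continuing the snippet above:

for eg in list(stream):
    print(eg["text"])  # one task per sentence, subject to min_length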
As you may have noticed, ner.manual doesn't do sentence splitting; however, ner.correct and ner.teach do it automatically by using split_sentences(). Hope this helps for future purposes!
Unfortunately, I'm not aware of one. Sounds like it would be an interesting experiment.