Split a ner.manual dataset into smaller texts

Hi @dave-espinosa!

Thanks for your question!

I think I found a solution for #1:

I ran your code (FYI, you're missing from spacy.tokens import Span) and ignored the last two lines, as you can skip the intermediate step of saving the binary to disk.

import srsly

examples = []  # annotation examples in Prodigy's task format
for doc in docs:  # docs: the spaCy Doc objects created in your existing code
    spans = [{"start": ent.start_char, "end": ent.end_char, "label": ent.label_} for ent in doc.ents]
    examples.append({"text": doc.text, "spans": spans, "answer": "accept"})

srsly.write_jsonl("data.jsonl", examples)

This is nearly identical to the link above, but the one addition is an "answer": "accept" key on each example, assuming you want to accept all the examples (which may not be the case). You can then load the file using db-in. Can you confirm whether this solves your problem?
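
For reference, loading that file from the command line could look something like this (my_ner_data is just a placeholder dataset name):

prodigy db-in my_ner_data ./data.jsonl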

I think you can shrink this code even more by replacing the srsly.write_jsonl and db-in steps and loading the examples directly into Prodigy's database using the database's add_examples method.
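
A minimal sketch of that approach could look like this (my_ner_data is just a placeholder dataset name, and set_hashes is there to make sure each example carries the hashes Prodigy expects):

from prodigy import set_hashes
from prodigy.components.db import connect

db = connect()                                  # connect to the database configured for Prodigy
examples = [set_hashes(eg) for eg in examples]  # add input/task hashes to each example
db.add_dataset("my_ner_data")                   # create the dataset if it doesn't exist yet
db.add_examples(examples, datasets=["my_ner_data"])  # add the examples built above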

Were you splitting your text by sentences? If so, did you see the split_sentences function in Prodigy?

from prodigy.components.preprocess import split_sentences
import spacy

nlp = spacy.load("en_core_web_sm")
stream = [{"text": "spaCy is a library. It is written in Python."}]
stream = split_sentences(nlp, stream, min_length=30)

This returns a generator (stream), so it is best used within a custom recipe, but you can call list() on it to convert it to a list. This may be helpful if you need to split sentences on your own. As you may have noticed, ner.manual doesn't do the sentence splitting; however, ner.correct and ner.teach do it automatically by using split_sentences(). Hope this helps for future purposes!
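
For instance, continuing the snippet above, you could materialize the generator and look at the resulting tasks like this:

tasks = list(stream)       # consume the generator into a list of tasks
for task in tasks:
    print(task["text"])    # each task now contains a single sentence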

Unfortunately, I'm not aware of anything like that. It sounds like it would be an interesting experiment, though.