Split a ner.manual dataset into smaller texts

Hi @dave-espinosa!

Thanks for your question!

I think I found a solution for #1:

I ran your code (FYI, you're missing from spacy.tokens import Span) and ignored the last two lines, as you can skip the intermediate step of saving the binary to disk.

import srsly

examples = []  # annotation examples in Prodigy's task format
for doc in docs:  # docs: the spaCy Doc objects created in your existing code
    spans = [{"start": ent.start_char, "end": ent.end_char, "label": ent.label_} for ent in doc.ents]
    examples.append({"text": doc.text, "spans": spans, "answer": "accept"})

srsly.write_jsonl("data.jsonl", examples)

This is nearly identical to the link above, but the one addition is an "answer": "accept" key on each example, assuming you want to accept all the examples (which may not be the case). You can then load the file using db-in. Can you confirm whether this solves your problem?
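
For reference, loading that file from the command line could look something like this (my_ner_data is just a placeholder dataset name):

prodigy db-in my_ner_data ./data.jsonl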

I think you can shrink this code even more by replacing the srsly.write_jsonl and db-in steps and loading the examples directly into Prodigy's database using the database's add_examples method.
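
A minimal sketch of that approach could look like this (my_ner_data is just a placeholder dataset name, and set_hashes is there to make sure each example carries the hashes Prodigy expects):

from prodigy import set_hashes
from prodigy.components.db import connect

db = connect()                                  # connect to the database configured for Prodigy
examples = [set_hashes(eg) for eg in examples]  # add input/task hashes to each example
db.add_dataset("my_ner_data")                   # create the dataset if it doesn't exist yet
db.add_examples(examples, datasets=["my_ner_data"])  # add the examples built above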

Were you splitting your text by sentences? If so, did you see the split_sentences function in Prodigy?

from prodigy.components.preprocess import split_sentences
import spacy

nlp = spacy.load("en_core_web_sm")
stream = [{"text": "spaCy is a library. It is written in Python."}]
stream = split_sentences(nlp, stream, min_length=30)

This returns a generator (stream), so it is best used within a custom recipe, but you can call list() on it to convert it to a list. This may be helpful if you need to split sentences on your own. As you may have noticed, ner.manual doesn't do the sentence splitting; however, ner.correct and ner.teach do it automatically by using split_sentences(). Hope this helps for future purposes!
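
For instance, continuing the snippet above, you could materialize the generator and look at the resulting tasks like this:

tasks = list(stream)       # consume the generator into a list of tasks
for task in tasks:
    print(task["text"])    # each task now contains a single sentence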

Unfortunately, I'm not aware of anything like that. It sounds like it would be an interesting experiment, though.