I am using Prodigy to train an NER model with 4 different labels. My first approach did not give good enough results, and after checking this flowchart I realized it might be worth training another model, this time on shorter texts. I used Prodigy's `ner.manual` to build my original training dataset (with "large" texts), which gave me a JSON Lines file in which each sample is a dictionary with keys including `spans` (not available if the text was rejected or ignored) and `_timestamp`, and their corresponding values. I then wrote my own function (as I did not find one in either Prodigy or spaCy) to split each text from that dataset into smaller ones, adjusting the spans of the labeled entities accordingly. What I have now looks like this (code borrowed from your "spaCy Advanced Course, chapter 4"):
```python
import spacy
from spacy.tokens import DocBin, Span

nlp = spacy.blank("en")

# New "smaller" Docs with updated entity spans
doc1 = nlp("Python and Java, are two very important assets in the role.")
doc1.ents = [Span(doc1, 0, 1, label="LABEL1"), Span(doc1, 2, 3, label="LABEL2")]

doc2 = nlp("As a data scientist, you are expected to do that and beyond.")
doc2.ents = [Span(doc2, 2, 4, label="LABEL3")]

# A sample without any entities
doc3 = nlp("I need a new phone! Any tips?")
doc3.ents = []

docs = [doc1, doc2, doc3]

# With all the docs...
dataset = DocBin(docs=docs)
dataset.to_disk("./dataset.spacy")
```
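For context, the splitting step described above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the actual function: the name `split_task`, the regex-based sentence boundary, and the choice to drop any span that crosses a boundary are all hypothetical. It works directly on Prodigy-style dictionaries with `text` and character-offset `spans`.

```python
import re


def split_task(task, pattern=r"(?<=[.!?])\s+"):
    """Split one annotated task into smaller tasks, shifting span offsets.

    Hypothetical helper: `pattern` is a naive sentence boundary (whitespace
    that follows ., ! or ?). Spans crossing a boundary are dropped.
    """
    text = task["text"]
    spans = task.get("spans", [])  # rejected/ignored samples may have no spans

    # Collect the (start, end) character range of each chunk.
    ranges = []
    start = 0
    for match in re.finditer(pattern, text):
        ranges.append((start, match.start()))
        start = match.end()
    ranges.append((start, len(text)))

    tasks = []
    for chunk_start, chunk_end in ranges:
        if chunk_start == chunk_end:
            continue  # skip empty chunks
        chunk_spans = [
            {
                "start": span["start"] - chunk_start,
                "end": span["end"] - chunk_start,
                "label": span["label"],
            }
            for span in spans
            if span["start"] >= chunk_start and span["end"] <= chunk_end
        ]
        tasks.append({"text": text[chunk_start:chunk_end], "spans": chunk_spans})
    return tasks
```

Each returned dictionary keeps only `text` and the re-offset `spans`; any other keys from the original sample would have to be copied over separately.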
But now, I am a bit lost, more specifically:
- How do I "convert" that `dataset.spacy` file into something compatible with Prodigy? Is there a more direct approach than what I am doing here? (I could convert what I currently have into the `ner_manual` format instead, but then I don't know how to load the generated JSON Lines file with `db-in`: the documentation does not include any example, and the description is not quite self-explanatory.)
- Is there any "recommended ratio" between samples "with and without" entity spans?
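To make the first question concrete, one possible route is to write each `Doc` back out as a JSON Lines record with character-offset spans. The sketch below is an assumption about what such a conversion could look like, not a confirmed Prodigy utility: `doc_to_task` and `docbin_to_jsonl` are hypothetical names, and the records carry only `text`, `spans` and `answer`.

```python
import json
from pathlib import Path


def doc_to_task(doc):
    """Turn one spaCy Doc into a Prodigy-style dict with character-offset spans."""
    spans = [
        {"start": ent.start_char, "end": ent.end_char, "label": ent.label_}
        for ent in doc.ents
    ]
    # Docs without entities still become records, with an empty span list.
    return {"text": doc.text, "spans": spans, "answer": "accept"}


def docbin_to_jsonl(docbin_path, jsonl_path, lang="en"):
    """Read a .spacy file and write one JSON record per Doc. Returns the count."""
    # Imported here so doc_to_task can be reused without spaCy installed.
    import spacy
    from spacy.tokens import DocBin

    nlp = spacy.blank(lang)
    doc_bin = DocBin().from_disk(docbin_path)
    count = 0
    with Path(jsonl_path).open("w", encoding="utf8") as f:
        for doc in doc_bin.get_docs(nlp.vocab):
            f.write(json.dumps(doc_to_task(doc)) + "\n")
            count += 1
    return count
```

With a file produced by something like `docbin_to_jsonl("./dataset.spacy", "./dataset.jsonl")`, the import would then presumably be `prodigy db-in my_dataset ./dataset.jsonl`, where `my_dataset` is a placeholder dataset name.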