Custom loaders

Hello,
In terms.py it would be possible to support fast custom loaders.
Original code:

def train_vectors(output_model, source=None, loader=None, spacy_model=None,
                  lang='xx', size=128, window=5, min_count=10, negative=5,
                  n_iter=2, n_workers=4, merge_ents=False, merge_nps=False):
    """Train word vectors from a text source."""
    log("RECIPE: Starting recipe terms.train-vectors", locals())
    if spacy_model is None:
        nlp = spacy.blank(lang)
        print("Using blank spaCy model ({})".format(lang))
        nlp.add_pipe(nlp.create_pipe('sentencizer'))
        log("RECIPE: Added sentence boundary detector to blank model")
    else:
        nlp = spacy.load(spacy_model)
    if merge_ents:
        nlp.add_pipe(preprocess.merge_entities, name='merge_entities')
        log("RECIPE: Added pipeline component to merge entities")
    if merge_nps:
        nlp.add_pipe(preprocess.merge_noun_chunks, name='merge_noun_chunks')
        log("RECIPE: Added pipeline component to merge noun chunks")
    Word2Vec = get_word2vec()
    if not output_model.exists():
        output_model.mkdir(parents=True)
        log("RECIPE: Created output directory")
    sentences = SentenceIterator(
        nlp, lambda: get_stream(source, loader=loader, input_key='text'))
    w2v = Word2Vec(sentences, size=size, window=window, min_count=min_count,
                   sample=1e-5, iter=n_iter, workers=n_workers,
                   negative=negative)
    log("RECIPE: Resetting vectors with size {}".format(size))
    nlp.vocab.reset_vectors(width=size)
    log("RECIPE: Adding {} vectors to model vocab".format(len(w2v.wv.vocab)))
    for word in w2v.wv.vocab:
        nlp.vocab.set_vector(word, w2v.wv.word_vec(word))
    nlp.to_disk(output_model)
    prints('Trained Word2Vec model', output_model.resolve())
    return False

And change it to something like this (the change is near SentenceIterator):

def train_vectors(output_model, source=None, loader=None, spacy_model=None,
                  lang='xx', size=128, window=5, min_count=10, negative=5,
                  n_iter=2, n_workers=4, merge_ents=False, merge_nps=False):
    """Train word vectors from a text source."""
    log("RECIPE: Starting recipe terms.train-vectors", locals())
    if spacy_model is None:
        nlp = spacy.blank(lang)
        print("Using blank spaCy model ({})".format(lang))
        nlp.add_pipe(nlp.create_pipe('sentencizer'))
        log("RECIPE: Added sentence boundary detector to blank model")
    else:
        nlp = spacy.load(spacy_model)
    if merge_ents:
        nlp.add_pipe(preprocess.merge_entities, name='merge_entities')
        log("RECIPE: Added pipeline component to merge entities")
    if merge_nps:
        nlp.add_pipe(preprocess.merge_noun_chunks, name='merge_noun_chunks')
        log("RECIPE: Added pipeline component to merge noun chunks")
    Word2Vec = get_word2vec()
    if not output_model.exists():
        output_model.mkdir(parents=True)
        log("RECIPE: Created output directory")
    if not callable(loader):
        sentences = SentenceIterator(
            nlp, lambda: get_stream(source, loader=loader, input_key='text'))
    else:
        sentences = SentenceIterator(nlp, loader)
    w2v = Word2Vec(sentences, size=size, window=window, min_count=min_count,
                   sample=1e-5, iter=n_iter, workers=n_workers,
                   negative=negative)
    log("RECIPE: Resetting vectors with size {}".format(size))
    nlp.vocab.reset_vectors(width=size)
    log("RECIPE: Adding {} vectors to model vocab".format(len(w2v.wv.vocab)))
    for word in w2v.wv.vocab:
        nlp.vocab.set_vector(word, w2v.wv.word_vec(word))
    nlp.to_disk(output_model)
    prints('Trained Word2Vec model', output_model.resolve())
    return False

So we could import the recipe and run it with a custom callable.
Or even better: write the loader code in a file, use the -F option, and pass the loader name we gave to the function (but I don’t know how :/)
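To illustrate the idea, a custom callable loader for the patched recipe above would just be any function that yields dicts with a "text" key. This is a hypothetical sketch (the function name and the data are made up, not part of the recipe):

```python
# custom_loaders.py -- hypothetical example of a callable loader
# that the patched recipe above could accept directly.

def my_corpus_loader():
    """Yield examples in Prodigy's expected {'text': ...} stream format."""
    texts = ["First paragraph.", "Second paragraph."]  # stand-in data
    for text in texts:
        yield {"text": text}

# The patched recipe would then build the iterator like:
#     sentences = SentenceIterator(nlp, my_corpus_loader)
examples = list(my_corpus_loader())
print(examples[0]["text"])  # prints "First paragraph."
```

Because the loader is a generator function, the recipe can re-invoke it for each pass over the corpus instead of holding everything in memory.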

I hope I understand your question correctly – but the latest version of Prodigy supports providing custom loaders via Python entry points. Prodigy will automatically check for entry points registered as prodigy_loaders and will then allow you to use them via --loader some_loader_name etc. You can find more details on how this works in your PRODIGY_README.html.
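For reference, registering a loader via entry points typically means adding something like this to your package's setup.cfg (the package, module, and loader names below are placeholders, not taken from the README):

```ini
[options.entry_points]
prodigy_loaders =
    my_custom_loader = my_package.loaders:my_custom_loader
```

After installing the package, the loader should then be usable as --loader my_custom_loader.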

Here's the relevant section: [screenshot of the entry points section from PRODIGY_README.html]


Hi Ines,

I was wondering if you have any recommendations on loading ‘text’ from an (Azure) SQL DB containing preprocessed paragraphs for labeling and training a classifier?

How many results can Prodigy handle if a query returns a large result set? Should the query results be served in batches to the custom_api_loader?

Cheers,
Seb

@seb I’m no SQL database expert, but as far as Prodigy is concerned, it shouldn’t matter that much. Prodigy will consume the examples from the stream generator in batches, so I guess it makes sense to also query your DB in batches. For example, load a few hundred rows, loop over them and yield a dict with the text. Then request the next few hundred once the previous examples are consumed. At least, this seems more reasonable to me than loading everything at once and keeping it all in memory.
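The batched-query pattern described above could be sketched roughly like this, using sqlite3 as a stand-in for the actual Azure SQL driver (e.g. pyodbc); the table and column names are invented for the example:

```python
import sqlite3

def db_stream(conn, batch_size=200):
    """Yield {'text': ...} dicts, fetching rows from the DB in batches."""
    cursor = conn.cursor()
    cursor.execute("SELECT paragraph FROM documents")
    while True:
        # Only batch_size rows are pulled into memory at a time; the next
        # batch is requested once the previous examples are consumed.
        rows = cursor.fetchmany(batch_size)
        if not rows:
            break
        for (paragraph,) in rows:
            yield {"text": paragraph}

# Example with an in-memory database
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE documents (paragraph TEXT)")
conn.executemany("INSERT INTO documents VALUES (?)",
                 [("First paragraph.",), ("Second paragraph.",)])
stream = db_stream(conn, batch_size=1)
print(next(stream)["text"])  # prints "First paragraph."
```

With a real driver, only the connect call and the SQL dialect would change; the generator itself is what Prodigy consumes as the stream.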

Just did a quick Google search and the Python code example in the Azure docs looks surprisingly straightforward. So I hope this won’t be too difficult to implement 🙂

Thanks Ines! Yeah, I was thinking about doing it that way. Cheers


Is this still accurate in 2024? I want to add a custom loader via entry points, and cannot find anything in the documentation.

Hi @nrodnova,

Yes, the entry points are still in place, so you can use them to register a custom loader (although you might get a deprecation warning, as we've reimplemented the entire Stream class and custom loaders still belong to the old implementation).
You can find some information on entry points here and some more on writing custom loaders here. Let us know if that's not sufficient and you need extra details, but the structure of the setup.cfg in Ines' answer is still relevant.