Custom loaders

usage
(Grégory Howard) #1

Hello,
In terms.py it would be possible to support fast custom loaders.
Original code:

def train_vectors(output_model, source=None, loader=None, spacy_model=None,
                  lang='xx', size=128, window=5, min_count=10, negative=5,
                  n_iter=2, n_workers=4, merge_ents=False, merge_nps=False):
    """Train word vectors from a text source."""
    log("RECIPE: Starting recipe terms.train-vectors", locals())
    if spacy_model is None:
        nlp = spacy.blank(lang)
        print("Using blank spaCy model ({})".format(lang))
        nlp.add_pipe(nlp.create_pipe('sentencizer'))
        log("RECIPE: Added sentence boundary detector to blank model")
    else:
        nlp = spacy.load(spacy_model)
    if merge_ents:
        nlp.add_pipe(preprocess.merge_entities, name='merge_entities')
        log("RECIPE: Added pipeline component to merge entities")
    if merge_nps:
        nlp.add_pipe(preprocess.merge_noun_chunks, name='merge_noun_chunks')
        log("RECIPE: Added pipeline component to merge noun chunks")
    Word2Vec = get_word2vec()
    if not output_model.exists():
        output_model.mkdir(parents=True)
        log("RECIPE: Created output directory")
    sentences = SentenceIterator(
        nlp, lambda: get_stream(source, loader=loader, input_key='text'))
    w2v = Word2Vec(sentences, size=size, window=window, min_count=min_count,
                   sample=1e-5, iter=n_iter, workers=n_workers,
                   negative=negative)
    log("RECIPE: Resetting vectors with size {}".format(size))
    nlp.vocab.reset_vectors(width=size)
    log("RECIPE: Adding {} vectors to model vocab".format(len(w2v.wv.vocab)))
    for word in w2v.wv.vocab:
        nlp.vocab.set_vector(word, w2v.wv.word_vec(word))
    nlp.to_disk(output_model)
    prints('Trained Word2Vec model', output_model.resolve())
    return False

And change it to something like this (the change is near SentenceIterator):

def train_vectors(output_model, source=None, loader=None, spacy_model=None,
                  lang='xx', size=128, window=5, min_count=10, negative=5,
                  n_iter=2, n_workers=4, merge_ents=False, merge_nps=False):
    """Train word vectors from a text source."""
    log("RECIPE: Starting recipe terms.train-vectors", locals())
    if spacy_model is None:
        nlp = spacy.blank(lang)
        print("Using blank spaCy model ({})".format(lang))
        nlp.add_pipe(nlp.create_pipe('sentencizer'))
        log("RECIPE: Added sentence boundary detector to blank model")
    else:
        nlp = spacy.load(spacy_model)
    if merge_ents:
        nlp.add_pipe(preprocess.merge_entities, name='merge_entities')
        log("RECIPE: Added pipeline component to merge entities")
    if merge_nps:
        nlp.add_pipe(preprocess.merge_noun_chunks, name='merge_noun_chunks')
        log("RECIPE: Added pipeline component to merge noun chunks")
    Word2Vec = get_word2vec()
    if not output_model.exists():
        output_model.mkdir(parents=True)
        log("RECIPE: Created output directory")
    if not callable(loader):
        sentences = SentenceIterator(
            nlp, lambda: get_stream(source, loader=loader, input_key='text'))
    else:
        sentences = SentenceIterator(nlp, loader)
    w2v = Word2Vec(sentences, size=size, window=window, min_count=min_count,
                   sample=1e-5, iter=n_iter, workers=n_workers,
                   negative=negative)
    log("RECIPE: Resetting vectors with size {}".format(size))
    nlp.vocab.reset_vectors(width=size)
    log("RECIPE: Adding {} vectors to model vocab".format(len(w2v.wv.vocab)))
    for word in w2v.wv.vocab:
        nlp.vocab.set_vector(word, w2v.wv.word_vec(word))
    nlp.to_disk(output_model)
    prints('Trained Word2Vec model', output_model.resolve())
    return False

That way we could import the recipe and run it with our own loader.
Or even better: write the loader code in a file, use the -F option, and pass the name of the loader function (but I don't know how to do that :/)
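For illustration, here is a minimal sketch of what a callable loader passed to the modified recipe above might look like (the import path, texts, and output directory are hypothetical; the loader just needs to yield dicts with a 'text' key):

```python
from pathlib import Path

# Hypothetical import path -- in practice the modified recipe would live
# in your copy of Prodigy's terms.py or in a file loaded with -F.
# from prodigy.recipes.terms import train_vectors

def my_sentences():
    """A callable loader: a generator of dicts with a 'text' key."""
    for text in ["First paragraph of raw text.", "Second paragraph."]:
        yield {"text": text}

# With the modified recipe, the callable is passed straight through:
# train_vectors(Path("/tmp/vectors_model"), loader=my_sentences)
```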

(Ines Montani) #2

I hope I understand your question correctly – but the latest version of Prodigy supports providing custom loaders via Python entry points. Prodigy will automatically check for entry points registered as prodigy_loaders and will then allow you to use them via --loader some_loader_name etc. You can find more details on how this works in your PRODIGY_README.html.

Here’s the relevant section: [screenshot of the custom loaders section from PRODIGY_README.html]

#3

Hi Ines,

I was wondering if you have any recommendations on loading ‘text’ from an (Azure) SQL DB containing preprocessed paragraphs for labeling and training a classifier?

How many results can Prodigy handle, considering that a query may return a large result set? Should the query results be served to the custom_api_loader in batches?

Cheers,
Seb

(Ines Montani) #4

@seb I’m no SQL database expert, but as far as Prodigy is concerned, it shouldn’t matter that much. Prodigy will consume the examples from the stream generator in batches, so I guess it makes sense to also query your DB in batches. For example, load a few hundred rows, loop over them and yield a dict with the text. Then request the next few hundred once the previous examples are consumed. At least, this seems more reasonable to me than loading everything once and then keeping it all in memory.
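A sketch of that batching pattern using the standard DB-API fetchmany() interface (sqlite3 here to keep it self-contained; a pyodbc connection to Azure SQL would work the same way; the table and column names are made up):

```python
import sqlite3

def stream_paragraphs(conn, batch_size=500):
    """Yield {'text': ...} dicts, pulling rows from the DB in batches."""
    cursor = conn.cursor()
    cursor.execute("SELECT paragraph FROM documents")  # hypothetical schema
    while True:
        # Only request the next few hundred rows once the previous
        # batch has been consumed, instead of loading everything at once.
        rows = cursor.fetchmany(batch_size)
        if not rows:
            break
        for (paragraph,) in rows:
            yield {"text": paragraph}
```

Because the function is a generator, Prodigy pulls from it lazily, so only one batch of rows sits in memory at a time.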

Just did a quick Google search and the Python code example in the Azure docs looks surprisingly straightforward. So I hope this won’t be too difficult to implement 🙂

#5

Thanks Ines! Yeah, I was thinking about doing it that way. Cheers
