Custom loaders

Hello,
In terms.py it would be possible to support fast custom loaders.
Original code:

def train_vectors(output_model, source=None, loader=None, spacy_model=None,
                  lang='xx', size=128, window=5, min_count=10, negative=5,
                  n_iter=2, n_workers=4, merge_ents=False, merge_nps=False):
    """Train word vectors from a text source."""
    log("RECIPE: Starting recipe terms.train-vectors", locals())
    if spacy_model is None:
        nlp = spacy.blank(lang)
        print("Using blank spaCy model ({})".format(lang))
        nlp.add_pipe(nlp.create_pipe('sentencizer'))
        log("RECIPE: Added sentence boundary detector to blank model")
    else:
        nlp = spacy.load(spacy_model)
    if merge_ents:
        nlp.add_pipe(preprocess.merge_entities, name='merge_entities')
        log("RECIPE: Added pipeline component to merge entities")
    if merge_nps:
        nlp.add_pipe(preprocess.merge_noun_chunks, name='merge_noun_chunks')
        log("RECIPE: Added pipeline component to merge noun chunks")
    Word2Vec = get_word2vec()
    if not output_model.exists():
        output_model.mkdir(parents=True)
        log("RECIPE: Created output directory")
    sentences = SentenceIterator(
        nlp, lambda: get_stream(source, loader=loader, input_key='text'))
    w2v = Word2Vec(sentences, size=size, window=window, min_count=min_count,
                   sample=1e-5, iter=n_iter, workers=n_workers,
                   negative=negative)
    log("RECIPE: Resetting vectors with size {}".format(size))
    nlp.vocab.reset_vectors(width=size)
    log("RECIPE: Adding {} vectors to model vocab".format(len(w2v.wv.vocab)))
    for word in w2v.wv.vocab:
        nlp.vocab.set_vector(word, w2v.wv.word_vec(word))
    nlp.to_disk(output_model)
    prints('Trained Word2Vec model', output_model.resolve())
    return False

And change it to something like this (the change is near SentenceIterator):

def train_vectors(output_model, source=None, loader=None, spacy_model=None,
                  lang='xx', size=128, window=5, min_count=10, negative=5,
                  n_iter=2, n_workers=4, merge_ents=False, merge_nps=False):
    """Train word vectors from a text source."""
    log("RECIPE: Starting recipe terms.train-vectors", locals())
    if spacy_model is None:
        nlp = spacy.blank(lang)
        print("Using blank spaCy model ({})".format(lang))
        nlp.add_pipe(nlp.create_pipe('sentencizer'))
        log("RECIPE: Added sentence boundary detector to blank model")
    else:
        nlp = spacy.load(spacy_model)
    if merge_ents:
        nlp.add_pipe(preprocess.merge_entities, name='merge_entities')
        log("RECIPE: Added pipeline component to merge entities")
    if merge_nps:
        nlp.add_pipe(preprocess.merge_noun_chunks, name='merge_noun_chunks')
        log("RECIPE: Added pipeline component to merge noun chunks")
    Word2Vec = get_word2vec()
    if not output_model.exists():
        output_model.mkdir(parents=True)
        log("RECIPE: Created output directory")
    if not callable(loader):
        sentences = SentenceIterator(
            nlp, lambda: get_stream(source, loader=loader, input_key='text'))
    else:
        sentences = SentenceIterator(nlp, loader)
    w2v = Word2Vec(sentences, size=size, window=window, min_count=min_count,
                   sample=1e-5, iter=n_iter, workers=n_workers,
                   negative=negative)
    log("RECIPE: Resetting vectors with size {}".format(size))
    nlp.vocab.reset_vectors(width=size)
    log("RECIPE: Adding {} vectors to model vocab".format(len(w2v.wv.vocab)))
    for word in w2v.wv.vocab:
        nlp.vocab.set_vector(word, w2v.wv.word_vec(word))
    nlp.to_disk(output_model)
    prints('Trained Word2Vec model', output_model.resolve())
    return False

So we could import the recipe and run it with a custom callable.
Or even better: write the loader code in a file, use the -F option, and pass the loader name we gave to the function (but I don’t know how :/)
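To illustrate the idea, a custom callable loader for the patched recipe above would just be any function that yields dicts with a "text" key. This is a hypothetical sketch (the function name and the data are made up, not part of the recipe):

```python
# custom_loaders.py -- hypothetical example of a callable loader
# that the patched recipe above could accept directly.

def my_corpus_loader():
    """Yield examples in Prodigy's expected {'text': ...} stream format."""
    texts = ["First paragraph.", "Second paragraph."]  # stand-in data
    for text in texts:
        yield {"text": text}

# The patched recipe would then build the iterator like:
#     sentences = SentenceIterator(nlp, my_corpus_loader)
examples = list(my_corpus_loader())
print(examples[0]["text"])  # prints "First paragraph."
```

Because the loader is a generator function, the recipe can re-invoke it for each pass over the corpus instead of holding everything in memory.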

I hope I understand your question correctly – but the latest version of Prodigy supports providing custom loaders via Python entry points. Prodigy will automatically check for entry points registered as prodigy_loaders and will then allow you to use them via --loader some_loader_name etc. You can find more details on how this works in your PRODIGY_README.html.
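For reference, registering a loader via entry points typically means adding something like this to your package's setup.cfg (the package, module, and loader names below are placeholders, not taken from the README):

```ini
[options.entry_points]
prodigy_loaders =
    my_custom_loader = my_package.loaders:my_custom_loader
```

After installing the package, the loader should then be usable as --loader my_custom_loader.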

Here's the relevant section: [screenshot of the entry points section from PRODIGY_README.html]


Hi Ines,

I was wondering if you have any recommendations on loading ‘text’ from an (Azure) SQL DB containing preprocessed paragraphs for labeling and training a classifier?

How many results can Prodigy handle if a query returns a large result set? Should the query results be served in batches to the custom_api_loader?

Cheers,
Seb

@seb I’m no SQL database expert, but as far as Prodigy is concerned, it shouldn’t matter that much. Prodigy will consume the examples from the stream generator in batches, so I guess it makes sense to also query your DB in batches. For example, load a few hundred rows, loop over them and yield a dict with the text. Then request the next few hundred once the previous examples are consumed. At least, this seems more reasonable to me than loading everything at once and keeping it all in memory.
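The batched-query pattern described above could be sketched roughly like this, using sqlite3 as a stand-in for the actual Azure SQL driver (e.g. pyodbc); the table and column names are invented for the example:

```python
import sqlite3

def db_stream(conn, batch_size=200):
    """Yield {'text': ...} dicts, fetching rows from the DB in batches."""
    cursor = conn.cursor()
    cursor.execute("SELECT paragraph FROM documents")
    while True:
        # Only batch_size rows are pulled into memory at a time; the next
        # batch is requested once the previous examples are consumed.
        rows = cursor.fetchmany(batch_size)
        if not rows:
            break
        for (paragraph,) in rows:
            yield {"text": paragraph}

# Example with an in-memory database
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE documents (paragraph TEXT)")
conn.executemany("INSERT INTO documents VALUES (?)",
                 [("First paragraph.",), ("Second paragraph.",)])
stream = db_stream(conn, batch_size=1)
print(next(stream)["text"])  # prints "First paragraph."
```

With a real driver, only the connect call and the SQL dialect would change; the generator itself is what Prodigy consumes as the stream.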

Just did a quick Google search and the Python code example in the Azure docs looks surprisingly straightforward. So I hope this won’t be too difficult to implement 🙂

Thanks Ines! Yeah, I was thinking about doing it that way. Cheers


Is this still accurate in 2024? I want to add a custom loader via entry points, and cannot find anything in the documentation.

Hi @nrodnova,

Yes, the entry points are still in place, so you can use them to register a custom loader (although you might get a deprecation warning, as we've reimplemented the entire Stream class and custom loaders still belong to the old implementation).
You can find some information on entry points here and some more on writing custom loaders here. Let us know if that's not sufficient and you need extra details, but the structure of the setup.cfg in Ines' answer is still relevant.