Passing array into prodigy.serve()

Hello

I'm trying to automate a complete training pipeline in Python (without having to launch a bunch of command lines each time).

If I use this function, it works well:

prodigy.serve('ner.teach', 'fr_PRODUCT', 'fr_core_news_sm', 'test.jsonl',
                  None, None, ['PRODUCT'], None, None)

But is it possible to pass an object (i.e. an array of strings) instead of a JSONL file?

I have a function like this that I could use if possible.

def create_jsonl(sentences):
    results = []
    for s in sentences:
        results.append({'text':s, "meta":{"source":"My database"}})

    return '\n'.join([json.dumps(line, ensure_ascii=False) for line in results])

Thanks again for your help

I think I found it :slightly_smiling_face: :

prodigy.serve('custom.ner.teach', 'fr_PRODUCT', 'fr_core_news_sm', sentences, ['PRODUCT'])

@prodigy.recipe('custom.ner.teach',
                dataset=prodigy.recipe_args['dataset'],
                spacy_model=prodigy.recipe_args['spacy_model'],
                database=("Database to connect to", "positional", None, str),
                label=prodigy.recipe_args['label_set'])
def custom_ner_teach(dataset, spacy_model, database, label=None):
    stream = ({'text': row} for row in database)
    components = teach(dataset=dataset, spacy_model=spacy_model,
                       source=stream, label=label)
    return components

Is this correct?


Yes, that’s the first thing I would have suggested, too :slightly_smiling_face:
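For reference, here is a minimal sketch of building such a stream from an in-memory list of strings. The helper name make_stream and the "meta" source label are illustrative assumptions, not part of Prodigy; the only thing taken from the thread is that each task is a dict with a 'text' key.

```python
def make_stream(sentences, source="My database"):
    """Yield one task dict per non-empty sentence (hypothetical helper)."""
    for sentence in sentences:
        text = sentence.strip()
        if text:  # skip blank rows so they never reach the annotation UI
            yield {"text": text, "meta": {"source": source}}


sentences = ["Apple makes iPhones", "   ", "Citroën présente la Ami One"]
stream = list(make_stream(sentences))
```

A generator keeps memory use flat for large databases; it is materialised with list() here only so the result is easy to inspect.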


Thanks Ines, I’ll be ready to do the Prodigy support soon :smile:

I’m digging into this and I would like to add patterns to my recipe.

I managed to do that using:

@prodigy.recipe('custom.ner.teach',
                dataset=prodigy.recipe_args['dataset'],
                spacy_model=prodigy.recipe_args['spacy_model'],
                database=("Database to connect to", "positional", None, str),
                label=prodigy.recipe_args['label_set'],
                patterns=prodigy.recipe_args['patterns'])
def custom_ner_teach(dataset, spacy_model, database, label, patterns):
    stream = ({'text': row} for row in database)
    components = teach(dataset=dataset, spacy_model=spacy_model,
                       source=stream, label=label, patterns=patterns)
    return components

But is there a way to pass an object (an array) to patterns instead of a file?

Thanks

I dug a little more and it seems there is no way to pass patterns directly, so I will export them to a temporary file.

Maybe that could be an option to add in a future release!

Yep, at the moment, it expects to load those from a file.
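As a rough sketch of the temporary-file workaround, a list of pattern dicts can be written out as JSONL with the standard library and the resulting path passed wherever a patterns file is expected. The helper name patterns_to_tempfile is an assumption made up for this example.

```python
import json
import os
import tempfile


def patterns_to_tempfile(patterns):
    """Write pattern dicts to a temporary JSONL file and return its path."""
    fd, path = tempfile.mkstemp(suffix=".jsonl")
    with os.fdopen(fd, "w", encoding="utf8") as f:
        for pattern in patterns:
            # One JSON object per line, which is what a JSONL file contains
            f.write(json.dumps(pattern, ensure_ascii=False) + "\n")
    return path


patterns = [
    {"label": "ORG", "pattern": [{"lower": "lydia"}]},
    {"label": "PRODUCT", "pattern": [{"lower": "apple"}, {"lower": "pay"}]},
]
patterns_path = patterns_to_tempfile(patterns)

# Read the file back to check that the round trip preserves the patterns
with open(patterns_path, encoding="utf8") as f:
    reloaded = [json.loads(line) for line in f]

# Remove the file once nothing needs it any more
os.remove(patterns_path)
```

In a real pipeline the file would have to stay on disk until the recipe has loaded it, so the cleanup step belongs after the server has started or shut down.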

It's possible that you're at a point now where you find it easier to write your own recipe based on the built-in ner.teach. Then you can load things however you want and also add some of your own logic.

See this page for more details on custom recipes. Also, if you haven't seen it yet, this repo has various recipe scripts, including slightly simplified versions of the built-in recipes with explanations, so you can see what's going on:

The ner.teach example:


Great idea! Thank you, Ines.

Hi

I’m a little confused about this decorator thing.

I created a custom ner_teach.py file:

@prodigy.recipe('my.ner.teach',
                dataset=("The dataset to use", "positional", None, str),
                spacy_model=("The base model", "positional", None, str),
                database=("The source data as a JSONL file", "positional", None, str),
                label=("One or more comma-separated labels", "option", "l", split_string),
                patterns=("Optional match patterns", "option", "p", str),
                exclude=("Names of datasets to exclude", "option", "e", split_string)
                )
def my_ner_teach(dataset, spacy_model, database, label, patterns, exclude=None):

    stream = ({'text': row} for row in database)

    nlp = spacy.load(spacy_model)
    model = EntityRecognizer(nlp, label=label)

    matcher = PatternMatcher(nlp).add_patterns(patterns)
    # Combine the NER model and the matcher and interleave their
    # suggestions and update both at the same time
    predict, update = combine_models(model, matcher)

    # Use the prefer_uncertain sorter to focus on suggestions that the model
    # is most uncertain about (i.e. with a score closest to 0.5). The model
    # yields (score, example) tuples and the sorter yields just the example
    stream = prefer_uncertain(predict(stream))

    return {
        'view_id': 'ner',  # Annotation interface to use
        'dataset': dataset,  # Name of dataset to save annotations
        'stream': stream,  # Incoming stream of examples
        'update': update,  # Update callback, called with batch of answers
        'exclude': exclude,  # List of dataset names to exclude
        'config': {  # Additional config settings, mostly for app UI
            'lang': nlp.lang,
            'label': ', '.join(label) if label is not None else 'all'
        }
    }

In this file, I add a list of patterns using PatternMatcher(nlp).add_patterns(patterns). I hope it will work…

But how do I call this function (from my main function)?

I have this, but of course prodigy.serve() cannot find ‘my.ner.teach’:

import prodigy
import ner_teach
      
prodigy.serve('my.ner.teach', lang + '_' + label, 'fr_core_news_sm',
              sentences, [label], patterns)

Thanks

The @prodigy.recipe decorator registers the recipe so you can refer to it by its string name my.ner.teach. But in order to do that, it needs to be run.

In the file you’re calling prodigy.serve, are you actually importing the my_ner_teach recipe function?
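To illustrate why the module has to be imported first, here is a toy registry that mimics the general decorator-registration pattern. It is a simplified assumption for illustration only, not Prodigy's actual implementation; the names _RECIPES, recipe and get_recipe are made up.

```python
_RECIPES = {}  # hypothetical registry mapping recipe names to functions


def recipe(name):
    """Decorator factory that registers the decorated function under `name`."""
    def decorator(func):
        _RECIPES[name] = func  # registration happens when this line executes
        return func
    return decorator


def get_recipe(name):
    """Look a recipe up by its string name, or return None if unknown."""
    return _RECIPES.get(name)


# Before the decorated definition has been executed, the name is unknown
before = get_recipe("my.ner.teach")


@recipe("my.ner.teach")
def my_ner_teach(dataset):
    return {"dataset": dataset}


# Executing the definition (e.g. by importing its module) registers it
after = get_recipe("my.ner.teach")
```

The decorator body only runs when Python executes the module that contains the decorated function, which is why an import that looks unused can still be required.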

Ok, I’m not far away…

I have the ner_teach.py file:

import prodigy
import spacy

from prodigy.components.sorters import prefer_uncertain
from prodigy.models.matcher import PatternMatcher
from prodigy.models.ner import EntityRecognizer
from prodigy.util import combine_models, split_string


@prodigy.recipe('sentinel.ner.teach',
                dataset=("The dataset to use", "positional", None, str),
                spacy_model=("The base model", "positional", None, str),
                database=("The source data as a JSONL file", "positional", None, str),
                label=("One or more comma-separated labels", "option", "l", split_string),
                patterns=("Optional match patterns", "option", "p", str)
                )
def sentinel_ner_teach(dataset, spacy_model, database, label, patterns):

    print(database)
    stream = ({'text': row} for row in database)

    nlp = spacy.load(spacy_model)
    model = EntityRecognizer(nlp, label=label)

    print(patterns)
    matcher = PatternMatcher(nlp).add_patterns(patterns)
    predict, update = combine_models(model, matcher)

    # predict = model
    # update = model.update
    stream = prefer_uncertain(predict(stream))

    return {
        'view_id': 'ner',
        'dataset': dataset,
        'stream': stream,
        'update': update,
        'config': {
            'lang': nlp.lang,
            'label': ', '.join(label) if label is not None else 'all'
        }
    }

And I call it from another file:

from sentinel.ml.ner_teach import sentinel_ner_teach
(...)
prodigy.serve('sentinel.ner.teach', my_dataset, 'fr_core_news_sm', sentences, [label], patterns)

First problem: my IDE (PyCharm) says that the import is unused and wants to remove it. Any idea how to fix that?

Second problem: I get this error:

 Exception when serving /get_questions
(...)
ValueError: Error while validating stream: no first example. This likely means that your stream is empty.

As you can see in the first code block, if I print patterns, I get:

[{'label': 'ORG', 'pattern': [{'lower': 'lydia'}]}, {'label': 'PRODUCT', 'pattern': [{'lower': 'google'}, {'lower': 'hangouts'}, {'lower': 'chat'}]}, {'label': 'PRODUCT', 'pattern': [{'lower': 'watchos'}]}, {'label': 'PRODUCT', 'pattern': [{'lower': 'amazon'}, {'lower': 'fresh'}]}, {'label': 'ORG', 'pattern': [{'lower': 'bain'}, {'lower': 'capital'}, {'lower': 'ventures'}]}, {'label': 'PRODUCT', 'pattern': [{'lower': 'apple'}, {'lower': 'news'}]}, {'label': 'PRODUCT', 'pattern': [{'lower': 'microsoft'}, {'lower': 'windows'}, {'lower': '10'}]}, {'label': 'PRODUCT', 'pattern': [{'lower': 'apple'}, {'lower': 'pay'}]}]

And database contains a list of sentences:

['La liaison entre la Ami One et les smartphones est annoncée comme important','Un point important à retenir est que la Ami One est une voiture sans permis.', 'Citroën affirme cependant que son véhicule\xa0«\xa0dispose de sa propre signature sonore', 'Citroën a tout de même eu la bonne idée d’installer des sièges légèrement décalés l’un de l’autre, afin d’éviter de gêner les mouvements du conducteur.']

I’ll keep searching, but if you have an idea…

I mean, this is just a warning and your editor trying to help you write better code. In this case, the unused import is intentional. I don't know much about PyCharm, but linters usually let you add a comment to the line to specify that what you did there was intentional and that it should stop warning you about it. The standard Python way of doing this is adding # noqa, but it might be # noinspection in PyCharm?

This is trying to tell you that the stream of incoming examples is invalid. So that's whatever is in sentences (and later used in your recipe as database).

So for some reason, what the recipe returns as the stream is empty. If the data you're passing in isn't empty, it's likely that the model and patterns with the given labels do not produce any candidates to send out.
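One way to narrow this down is to look at the first item of the stream at each stage without consuming it. The sketch below uses only the standard library; peek is a hypothetical helper name, and the example stream is made up.

```python
from itertools import chain


def peek(stream):
    """Return (first_item, stream) with the first item put back.

    If the stream is empty, return (None, empty iterator).
    """
    iterator = iter(stream)
    try:
        first = next(iterator)
    except StopIteration:
        return None, iter(())
    # Re-attach the consumed item so downstream code sees the full stream
    return first, chain([first], iterator)


raw = ({"text": t} for t in ["Apple makes iPhones", "David Bowie"])
first, raw = peek(raw)

# An empty generator yields no first item
empty_first, _ = peek(x for x in [])

restored = list(raw)
```

Calling peek on the raw stream and again on the stream after the model and matcher have filtered it shows at which stage the examples disappear.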

Thanks Ines

For the import, I added # noinspection PyUnresolvedReferences and it works.

I checked the examples and inserted this code to test:

texts = ['This is a text about David Bowie', 'Apple makes iPhones']
stream = [{'text': text} for text in texts]

patterns = [{'label': 'PERSON', 'pattern': 'David Bowie'},
             {'label': 'ORG', 'pattern': [{'lower': 'apple'}]}]

But I get the same result.
I tried to set the recipe before calling it:

prodigy.set_recipe('sentinel.ner.teach', ner_teach.sentinel_ner_teach)
prodigy.serve('sentinel.ner.teach', my_dataset, 'fr_core_news_sm', sentences, [label], patterns)

Same error…

I think I'm not importing the decorated function the right way, and maybe I have another problem somewhere…

Ok,

If I use a file for patterns, it’s fine:

matcher = PatternMatcher(nlp).from_disk(patterns_file)

But using a list does not seem to work:

patterns_list = [{'label': 'PERSON', 'pattern': 'David Bowie'},
             {'label': 'ORG', 'pattern': [{'lower': 'apple'}]}]
matcher = PatternMatcher(nlp).add_patterns(patterns_list)

Any idea how to use the add_patterns() method (or another method)?

Thanks

I think the problem with your code here is that your variable matcher now becomes the return value of PatternMatcher.add_patterns. Unlike from_disk (which always returns the object itself), add_patterns is just a regular method that adds the patterns and returns nothing. So you want to be doing something like this:

matcher = PatternMatcher(nlp)  # this is now the actual matcher
matcher.add_patterns(patterns_list)
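The same pitfall can be reproduced with a small stand-in class. FakeMatcher below is purely hypothetical and is not Prodigy's PatternMatcher; it only shows the difference between a method that mutates in place and one that returns the object.

```python
class FakeMatcher:
    """Hypothetical stand-in used only to illustrate return values."""

    def __init__(self):
        self.patterns = []

    def add_patterns(self, patterns):
        # Mutates the object in place; with no return statement,
        # the call evaluates to None
        self.patterns.extend(patterns)

    def load(self, patterns):
        # Returns the object itself, so the call can be chained
        self.patterns.extend(patterns)
        return self


# Chaining on a method that returns nothing loses the object
chained = FakeMatcher().add_patterns([{"label": "ORG"}])

# Keeping a reference first, then calling the method, works
matcher = FakeMatcher()
matcher.add_patterns([{"label": "ORG"}])

# Chaining is fine when the method returns the object
fluent = FakeMatcher().load([{"label": "ORG"}])
```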

Thanks a lot Ines, it seems to work now!

I'll continue my implementation :slight_smile: