How to run Prodigy commands from a Python script instead of the command line?
Yes, that’s possible: check out the documentation of the prodigy.serve function in your PRODIGY_README.html. The function is currently very simple, because we weren’t sure how useful it’d be to people, so you currently have to pass in all recipe arguments in order as positional arguments (or None if you don’t want to set them). For example:
import prodigy
prodigy.serve('ner.teach', 'dataset', 'en_core_web_sm', 'data.jsonl',
              None, None, ['PERSON', 'ORG'], None, None)
Alternatively, you can also call Prodigy in a subprocess, or use a library like fabric3 (the Python 3-compatible fork of Fabric) to build more complex command pipelines. The solution you choose really depends on what you’re trying to do, and what workflow you prefer.
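For the subprocess route, here is a minimal sketch. It assumes Prodigy is installed in the same environment and can be started with `python -m prodigy`, and that the recipe accepts its options as `--name value` flags (for example `--label` for ner.teach). The helper names `build_prodigy_command` and `run_prodigy` are made up for this example and are not part of Prodigy.

```python
import subprocess
import sys


def build_prodigy_command(recipe, *args, **options):
    """Build the argument list for running a Prodigy recipe in a subprocess."""
    # Assumption: Prodigy can be invoked as "python -m prodigy <recipe> ..."
    command = [sys.executable, "-m", "prodigy", recipe, *map(str, args)]
    for name, value in options.items():
        # Turns label="PERSON,ORG" into the two arguments: --label PERSON,ORG
        command.extend(["--" + name.replace("_", "-"), str(value)])
    return command


def run_prodigy(recipe, *args, **options):
    """Run the recipe and block until the Prodigy process exits."""
    command = build_prodigy_command(recipe, *args, **options)
    return subprocess.run(command, check=True)
```

Calling `run_prodigy("ner.teach", "dataset", "en_core_web_sm", "data.jsonl", label="PERSON,ORG")` would then mirror the `prodigy.serve` call above, with the server running in its own process.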
Great! Thanks.
How do I use db-in from a Python script?
@akshitasood63 I moved your question here because it fits better in this topic than the one on catastrophic forgetting.
You can check out the source of the db-in command in __main__.py, or see the PRODIGY_README.html for the API docs of the database methods. This will let you interact with the database from within a Python script. Here’s a simple example:
from prodigy import set_hashes
from prodigy.components.db import connect
db = connect() # this uses the DB settings in your prodigy.json
# load your examples – from a file or however else you want to
# just make sure they're in Prodigy's JSONL format. You can also use
# one of the built-in loaders like JSONL or CSV (see API docs)
examples = [{'text': 'Hello world', 'answer': 'accept'},
            {'text': 'Another example', 'answer': 'reject'}]
# hash the examples to make sure they all have a unique task hash
# and input hash – this is used by Prodigy to distinguish between
# annotations on the same input data
examples = [set_hashes(eg) for eg in examples]
# add examples to the dataset
db.add_examples(examples, datasets=['your_dataset_name'])
Note that the add_examples method expects the dataset to already exist in the database. If you want to add examples to a new set, you’ll need to create it first:
db.add_dataset('your_dataset_name')
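The two steps can be combined into one small helper. This is only a sketch: `add_to_dataset` is a made-up name, and it assumes the database object exposes the existing dataset names as `db.datasets`. Check the database API docs in your README before relying on that property.

```python
def add_to_dataset(db, examples, dataset):
    """Add examples to a dataset, creating the dataset first if needed."""
    # Assumption: db.datasets lists the names of all existing datasets
    if dataset not in db.datasets:
        db.add_dataset(dataset)
    db.add_examples(examples, datasets=[dataset])
```

With the `db` and hashed `examples` from the snippet above, `add_to_dataset(db, examples, 'your_dataset_name')` would work for both new and existing datasets.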
Great. Thanks for explaining.
And what about db-out?
For example, if I want to save annotations from the ner.teach recipe.
Yes, that’s no problem either (see the source of the db-out recipe or the database methods in the README). The get_dataset method takes the name of a dataset and returns a list of examples, which you can then save to a file, for example as JSON or JSONL (newline-delimited JSON, Prodigy’s preferred format):
examples = db.get_dataset('your_dataset_name')
examples will be a list of dictionaries, with each dictionary describing one annotation example.
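Saving that list as JSONL only needs the standard library. A minimal sketch follows; the helper names `write_jsonl` and `read_jsonl` are made up for this example and are not Prodigy functions.

```python
import json


def write_jsonl(path, examples):
    """Write one JSON object per line (newline-delimited JSON)."""
    with open(path, "w", encoding="utf-8") as f:
        for eg in examples:
            f.write(json.dumps(eg, ensure_ascii=False) + "\n")


def read_jsonl(path):
    """Read a JSONL file back into a list of dictionaries."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]
```

For example, `write_jsonl('annotations.jsonl', db.get_dataset('your_dataset_name'))` would export the dataset to a file in the current directory.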
Really helpful. Thanks a lot.
A self-written recipe does not work with this function.
May I know the reason behind it?
If you're using a custom recipe and you want to load it by its name (e.g. custom-recipe), it needs to be registered globally first, so Prodigy knows which function to call. This is usually taken care of by the @prodigy.recipe decorator. So if you register your custom recipe first, calling prodigy.serve should work. For example:
import prodigy

@prodigy.recipe('custom-recipe')
def custom_recipe(dataset):  # etc.
    return {'dataset': dataset}  # etc.

prodigy.serve('custom-recipe', 'your_dataset_name')
The @prodigy.recipe decorator will register the recipe custom-recipe, so prodigy.serve can find it. Of course, you could also keep all your custom recipes in a separate module and import them from there, as long as you do that before calling prodigy.serve.
@ines After I am done with all the annotations and have saved them using Ctrl+S, how do I stop the serve function without explicitly interrupting the script using Ctrl+C?
I just want the rest of my script to be executed after the annotations are complete.
Thanks
By “the rest of the script”, you mean other code placed after prodigy.serve? You could probably catch KeyboardInterrupt, then execute your other logic and then terminate the process manually. Or you could just write your own logic that serves the app and includes hooks for starting and stopping. If you look at the app.py, you’ll see that it’s really pretty straightforward and doesn’t need a lot of code.
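The first suggestion could look roughly like this. It is a sketch under one assumption: that pressing Ctrl+C while the server is running surfaces as a KeyboardInterrupt in your script, which may depend on the web server being used. The function name `serve_then_continue` is made up for this example.

```python
def serve_then_continue(serve, after):
    """Run a blocking serve() call, then run after() once it is interrupted."""
    try:
        serve()
    except KeyboardInterrupt:
        # Ctrl+C ends up here instead of terminating the whole script
        pass
    return after()
```

You would pass the blocking call and your follow-up logic as two functions, for example `serve_then_continue(lambda: prodigy.serve('custom-recipe', 'your_dataset_name'), analyse_results)`, where `analyse_results` stands in for whatever should run after annotation.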
Hi
I have the same need: I launch a custom ner.teach recipe using the prodigy.serve('my.ner.teach') function and then move to my browser to annotate.
def ner_teach(dataset, spacy_model, source=None, label=None, patterns=None,
              exclude=None, unsegmented=False):
    logger.info("Entering ner.teach process")
    stream = source
    nlp = spacy.load(spacy_model)
    model = EntityRecognizer(nlp, label=label)
    if patterns is None:
        predict = model
        update = model.update
    else:
        matcher = PatternMatcher(nlp)
        matcher.add_patterns(patterns)
        predict, update = combine_models(model, matcher)
    if not unsegmented:
        stream = split_sentences(nlp, stream)
    stream = prefer_uncertain(predict(stream))
    return {
        'view_id': 'ner',    # Annotation interface to use
        'dataset': dataset,  # Name of dataset to save annotations
        'stream': stream,    # Incoming stream of examples
        'update': update,    # Update callback, called with batch of answers
        'exclude': exclude,  # List of dataset names to exclude
        'config': {          # Additional config settings, mostly for app UI
            'lang': nlp.lang,
            'label': ', '.join(label) if label is not None else 'all'
        }
    }
At the end, I get a “No tasks available.” message.
Is it possible to automate the “save to dataset” step and kill the server without having to save manually and stop the server using Ctrl+C in the console?
According to this post, it seems that we could do that using a modification of the custom recipe, but I don’t know how…
After the teaching, I would like to analyse the dataset, retrain and compare with the results from last training.
Thanks
"No tasks available" really means that the stream the recipe is returning doesn't include any examples. This can have several reasons: in your case, maybe the model and patterns don't produce any matches or suggestions for the given label. Or maybe all examples are already in your dataset, so there's nothing to send out.
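One way to narrow this down is to check whether the stream yields anything at all before returning it from the recipe. The helper below is a plain-Python sketch, not a Prodigy function, and `require_examples` is a made-up name. It only detects a completely empty stream; it cannot tell whether examples are later skipped because they are already in the dataset.

```python
from itertools import chain


def require_examples(stream):
    """Return an equivalent stream, or raise if the stream yields nothing."""
    stream = iter(stream)
    try:
        first = next(stream)
    except StopIteration:
        raise ValueError("Stream is empty: there are no tasks to annotate") from None
    # Put the consumed example back in front of the remaining ones
    return chain([first], stream)
```

Inside a recipe, this could wrap the final stream, e.g. `stream = require_examples(stream)`, so an empty stream fails loudly at startup instead of showing up as "No tasks available" in the browser.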
None of this is really specific to Prodigy. Basically, what you'd want to do here is find and kill the process that's running on the given host and port (e.g. 8080) from within Python. I just did a quick Google search and found this thread, maybe that helps? Is it possible in python to kill process that is listening on specific port, for example 8080? - Stack Overflow
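If your script starts the server itself, a different approach from looking up the port is to launch Prodigy as a child process and keep the handle, so the script can stop it directly. This is a sketch of that alternative, not of the port-based lookup from the linked thread. It assumes Prodigy can be started with `python -m prodigy`; `prodigy_command` and `stop_process` are made-up helper names.

```python
import subprocess
import sys


def prodigy_command(*recipe_args):
    """Build the command for starting a recipe in a separate process."""
    # Assumption: Prodigy can be invoked as "python -m prodigy <recipe> ..."
    return [sys.executable, "-m", "prodigy", *recipe_args]


def stop_process(process, timeout=10):
    """Ask a child process to terminate, and kill it if it does not exit."""
    process.terminate()
    try:
        process.wait(timeout=timeout)
    except subprocess.TimeoutExpired:
        process.kill()
        process.wait()
    return process.returncode
```

The script would start the server with `server = subprocess.Popen(prodigy_command('my.ner.teach', 'your_dataset_name'))`, wait until annotation is finished by whatever condition suits your workflow, then call `stop_process(server)` and continue with the analysis and retraining.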