Tutorial help: can't find recipe

I was doing this tutorial
I put in this command
prodigy "image-caption caption_data_tmp "F:\Projects\cat_tut\Images" "F:\Projects\cat_tut\recipe.py"

What did I do wrong?

hi @GitMatt-design,

It looks like you have an incorrect " in front of your recipe name: "image-caption, where it should likely be image-caption. I can't confirm exactly since I can't see your custom recipe.

What if you remove the " from the recipe name?

Hi @ryanwesslen

This is straight from the github and I am just lost on what command line to use where between in VSC or in the command line. Right now all I want to do with prodigy is be able to load up with ANY recipe so I can begin training spacy on anything. I have been reading the documentation. I am just little lost with everything.

hi @GitMatt-design,

Sorry for the confusion. I meant removing the " in your CLI command.

For example, this is what you ran:

python -m prodigy "image-caption caption_data_tmp "F:\Projects\cat_tut\Images" "F:\Projects\cat_tut\recipe.py"

Can you try this?

python -m prodigy image-caption caption_data_tmp "F:\Projects\cat_tut\Images" "F:\Projects\cat_tut\recipe.py"

I'm sorry you're having trouble.

If you want the simplest example, could you start with the Prodigy 101 example?

  1. Save this file: news_headlines.jsonl (19.5 KB) into a local folder.
  2. Run:
python -m prodigy ner.manual ner_news_headlines blank:en ./news_headlines.jsonl --label PERSON,ORG,PRODUCT,LOCATION
  3. Annotate.
  4. To train, run:
python -m prodigy train ./output_dir --ner ner_news_headlines

Two quick questions @ryanwesslen
Do I run prodigy from home?
Where does it look for files?

I am getting file not found.

hi @GitMatt-design!

So typically you'd want to run prodigy in the folder where your files are.

For example (assuming you're on a Mac), let's say you've created a folder ~/Downloads/my_prodigy_project and that folder contains your .jsonl file: ~/Downloads/my_prodigy_project/news_headlines.jsonl.

If you start a new command-line session, you'll usually start in your home folder. You can print your current working directory with pwd:

$ pwd

You can also view the folder by running ls:

$ ls
Applications	Library			Public
Desktop			Movies			
Documents		Music
Downloads		Pictures

Then you can cd into your folder and ls to check the file is there:

$ cd ~/Downloads/my_prodigy_project
$ ls
news_headlines.jsonl

Now you can run:

$ python -m prodigy ner.manual ner_news_headlines blank:en ./news_headlines.jsonl --loader jsonl --label PERSON,ORG,PRODUCT,LOCATION

✨  Starting the web server at http://localhost:8080 ...
Open the app in your browser and start annotating!

Alternatively, you can point to the absolute path of your file:

$ python -m prodigy ner.manual ner_news_headlines blank:en ~/Downloads/my_prodigy_project/news_headlines.jsonl --loader jsonl --label PERSON,ORG,PRODUCT,LOCATION

Both of these examples assume you've already installed and activated a virtual environment with Prodigy in it, and that you don't have any python alias issues (if so, you may need to run python3 instead).
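To see how a relative path like ./news_headlines.jsonl gets resolved against your current working directory, here's a small Python sketch (the filename is just the example from above):

```python
from pathlib import Path

# A relative path is resolved against the current working directory,
# which is why cd-ing into the project folder first matters.
rel = Path("./news_headlines.jsonl")
print(rel.resolve())  # absolute path: <current working dir>/news_headlines.jsonl
print(rel.exists())   # False unless the file really is in the cwd
```

This is exactly what Prodigy's loader does when it can't find your file: the relative path points somewhere other than where the file lives.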

Does this help?

That helped, but I am not sure why I got this.

It's likely because you have a space in your path. Try putting double quotes around the path: "F:\ner test\news_headlines.jsonl"
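To see why the quotes matter, here's a quick Python sketch using the stdlib shlex module. It mimics POSIX shell splitting rather than Windows cmd exactly, but the space behavior is the same idea (posix=False keeps the backslashes literal):

```python
import shlex

# Without quotes, the space in "ner test" splits the path into two arguments:
unquoted = shlex.split(r"F:\ner test\news_headlines.jsonl", posix=False)
# With quotes, the whole path stays a single argument:
quoted = shlex.split(r'"F:\ner test\news_headlines.jsonl"', posix=False)

print(unquoted)  # ['F:\\ner', 'test\\news_headlines.jsonl']
print(quoted)    # ['"F:\\ner test\\news_headlines.jsonl"']
```

So without quotes, Prodigy receives `F:\ner` as the file path, which doesn't exist.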

That did it and thank you so much for the help. Will it tell me where it saves the training data? I know I have to load it into spacy to train it at all.

By default, Prodigy will save your data into a SQLite database. It is saved as a "dataset" with the name you provided in your command. It looks like you provided ner_news_headlines.

Your SQLite database is saved in your Prodigy home location. You can find this by running prodigy stats and looking for the Prodigy Home location.

$ prodigy stats

============================== ✨  Prodigy Stats ==============================

Version          1.13.0                        
Location         /Users/ryan/Downloads/my_prodigy_project/venv/lib/python3.9/site-packages/prodigy
Prodigy Home     /Users/ryan/.prodigy          
Platform         macOS-13.4.1-arm64-arm-64bit  
Python Version   3.9.17                        
Spacy Version    3.6.1                         
Database Name    SQLite                        
Database Id      sqlite                        
Total Datasets   96                            
Total Sessions   274      

For me, running a Mac, it is in /Users/ryan/.prodigy, which includes a file named prodigy.db.
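If you're curious, you can peek inside prodigy.db with Python's built-in sqlite3 module. This is just a read-only sketch; the ~/.prodigy path is the assumed default home location from above, and the table layout is an internal detail, so prefer db-out for real exports:

```python
import sqlite3
from pathlib import Path

def list_tables(db_path):
    """Return the table names in a SQLite database file."""
    conn = sqlite3.connect(db_path)
    try:
        rows = conn.execute(
            "SELECT name FROM sqlite_master WHERE type='table' ORDER BY name"
        ).fetchall()
    finally:
        conn.close()
    return [name for (name,) in rows]

# Assumed default location; check `prodigy stats` for your actual Prodigy Home
db_file = Path.home() / ".prodigy" / "prodigy.db"
if db_file.exists():
    print(list_tables(db_file))
```

This works on any SQLite file, so you can also use it to sanity-check a copy of the database before moving it between machines.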

Yep - let me review a few options you have once you have your annotations.

The easiest way to view your annotations is to export them to a .jsonl file using the db-out recipe (see this example):

prodigy db-out ner_news_headlines > ./annotations.jsonl
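Each line of the exported .jsonl file is one JSON task. As a sketch of working with it (the field names here, text, spans, and answer, follow Prodigy's NER annotation format, but the sample records are made up, not your real data):

```python
import json

# Made-up examples in the shape db-out produces for NER annotations
sample_lines = [
    '{"text": "Apple hires Tim", "spans": [{"start": 0, "end": 5, "label": "ORG"}], "answer": "accept"}',
    '{"text": "Rainy day", "spans": [], "answer": "reject"}',
]

def count_labels(lines):
    """Count span labels across accepted examples only."""
    counts = {}
    for line in lines:
        eg = json.loads(line)
        if eg.get("answer") != "accept":
            continue
        for span in eg.get("spans", []):
            counts[span["label"]] = counts.get(span["label"], 0) + 1
    return counts

print(count_labels(sample_lines))  # {'ORG': 1}
```

To run it on a real export, read the lines with `open("annotations.jsonl")` instead of the inline sample.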

However, that's an optional step. If you want to train a model, you have two options:

  1. use prodigy train:

Like this example, you'd then run:

prodigy train --ner ner_news_headlines

You may also want to add an (optional, but likely needed) argument specifying where to save the model, so you'll likely want to run:

prodigy train ./output_dir --ner ner_news_headlines

It's important to know that prodigy train is a simplified wrapper around spacy train. This is nice when you're beginning, as it uses a default config so you don't have to worry about the details of spacy train and spaCy config files, which can be a bit complex at first.

  2. For more intermediate to advanced users, you can alternatively use data-to-spacy and then spacy train directly when you want to train.

For example:

prodigy data-to-spacy ./corpus --ner ner_news_headlines

This will create a default spaCy config file, a labels file, and two spaCy binary files with your annotated data: one for training (train) and one for evaluation (dev). Note that this follows the better practice of keeping a held-out evaluation set, while prodigy train will randomly partition your data each time. You can then run spacy train:

spacy train ./corpus/config.cfg --paths.train ./corpus/train.spacy --paths.dev ./corpus/dev.spacy

Feel free to try other recipes like train-curve:

prodigy train-curve --ner ner_news_headlines

Or, once you have a new model in the folder ./output_dir (or whatever name you gave it), you can load one of the models (e.g., ./output_dir/model-best) and use it to annotate more examples, this time using a correct recipe (e.g., ner.correct) and specifying your new model ./output_dir/model-best instead of a blank English model.

There are a lot of possibilities and workflows you can iterate on.

We've even created this NER annotation flowchart to show you several possible paths (be sure to save the pdf and check out the hyperlinks embedded in many of the decision boxes):

I'd encourage you to keep searching through the docs and support forum for other users' tips and suggestions. This is a good first step, but pretty soon you'll find more advanced things you can do, like setting up a spaCy project for a Prodigy workflow; see this demo project that integrates many of the steps I mentioned above into one project.


Thank you again for all the help; I will read the docs. I might as well ask you the question I have been stuck on for years: how would I do this category?

My plan is to OCR and use NER for item and price; that bit is self-explanatory. The thing is, how do I label the below items as part of the same group? I believe that is a span. I just wanted your thoughts.

Somewhat related, we've had a similar post on invoice parsing:

I'd recommend reviewing LJ's project where he uses a HuggingFace model that considers both text and image.


I'd recommend first trying to reproduce the project. If you clone the repo and set up the requirements (including installing tesseract), you should be able to reproduce it. Then you could try to modify the project, switching out the original data for your own. This is a bit of an advanced project, as the task you're doing can be tricky. You may also want to use Prodigy v1.11.14 (not v1.12) if you reproduce this project, as there could be breaking changes from how we reworked streams in v1.12.

Hope this helps!