I would appreciate some guidance if you have the time.
As a passion project, and as a complete newcomer, I am trying to use Prodigy as one of the main tools for building a web app that reads recipes, recognizes the ingredients, units of measurement, and amounts (numbers), and returns a nutritional profile of the recipe. To start, I have a CSV file of USDA data that includes more than 10,000 possible ingredients and food items and their nutritional contents.
So far, I have followed your YouTube tutorial (Training a NAMED ENTITY RECOGNITION MODEL with Prodigy and Transfer Learning - YouTube). It has helped me get started, but it also has me questioning my assumptions and how best to apply Prodigy's tools to achieve my goals. At first, I was tagging some data in this spreadsheet as INGS (ingredients) and rejecting others, as many are food items (e.g. Almond Joy candy bar) as opposed to actual ingredients (e.g. almonds). It is not always easy to make the distinction, as some recipes could call for a candy bar as an ingredient...
But I digress.
- To be clear, I think what I lack is an understanding of what the testing data vs. the validation data should be for this project.
I was going to use part of the spreadsheet as the testing data and the rest of the spreadsheet as the validation data, using Prodigy's terms.to-patterns recipe.
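To make that original plan concrete, here is a minimal sketch of the kind of split I had in mind. The column name `description`, the sample rows, and the 80/20 fraction are all placeholders of mine, not the real USDA layout.

```python
import csv
import io
import random

# Hypothetical stand-in for the USDA spreadsheet; the real file has
# 10,000+ rows and its column names may differ.
SAMPLE_CSV = """description
almonds
beans adzuki
almond joy candy bar
butter salted
oats rolled
"""


def split_rows(rows, held_out_fraction=0.2, seed=0):
    """Shuffle rows reproducibly and split them into two disjoint lists."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_held_out = max(1, int(len(shuffled) * held_out_fraction))
    return shuffled[n_held_out:], shuffled[:n_held_out]


rows = [row["description"] for row in csv.DictReader(io.StringIO(SAMPLE_CSV))]
main_rows, held_out_rows = split_rows(rows)
print(len(main_rows), "rows in the main part,", len(held_out_rows), "held out")
```

The fixed seed is only there so that the same rows end up in the same part every time I rerun it.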
Now, I am thinking I should change my approach, based on my understanding of the video, and after my first attempts to tag the data.
- Now, I am assigning an ING tag to all the ingredient items and rejecting the food items for the ENTIRE spreadsheet. I plan on using this data - all the ING data in the spreadsheet - as the training data. For my validation data, I now plan to use a free Kaggle dataset of actual recipes - to see how well it "reads" the ingredient information.
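In case it helps to see what I mean, here is a rough sketch of how the accepted ING items could be written out as a patterns file. I am assuming a JSONL layout with one `label` and one `pattern` entry per line, and token patterns matched on lowercase text. The whitespace tokenization, the helper name `term_to_pattern`, and the two example terms are my own simplifications.

```python
import json


def term_to_pattern(term, label="ING"):
    """Build one token-based match pattern from a term (assumed format)."""
    # Naive whitespace split; a real tokenizer may split differently.
    tokens = [{"lower": token} for token in term.lower().split()]
    return {"label": label, "pattern": tokens}


# Hypothetical examples of items I accepted as ingredients.
accepted_terms = ["almonds", "adzuki beans"]

# One JSON object per line, which is what a JSONL file contains.
lines = [json.dumps(term_to_pattern(term)) for term in accepted_terms]
print("\n".join(lines))
```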
So it is a basic question, I hope, about what constitutes testing/training data, as opposed to validation data, for this project from a complete newcomer.
Does this sound like a better approach given the project's goals? Is there another that springs to mind?
- I have another fundamental question, but I don't think it's a "Prodigy" question unless I am missing something. One problem with the spreadsheet is that it presents some ingredient data with the words not in natural order: it places the noun before the adjective, the reverse of normal English. So an ingredient like "adzuki beans" is presented as "beans adzuki." I am assuming this is a question for the search engine module (Python's FuzzyWuzzy) and whether or not it recognizes strings with the words reversed.
But I am throwing it out there in case you have some insights.
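To show the word-order problem concretely, here is a small sketch that sorts the words before comparing, so that "adzuki beans" and "beans adzuki" look identical. It uses only the standard library's `difflib`. I believe FuzzyWuzzy's `fuzz.token_sort_ratio` follows a similar idea, but the helper names below are mine and this is not how that library is implemented.

```python
from difflib import SequenceMatcher


def token_sort_key(text):
    """Lowercase the text and put its words in alphabetical order."""
    return " ".join(sorted(text.lower().split()))


def order_insensitive_ratio(a, b):
    """Similarity between 0 and 1 that ignores the order of the words."""
    return SequenceMatcher(None, token_sort_key(a), token_sort_key(b)).ratio()


def best_match(query, candidates):
    """Return the candidate that is most similar to the query."""
    return max(candidates, key=lambda c: order_insensitive_ratio(query, c))


# Hypothetical spreadsheet entries written noun-first.
entries = ["beans adzuki", "beans black", "almonds"]
print(best_match("adzuki beans", entries))
```

Sorting the words loses any meaning carried by their order, which seems acceptable for short ingredient names but might not be for longer descriptions.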
Now that I have Prodigy up and running, I am having a blast! I am hoping you can answer some of these basic questions, even though some are more conceptual than technical. I just don't want to start over because I made some wrong assumptions as a beginner about how Prodigy can help to achieve the project's overall goals.
Thank you (in advance) for any insights you might have.
Your feedback is most appreciated.