Question about Testing/Training vs. Validation Data

I would appreciate some guidance if you have the time.

As a passion project (and as a complete newcomer), I am trying to use Prodigy as one important tool to create a web app that reads recipes, recognizes the ingredients, units of measurement, and amounts (numbers), and returns a nutritional profile of the recipe. To start, I have a CSV file of USDA data that includes more than 10,000 possible ingredients and food items and their nutritional contents.

So far, I have followed your YouTube tutorial (Training a NAMED ENTITY RECOGNITION MODEL with Prodigy and Transfer Learning - YouTube). It's helped get me started, but it also has me questioning my assumptions and how best to apply Prodigy's tools to achieve my goals. At first, I was tagging some data in this spreadsheet as INGS (ingredients) and rejecting others, as many are food items (e.g. Almond Joy candy bar) as opposed to actual ingredients (e.g. almonds). It is not always easy to make the distinction, as some recipes could call for a candy bar as an ingredient...

But I digress.

  1. To be clear, I think what I lack is an understanding of what the testing data vs. the validation data should be for this project.

I was going to use part of the spreadsheet as the testing data, and the rest of the spreadsheet as the validation data, using Prodigy's terms.to-patterns recipe.

Now, I am thinking I should change my approach, based on my understanding of the video, and after my first attempts to tag the data.

  2. Now, I am assigning an ING tag to all the ingredient items and rejecting the food items for the ENTIRE spreadsheet. I plan on using this data - all the ING data in the spreadsheet - as the training data. For my validation data, I now plan to use a free Kaggle dataset of actual recipes - to see how well the model "reads" the ingredient information.

So it is, I hope, a basic question from a complete newcomer about what constitutes testing/training data, as opposed to validation data, for this project.

Does this sound like a better approach given the project's goals? Is there another approach that springs to mind?

  3. I have another, fundamental question, but I don't think it's a "Prodigy" question unless I am missing something. One problem with the spreadsheet is that it presents some ingredient data with the words not in natural order. In English, this means it places the noun before the adjective, so an ingredient like "adzuki beans" is presented as "beans adzuki." I am assuming this is a question for the search engine module (the Python fuzzywuzzy library) and whether or not it recognizes reversed strings.
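
For example, if I understand the library correctly, fuzzywuzzy's token_sort_ratio should ignore word order, so something like this sketch might be enough on the matching side (please correct me if I'm on the wrong track):

```python
from fuzzywuzzy import fuzz  # or the newer package, thefuzz

# fuzz.ratio compares the raw strings, so reversed word order hurts the score.
# fuzz.token_sort_ratio sorts the tokens in both strings before comparing,
# so "beans adzuki" and "adzuki beans" count as a perfect match.
print(fuzz.ratio("adzuki beans", "beans adzuki"))             # lower score
print(fuzz.token_sort_ratio("adzuki beans", "beans adzuki"))  # 100
```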

But I am throwing it out there in case you have some insights.

Now that I have Prodigy up and running, I am having a blast! I am hoping you can answer some of these basic questions, even though some are more conceptual than technical. I just don't want to start over because I made some wrong assumptions as a beginner about how Prodigy can help to achieve the project's overall goals.

Thank you (in advance) for any insights you might have.

Your feedback is most appreciated.

Yay, that's great to hear! :tada:

Yes, this is also something I noticed when I did the annotation, and I think you'll have to accept some ambiguity here. It's kinda inherent to language, because language just doesn't always map neatly into categories. But even just going through this process and labelling some real-world data can be incredibly helpful, because it lets you check whether your label scheme and distinctions make sense, and helps you challenge any assumptions that might not map to what's in the data. This is super important in the development process and also something that Prodigy can help with :slightly_smiling_face:

At a minimum, you typically need two datasets at the end: one for training, which is used to update the model, and one for evaluation, which the model doesn't get to see and which is used to compare the model's predictions against unseen examples. This is how you calculate the accuracy score at the end. So your evaluation data should be representative of what your model will see at runtime (because otherwise, your accuracy score won't be very meaningful).
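
As a toy sketch of the idea (the annotation below is a made-up example, roughly in the usual "text" plus "spans" format):

```python
import random

# Split the annotated examples into a training portion (used to update the
# model) and a held-out evaluation portion (only used afterwards to score
# the trained model's predictions against the "correct" annotations).
annotated = [
    {"text": "2 cups adzuki beans", "spans": [{"start": 7, "end": 19, "label": "ING"}]},
    # ... more annotated examples
]
random.shuffle(annotated)
split = int(len(annotated) * 0.8)
train_examples = annotated[:split]
eval_examples = annotated[split:]
```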

While you can create those separately, you don't actually need to in Prodigy, especially during development. A common strategy is to just hold back a certain percentage of your examples for evaluation and only train from the rest. When you run prodigy train with no dedicated evaluation set, it does this automatically and holds back a percentage for you (which you can customise via the --eval-split setting). So if you have 1000 examples, you might train on 800 and hold 200 back for evaluation.
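
For example, with a recent Prodigy version (the dataset name ingredient_ner is just a placeholder for whatever you called your dataset), this would train on 80% of the annotations and evaluate on the held-out 20%:

```
prodigy train ./output_dir --ner ingredient_ner --eval-split 0.2
```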

Once you're getting serious about evaluation, it's usually a good idea to use a single, dedicated evaluation set that doesn't change. This lets you compare the performance of your model as you collect more data in your training set. (Because if your evaluation data changes with every run, your accuracy comparison wouldn't really be meaningful.)
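
If you do create a dedicated evaluation dataset later on (for example, annotations on your Kaggle recipes), you should be able to pass it to prodigy train explicitly with the eval: prefix – again, the dataset names below are just placeholders, and it's worth double-checking the train docs for the exact syntax of your version:

```
prodigy train ./output_dir --ner ingredient_ner,eval:ingredient_ner_eval
```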

Ideally, you want the training and evaluation data to be similar to what your model will see at runtime. So you should annotate the same data with the same categories, and use one portion for training and one portion for evaluation.

This is definitely something the model can learn – after all, the idea of named entity recognition is to generalise to unseen examples based on examples the model was trained on. So even if those expressions weren't originally captured by your rules, you can still label them manually and if there are enough of them in your training data, the model will be able to recognise similar phrases that follow a similar pattern.
