Mohammad Cannon
Answered question
2022-06-07
I recently had an idea for an app that I would like to start developing for personal use and development. It would present you with recipe ideas for lunch, dinner, etc., and learn your preferences by recording your responses. I was thinking it would do this by recording very specific details of each recipe, such as carb count, calorie count, protein count, etc. (factors which might act as determinants of our preference). The program would then run an OLS regression with the probability of being chosen as the dependent variable (for which we will have data, since we know which recipes the user rejected, which he accepted, and how many times). We would then have various independent variables with which we would try to create an unbiased estimator. We could then run all recipes to be presented through this regression and rank them by probability of being chosen, highest to lowest. Would this be a viable thing to do? If not, why not, and what might work better?
Answer & Explanation
Belen Bentley
2022-06-08
You could use an OLS regression for this, or you could just use a machine learning algorithm. A decision tree will probably work just as well as any regression model (and possibly better). If you definitely want to use a parametric regression model, there are three pretty standard models with probability as an outcome. What you're describing is a linear probability model. Let $Y$ be a binary dependent variable (1 if the recipe is chosen, 0 otherwise) and $X$ the vector of covariates. Then, the model has the form
$$P[Y=1 \mid X=x] = x^\top \beta = \beta_0 + \beta_1 x_1 + \dots + \beta_k x_k.$$
One nice thing about this type of model is that the coefficients are very easy to understand. So, for example, if $x_1$ represents calories, then $\hat{\beta}_1$ is the predicted change in the probability that $Y=1$ associated with an increase of 1 calorie, holding the other covariates fixed. It's also very easy to compute the predicted values, since each one is just a product of two vectors. The major drawback is that there's no limit on the predicted outcome. Imagine plugging in a meal with an absurd number of calories. Then, depending on whether $\hat{\beta}_1$ is positive or negative, we could end up with $P[Y=1 \mid X=x] > 1$ or $P[Y=1 \mid X=x] < 0$, which shouldn't be possible for a probability. The way to combat this issue is to use a probit or logit model. A probit model has the following form:
$$P[Y=1 \mid X=x] = \Phi(x^\top \beta),$$
where $\Phi$ is the standard normal cumulative distribution function. A logit model has the form
$$P[Y=1 \mid X=x] = \frac{\exp(x^\top \beta)}{1 + \exp(x^\top \beta)}.$$
Both the probit and logit models restrict the predicted probability to the interval [0,1], but on the flip side, the coefficients are more difficult to interpret: a coefficient no longer gives the change in probability directly, because that change depends on where you start on the curve. The difference between probit and logit is in the assumption you're making about the distribution of the error term (in the latent-variable formulation of the model). With probit, you're assuming the errors are normally distributed, and with logit, you're assuming they have a logistic distribution. In practice, the two distributions are pretty similar, and you'll likely get very similar predicted probabilities from the two models.
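To make the difference concrete, here is a minimal Python sketch that fits both a linear probability model and a logit model and then ranks a few recipes. Everything in it is made up for illustration: the data, the single covariate (calories in hundreds of kcal), the recipe names, and the function names `fit_lpm`, `fit_logit` and `rank_recipes`. A real app would have several covariates and would probably rely on a statistics library rather than a hand-rolled fit.

```python
import math

# Hypothetical history: calories (in hundreds of kcal) of recipes shown
# to the user, and whether each one was accepted (1) or rejected (0).
CALORIES = [2, 3, 4, 5, 6, 7, 8, 9]
ACCEPTED = [1, 1, 1, 0, 1, 0, 0, 0]


def fit_lpm(xs, ys):
    """OLS fit of the linear probability model P[Y=1|x] = b0 + b1*x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b1 = sxy / sxx
    b0 = mean_y - b1 * mean_x
    return b0, b1


def sigmoid(z):
    """Logistic function, written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def fit_logit(xs, ys, iterations=25):
    """Maximum-likelihood fit of P[Y=1|x] = sigmoid(b0 + b1*x) by Newton's method."""
    b0 = b1 = 0.0
    for _ in range(iterations):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for x, y in zip(xs, ys):
            p = sigmoid(b0 + b1 * x)
            w = p * (1.0 - p)
            g0 += y - p            # gradient of the log-likelihood
            g1 += (y - p) * x
            h00 += w               # entries of X'WX (negative Hessian)
            h01 += w * x
            h11 += w * x * x
        det = h00 * h11 - h01 * h01
        # Newton step: beta <- beta + (X'WX)^{-1} * gradient
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    return b0, b1


def rank_recipes(recipes, b0, b1):
    """Sort (name, calories) pairs by predicted logit probability, highest first."""
    scored = [(name, sigmoid(b0 + b1 * x)) for name, x in recipes]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


lpm_b0, lpm_b1 = fit_lpm(CALORIES, ACCEPTED)
logit_b0, logit_b1 = fit_logit(CALORIES, ACCEPTED)

# An absurd 3000 kcal meal: compare what each model predicts.
extreme = 30
lpm_pred = lpm_b0 + lpm_b1 * extreme
logit_pred = sigmoid(logit_b0 + logit_b1 * extreme)
print(f"LPM prediction:   {lpm_pred:.3f}")
print(f"logit prediction: {logit_pred:.3g}")

# Hypothetical candidate recipes to rank for the user.
candidates = [("burger", 9.0), ("salad", 3.0), ("pasta", 6.5)]
for name, p in rank_recipes(candidates, logit_b0, logit_b1):
    print(f"{name}: {p:.2f}")
```

On this toy data the linear model's prediction for the 3000 kcal meal is negative, while the logit prediction stays between 0 and 1. Note that if all you need is the ranking, the two models will often agree on the order. The out-of-range problem matters mostly when you want to interpret the scores as actual probabilities.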