Why is standard deviation calculated differently for finding Z scores and confidence intervals?

Matthew Hubbard

Answered question

2022-04-06

Why is standard deviation calculated differently for finding Z scores and confidence intervals?
Suppose that, as a personnel director, you want to test the perception of fairness of two methods of performance evaluation. 63 of 78 employees rated Method 1 as fair, and 49 of 82 rated Method 2 as fair. A 99% confidence interval for $p_1 - p_2$ (where $p_1 = 63/78$ and $p_2 = 49/82$) is as follows:

$$p_1 - p_2 \pm 2.58 \sqrt{\frac{p_1 (1 - p_1)}{n_1} + \frac{p_2 (1 - p_2)}{n_2}}$$

$$0.029 \le p_1 - p_2 \le 0.391$$

At the 0.01 level of significance the Z score is

$$Z = \sqrt{\frac{p (1 - p)}{n_1} + \frac{p (1 - p)}{n_2}}$$

(where $p = (x_1 + x_2)/(n_1 + n_2) = 0.70$, but sometimes a different formula $p = (p_1 + p_2)/2$ is also used)

$$Z = 2.90$$
Both tests indicate that there is evidence of a difference.
But you could also compute the Z score using the standard deviation formula from the first method (the confidence interval), which gives 2.993. Why are the Z scores different? Where do the formulas for finding the standard deviation come from?
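For reference, here is a quick numerical check of all three quantities (the interval endpoints and both Z values); a minimal sketch using only Python's standard library, with variable names of my own choosing:

```python
from math import sqrt

x1, n1 = 63, 78   # Method 1: 63 of 78 rated it fair
x2, n2 = 49, 82   # Method 2: 49 of 82 rated it fair
p1, p2 = x1 / n1, x2 / n2
diff = p1 - p2

# Unpooled standard error, as used in the confidence interval
se_ci = sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
print(diff - 2.58 * se_ci, diff + 2.58 * se_ci)  # ~0.029, ~0.391

# Pooled standard error, as used in the hypothesis test
p = (x1 + x2) / (n1 + n2)                        # 0.70
se_test = sqrt(p * (1 - p) / n1 + p * (1 - p) / n2)
print(diff / se_test)   # ~2.90
print(diff / se_ci)     # ~2.99 (the 2.993 in the question, up to rounding)
```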

Answer & Explanation

Eliezer Olson

Beginner · 2022-04-07 · Added 16 answers

First of all, what you call a "Z-score" is not actually a z-score. Rather, it is a standard error. This quantity characterizes the variability in the point estimate, which in this case is $\hat p_1 - \hat p_2$, the difference in sample proportions representing the difference in fairness rates of the two methods.
Suppose we have $n_1$ and $n_2$ ratings for the two methods, respectively, and of these, $x_1$ and $x_2$ ratings are "fair." Then the number of fair responses for each method can be modeled as a binomial random variable, say

$$X_1 \sim \operatorname{Binomial}(n_1, p_1), \qquad X_2 \sim \operatorname{Binomial}(n_2, p_2),$$
where $p_1$ and $p_2$ represent the true (but unknown) proportions of employees rating the respective method as fair. To test whether both methods have equal fairness ratings, we estimate the difference in these proportions, $p_1 - p_2$. A natural choice of point estimate is simply the difference in observed proportions; i.e.,

$$\hat p_1 - \hat p_2 = \frac{x_1}{n_1} - \frac{x_2}{n_2}.$$
But such a point estimate does not tell us anything about how certain we can be of its value; e.g., if we had sampled only $n_1 = 3$ and $n_2 = 5$ employees, we could not expect much precision or confidence in the resulting estimate, compared to sampling, say, $n_1 = 300$ and $n_2 = 500$ employees.
To quantify the uncertainty as a function of the amount of data we observe (i.e., the sample sizes), we need to characterize the variance of the estimator $\hat p_1 - \hat p_2$. This is not mathematically difficult, but the answer depends on the model assumptions. For instance, the first assumption we made was that the number of fair ratings within each group is binomially distributed. But there is no requirement that this is so; e.g., you may have sampled a non-independent cohort of employees, where each one you surveyed told a coworker how they responded and possibly influenced other respondents' ratings. Or maybe you did not choose a representative cohort, and different groups of employees have different perspectives on the fairness of each method.
Additionally, your experimental design may influence the variability of your estimate: while not suggested by the data provided in your question, you could have collected a paired data set in which each surveyed employee is asked to rate both methods.
That said, suppose that (1) employees don't speak to each other about the survey, (2) each employee is asked about only one of the two methods, (3) the method an employee is asked to rate is chosen completely at random, independently of the choice of employee, and (4) each employee is chosen completely at random and is representative of the whole population. Under these assumptions the binomial model is reasonable, and we have
$$\operatorname{Var}[\hat p_1] = \operatorname{Var}[X_1 / n_1] = \frac{p_1 (1 - p_1)}{n_1},$$

and similarly,

$$\operatorname{Var}[\hat p_2] = \frac{p_2 (1 - p_2)}{n_2}.$$

Then, because the two samples are independent, the variance of the difference is the sum of the variances:

$$\operatorname{Var}[\hat p_1 - \hat p_2] = \frac{p_1 (1 - p_1)}{n_1} + \frac{p_2 (1 - p_2)}{n_2}.$$
Replacing the parameters with their point estimates, we obtain

$$\widehat{\operatorname{Var}}[\hat p_1 - \hat p_2] = \frac{\hat p_1 (1 - \hat p_1)}{n_1} + \frac{\hat p_2 (1 - \hat p_2)}{n_2}.$$
This is the square of the standard error of the point estimate; taking the square root gives the first formula you wrote.
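To see the derivation in action, one can simulate the binomial model and check that the empirical standard deviation of $\hat p_1 - \hat p_2$ matches the formula. A minimal sketch in Python, assuming hypothetical true proportions $p_1 = 0.8$ and $p_2 = 0.6$ (chosen only for illustration):

```python
import random
from math import sqrt

random.seed(1)       # for reproducibility
n1, n2 = 78, 82
p1, p2 = 0.8, 0.6    # hypothetical "true" proportions, for illustration only

# Simulate many replications of the survey under the binomial model
diffs = []
for _ in range(20_000):
    x1 = sum(random.random() < p1 for _ in range(n1))  # X1 ~ Binomial(n1, p1)
    x2 = sum(random.random() < p2 for _ in range(n2))  # X2 ~ Binomial(n2, p2)
    diffs.append(x1 / n1 - x2 / n2)

mean = sum(diffs) / len(diffs)
empirical_sd = sqrt(sum((d - mean) ** 2 for d in diffs) / len(diffs))
theoretical_sd = sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
print(empirical_sd, theoretical_sd)  # the two should agree closely
```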
However, if we want to perform a hypothesis test rather than calculate an interval estimate, say

$$H_0: p_1 = p_2 \quad \text{vs.} \quad H_a: p_1 \neq p_2,$$
then the standard error calculation is different. Under the null hypothesis, the estimate $\hat p_1 - \hat p_2$ is built from binomial random variables sharing a common success probability, since the meaning of $H_0$ is that the two parameters $p_1$ and $p_2$ have the same value, and you want to leverage the additional information this assumption provides when calculating the test statistic. If $H_0$ is true, then
$$\operatorname{Var}\!\left[\frac{X_1}{n_1} - \frac{X_2}{n_2} \,\middle|\, H_0\right] = \frac{p_{\text{pooled}} (1 - p_{\text{pooled}})}{n_1} + \frac{p_{\text{pooled}} (1 - p_{\text{pooled}})}{n_2} = p_{\text{pooled}} (1 - p_{\text{pooled}}) \left(\frac{1}{n_1} + \frac{1}{n_2}\right),$$
where $p_{\text{pooled}}$ is the pooled probability, across both methods, that a rating is "fair." But since this value is unknown (being a parameter), we must estimate it. How? The natural choice is

$$\hat p_{\text{pooled}} = \frac{x_1 + x_2}{n_1 + n_2},$$
because if $X_1$ and $X_2$ are binomial with parameters $(n_1, p_{\text{pooled}})$ and $(n_2, p_{\text{pooled}})$, then $X_1 + X_2 \sim \operatorname{Binomial}(n_1 + n_2, p_{\text{pooled}})$. This formula adjusts for unequal group sizes; e.g., if $n_1$ is very large compared to $n_2$, then the information gathered from the larger group is given more weight when estimating the common proportion.
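One way to see this weighting explicitly: $\hat p_{\text{pooled}}$ is exactly the sample-size-weighted average of the two group proportions. A quick check with the numbers from the question (the variable names are mine):

```python
x1, n1 = 63, 78   # Method 1
x2, n2 = 49, 82   # Method 2
p1_hat, p2_hat = x1 / n1, x2 / n2

p_pooled = (x1 + x2) / (n1 + n2)
weighted = (n1 * p1_hat + n2 * p2_hat) / (n1 + n2)  # weighted-average form
print(p_pooled, weighted)  # both are 0.70 (up to floating-point rounding)
```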
Finally, regarding the formula $p = (p_1 + p_2)/2$ that you wrote: this simple average agrees with the pooled estimate only when the sample sizes in the two groups are equal.
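To confirm that last point: with equal group sizes the simple average coincides with the pooled estimate, but with the unequal sizes in your question it does not. A small check (the equal-size numbers are hypothetical):

```python
from math import isclose

# Equal group sizes: simple average equals the pooled estimate
x1, n1, x2, n2 = 63, 80, 49, 80
assert isclose((x1 / n1 + x2 / n2) / 2, (x1 + x2) / (n1 + n2))

# Unequal group sizes (the question's data): the two differ
x1, n1, x2, n2 = 63, 78, 49, 82
print((x1 / n1 + x2 / n2) / 2)   # ~0.7026
print((x1 + x2) / (n1 + n2))     # 0.70
```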
