First of all, what you call a "Z-score" is not actually a z-score. Rather, it is a standard error. This quantity characterizes the variability in the point estimate, which in this case is $\hat p_1 - \hat p_2$, the difference in sample proportions representing the difference in fairness rates of the two methods.
Suppose we have $n_1$ and $n_2$ ratings for each of the two methods, respectively, and of these, $X_1$ and $X_2$ ratings are "fair." Then the distribution of the number of fair responses for each method can be modeled with a binomial random variable, say
$$X_1 \sim \operatorname{Binomial}(n_1, p_1), \quad X_2 \sim \operatorname{Binomial}(n_2, p_2),$$
where $p_1$ and $p_2$ represent the true (but unknown) proportion of employees rating the respective method as fair. To test whether both methods have equal fairness ratings, we estimate the difference in these proportions, $p_1 - p_2$. A natural choice for point estimate is simply the difference in observed proportions; i.e.,
$$\hat p_1 - \hat p_2 = \frac{X_1}{n_1} - \frac{X_2}{n_2}.$$
But such a point estimate does not tell us anything about how certain we can be about its value; e.g., if $n_1$ and $n_2$ were small, we could not expect as much precision or confidence in the resulting estimate as if $n_1$ and $n_2$ were large.
To quantify the uncertainty as a function of the amount of data we observe--i.e., the sample sizes--we need to characterize the variance of the estimator $\hat p_1 - \hat p_2$. This is not mathematically difficult, but the answer depends on the model assumptions. For instance, the first assumption that we made was that the number of fair ratings within each group is binomially distributed. But there is no requirement that this is so; e.g., you may have sampled a non-independent cohort of employees, where each one that you surveyed told a coworker how they responded and possibly influenced other respondents' ratings. Or maybe you did not choose a representative cohort, and different groups of employees have different perspectives on the fairness of each method.
Additionally, your experimental design may influence the variability of your estimate: while not suggested by the data provided in your question, you could have collected a paired data set in which each surveyed employee is asked to rate both methods.
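That said, suppose that employees don't speak to each other about the survey, that each is asked about only one of the two methods, that the method each is asked to rate is assigned at random and independently of the choice of employee, and that each employee is chosen completely at random and is representative of the whole population. Then the binomial model is reasonable, and we have
$$\operatorname{Var}[\hat p_1] = \operatorname{Var}\!\left[\frac{X_1}{n_1}\right] = \frac{\operatorname{Var}[X_1]}{n_1^2} = \frac{n_1 p_1 (1 - p_1)}{n_1^2} = \frac{p_1 (1 - p_1)}{n_1},$$
and similarly,
$$\operatorname{Var}[\hat p_2] = \frac{p_2 (1 - p_2)}{n_2}.$$
Then, because the two groups are independent, the variance of the difference is
$$\operatorname{Var}[\hat p_1 - \hat p_2] = \operatorname{Var}[\hat p_1] + \operatorname{Var}[\hat p_2] = \frac{p_1 (1 - p_1)}{n_1} + \frac{p_2 (1 - p_2)}{n_2}.$$
Replacing the parameters with their point estimates, we obtain
$$\widehat{\operatorname{Var}}[\hat p_1 - \hat p_2] = \frac{\hat p_1 (1 - \hat p_1)}{n_1} + \frac{\hat p_2 (1 - \hat p_2)}{n_2}.$$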
This is the square of the standard error of the point estimate, and is how the first formula you wrote is derived.
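As a rough numerical illustration of the unpooled standard error, here is a minimal Python sketch. The counts (45 of 100 and 30 of 100 "fair" ratings) are hypothetical and are not taken from your data, the function name `unpooled_standard_error` is made up for this example, and the interval at the end is the usual large-sample (Wald) approximation, which is an added assumption.

```python
import math


def unpooled_standard_error(x1, n1, x2, n2):
    """Standard error of p1_hat - p2_hat, without assuming p1 == p2."""
    p1_hat = x1 / n1
    p2_hat = x2 / n2
    # Sum of the two estimated binomial-proportion variances
    variance = p1_hat * (1 - p1_hat) / n1 + p2_hat * (1 - p2_hat) / n2
    return math.sqrt(variance)


# Hypothetical counts, chosen only for illustration
x1, n1 = 45, 100  # "fair" ratings and total ratings for method 1
x2, n2 = 30, 100  # "fair" ratings and total ratings for method 2

diff = x1 / n1 - x2 / n2
se = unpooled_standard_error(x1, n1, x2, n2)

# Approximate 95% Wald interval for p1 - p2 (normal approximation)
z_crit = 1.96
ci = (diff - z_crit * se, diff + z_crit * se)
print(f"difference = {diff:.4f}, SE = {se:.4f}, 95% CI = ({ci[0]:.4f}, {ci[1]:.4f})")
```

Note that the standard error shrinks as either sample size grows, which is exactly the dependence on the amount of data described above.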
However, if we want to perform a hypothesis test rather than calculate an interval estimate, say
$$H_0 : p_1 = p_2 \quad \text{vs.} \quad H_a : p_1 \ne p_2,$$
then the standard error calculation is different, because the estimate under the null hypothesis is related to a binomial random variable with common success probability parameter, since the meaning of $H_0$ is that the two parameters $p_1$ and $p_2$ have the same value. In this case, you wish to leverage the additional information this assumption provides when calculating the test statistic. If $H_0$ is true, then
$$\operatorname{Var}[\hat p_1 - \hat p_2 \mid H_0] = \frac{p (1 - p)}{n_1} + \frac{p (1 - p)}{n_2} = p (1 - p) \left( \frac{1}{n_1} + \frac{1}{n_2} \right),$$
where $p = p_1 = p_2$ is the pooled fairness rating chance across both methods. But since this value is unknown (being a parameter), we must estimate it. How? The natural choice is
$$\hat p = \frac{X_1 + X_2}{n_1 + n_2},$$
because if $X_1$ and $X_2$ are independent binomial with parameters $(n_1, p)$ and $(n_2, p)$, then $X_1 + X_2 \sim \operatorname{Binomial}(n_1 + n_2, p)$. This formula adjusts for unequal group sizes; e.g., if $n_1$ is very large compared to $n_2$, then the information gathered from the larger group is given more weight when estimating the common proportion.
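To make the pooled calculation concrete, here is a minimal Python sketch of the resulting test statistic. The counts are hypothetical (not from your data), the function name `pooled_z_test` is invented for this example, and the two-sided p-value relies on the normal approximation to the null distribution, which is an assumption that is reasonable only when both samples are fairly large.

```python
import math


def pooled_z_test(x1, n1, x2, n2):
    """Two-sample test of H0: p1 == p2 using the pooled proportion estimate."""
    # Pooled estimate of the common proportion under H0
    p_pooled = (x1 + x2) / (n1 + n2)
    # Standard error of p1_hat - p2_hat when both groups share p_pooled
    se_pooled = math.sqrt(p_pooled * (1 - p_pooled) * (1 / n1 + 1 / n2))
    z = (x1 / n1 - x2 / n2) / se_pooled
    # Two-sided p-value under the standard normal approximation:
    # 2 * (1 - Phi(|z|)) equals erfc(|z| / sqrt(2))
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return p_pooled, z, p_value


# Hypothetical counts, chosen only for illustration
p_pooled, z, p_value = pooled_z_test(45, 100, 30, 100)
print(f"pooled p = {p_pooled:.4f}, z = {z:.4f}, two-sided p-value = {p_value:.4f}")
```

Here the ratio of the observed difference to the pooled standard error is the actual z-score; the standard error on its own is only the denominator.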
Finally, regarding the formula that you wrote, this only applies if the sample sizes in each group are equal.