Determining sample size of a set of boolean data where the probability is not 50%
atgnybo4fq
2022-11-04
I'll lay out the problem as a simplified puzzle version of what I am actually attempting to calculate. I imagine some of this will seem fairly straightforward to many of you, but I'm starting to get a bit lost in my own head while trying to think through the problem.
Let's say I roll a 1000-sided die until it lands on the number 1, and suppose it took me 700 rolls to get there. I want to prove that the first 699 rolls were not 1. Obviously, the only way to do this deterministically is to include all 699 failures as part of the result, showing that each was in fact "not 1".
However, that's a lot of data to include: all 700 rolls. Therefore, I want to probabilistically demonstrate that I rolled 699 "not 1s" prior to rolling a 1. To do this, I decide I will randomly sample my "not 1" rolls to reduce the set to a statistically significant, yet more wieldy, number. It will be good enough to demonstrate that I very probably did not roll a 1 prior to roll 700.
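To make this concrete, here is a minimal Python sketch of the experiment and the sampling scheme I have in mind (the function names are just illustrative, and the sample size of 193 is derived in the calculation further below):

```python
import random

def roll_until_one(sides=1000):
    """Roll a fair `sides`-sided die until it lands on 1; return every roll."""
    rolls = []
    while True:
        rolls.append(random.randint(1, sides))
        if rolls[-1] == 1:
            return rolls

rolls = roll_until_one()
failures = rolls[:-1]               # every roll before the final 1

# Publish a random sample of the failures instead of all of them.
k = min(193, len(failures))         # 193 comes from the calculation below
sample = random.sample(failures, k)
assert all(r != 1 for r in sample)  # what the skeptical third party would verify
```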
Here are my current assumptions about the state of this problem:
- My initial experiment of rolling until success follows a geometric distribution (see the worked probability below this list).
- However, my goal for this problem is to demonstrate to a third party that I am not lying. The skeptical third party is therefore not concerned with the geometric distribution and would view this simply as a binomial distribution problem.
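For reference, the geometric model assigns this exact outcome (699 failures followed by one success) the probability

$$P(\text{first 1 on roll } 700) = \left(\frac{999}{1000}\right)^{699} \cdot \frac{1}{1000} \approx 4.97 \times 10^{-4}.$$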
A lot of sample size calculators exist on the web. They are all based around binomial distribution from what I can tell. So here's the formula I am considering:

$$n = \frac{\dfrac{Z^2\,p(1-p)}{\mathrm{MOE}^2}}{1 + \dfrac{Z^2\,p(1-p)}{\mathrm{MOE}^2\,N}}$$

where

- $n$ is the sample size
- $N$ is the population size
- $Z$ is the critical value ($99\%$ confidence gives $Z = 2.576$)
- $p$ is the sample proportion
- $\mathrm{MOE}$ is the margin of error
As an aside, the website where I got this formula says it implements a "finite population correction". Is this desirable for my requirements?
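Here is a minimal Python sketch of the calculation as I understand it, assuming the calculator implements the common Cochran-style formula above with the correction $n_0/(1 + n_0/N)$; the exact variant a given website uses may differ slightly, and the function name is mine:

```python
from math import ceil

def sample_size(N, Z, p, moe, fpc=True):
    """Binomial sample size, optionally with the finite population correction."""
    n0 = Z**2 * p * (1 - p) / moe**2  # infinite-population (Cochran) size
    if not fpc:
        return ceil(n0)
    return ceil(n0 / (1 + n0 / N))    # correction shrinks n0 toward the finite N
```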
Here is the math executed on my above numbers. I will use $p = 0.001$ (the probability of rolling a 1) for the sample proportion, $Z = 2.576$, and $\mathrm{MOE} = 0.005$. As stated above, $N = 699$, on account of there being 699 failure cases that I would like to sample with a certain level of confidence.
Based on my understanding, what this math will do is recommend a sample size that will show, with 99% confidence, that the sample result is within 0.5 percentage points of reality.
Doing the math, $n_0 = \frac{2.576^2 \cdot 0.001 \cdot 0.999}{0.005^2} \approx 265.2$ and $n = \frac{265.2}{1 + 265.2/699} \approx 192.2$, implying that I can use a sample size of 193 to fulfill this confidence level and interval.
My main question is whether my assumption of $p = 0.001$ is valid. If it's not, and I use the conservative $p = 0.5$, then my sample size shoots up to $n \approx 692$, nearly the whole population. So I would like to know if my understanding of what the sample proportion actually represents here is correct.
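Checking both scenarios with the sketch above (same assumptions):

```python
print(sample_size(N=699, Z=2.576, p=0.001, moe=0.005))  # -> 193
print(sample_size(N=699, Z=2.576, p=0.5,   moe=0.005))  # -> 692
```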
More broadly, am I on the right track at all with this? From my attempt at demonstrating this probabilistically to my current thought process, is any of this accurate?