Chapter 8: Statistical Inference: Confidence Intervals

8.1 What are Point and Interval Estimates of Population Parameters?

When we first began our discussion of statistics, we mentioned that there are two branches of statistics: descriptive and inferential. The inferential branch uses sample information to draw conclusions about the population. One of the most common uses of the inferential branch is to use sample statistics, such as x̄, to estimate population parameters, such as µ.

It makes sense that if we take a large enough sample, x̄ should be pretty close to the actual value of µ. But the chances are pretty small that x̄ turns out to be exactly µ. With a good sample it will hopefully be close, but µ is unknown, so we do not know how close our sample mean actually is.

Definition
We say that x̄ is a point estimate of µ because it is a good starting point to estimate the unknown population mean. Similarly, the sample proportion p̂ is a point estimate of p because it is a reasonable starting point to estimate the population proportion, which is also unknown.

The key here is that sample statistics estimate population parameters. For example, x̄ is a point estimate of µ and p̂ is a point estimate of p.

It is a good start to say that x̄ is a reasonable estimate of µ, but as discussed it would be naïve to declare that x̄ is all we need. A better approach is to report a range of likely values for the population mean. That is, we report an interval estimate: an interval of numeric values, centered around x̄. We can then say that µ most likely lies somewhere inside this interval.

Example
Suppose you were asked to estimate the average age of all the students in our class. You might survey 10 students and find their average age to be 20. This sample mean, x̄ = 20, would be a point estimate of µ. However, you could also express your guess by giving a range of ages centered around your sample mean.
So your guess could be 20, give or take 2 years. This “give or take 2 years” part is what we call the margin of error, which we will talk about more later. Mathematically, your guess would be 20 +/- 2, which is the interval estimate. Put another way, the average age of all students is most likely somewhere between 18 and 22.

Suppose you were then asked how confident you were that µ, the mean age of all students, was within your interval estimate of 18 to 22 years old. You might say “I am 95% confident that the mean age of all students is within 18 to 22 years old.”

Definition
The margin of error is the number that we add to (and subtract from) the point estimate in order to form the upper and lower limits, respectively. It is half the width of the interval.

In the previous example...
Point estimate = 20
Margin of error = 2
Width of the interval = 22 - 18 = 4 (twice the margin of error)
Lower limit = 20 - 2 = 18
Upper limit = 20 + 2 = 22
Confidence = 95%

In statistics, we construct intervals for the population mean that are centered around an estimate. This estimate is x̄, the sample mean. Since we can’t get the full population mean, we go for the next best thing: we take a sample and calculate a sample mean. What we add to and subtract from the sample mean to get the interval estimate is the margin of error. When we construct these interval estimates, we call them confidence intervals.

So a confidence interval for a parameter consists of an interval of numbers, and this interval is our point estimate +/- our margin of error. In our example above, the interval was 20 +/- 2, or in other words, 18 to 22. Here 20 is our point estimate and 2 is our margin of error.

We call the value we obtain when we take the point estimate minus the margin of error, in our example 20 - 2 or 18, the lower limit or lower bound.
And we call the value we obtain when we take the point estimate plus the margin of error, in our example 20 + 2 or 22, the upper limit or upper bound.

You will also see confidence intervals written with the lower and upper limits in parentheses. In our example above, the confidence interval may be written as (18, 22).

It is also important to note the level of confidence of a confidence interval. In our example, we were 95% confident that the mean age of all students in the class was somewhere in our interval. The level of confidence describes how often the method produces an interval that contains the population parameter, in this case µ. We will see in examples that as we increase our level of confidence, we get wider and wider intervals.

We will be constructing two different types of confidence intervals:
1. In Section 8.2, we will be calculating the confidence interval for the population proportion, p.
2. In Section 8.3, we will be calculating the confidence interval for the population mean, µ (like our classroom age example).

Before we get to these sections, let’s make sure we understand the terms in the following example. In this example, the confidence interval has already been constructed for us. In Sections 8.2 and 8.3, we will actually learn how to construct these confidence intervals.

Example
Suppose a farmer is trying to estimate the average number of peaches per tree in his orchard. He does not want to count every peach on every tree, so he takes a random sample of a few trees and calculates a 95% confidence interval based on the sample. That 95% confidence interval for the mean yield of a new variety of peaches in the orchard is 112 to 148 peaches per tree.

This means that we are 95% confident that the population mean, µ, for the number of peaches per tree is somewhere between 112 and 148 peaches per tree.

What is the lower limit?
What is the upper limit?
What is the level of confidence?
What is the sample mean, x̄?
*Remember, the sample mean is always the middle of the confidence interval. The sample mean, x̄, will always be on the confidence interval, but the population mean, µ, may or may not be on the confidence interval.*
What is the margin of error?
What is the width of the confidence interval?

8.2 How Can We Construct a Confidence Interval to Estimate a Population Proportion?

Recall from Section 8.1 that confidence intervals can be written in the general format: point estimate +/- margin of error. The point estimate and margin of error change depending on what parameter is being estimated. For example, we looked at an example of a confidence interval for µ, so our point estimate was x̄.

Now we will consider the format of the confidence interval for the population proportion, p. The point estimate for this type of confidence interval is the sample proportion, p̂ = x/n, where x is the number of individuals in the sample with the desired characteristic and n is the sample size.

So we know what goes before the +/-, the point estimate, and we can calculate that easily. Now we need to know how to calculate what goes after the +/-, the margin of error.

First, the basic formula for a confidence interval is

point estimate ± margin of error

If we have these two numbers given up front, that’s all we need to construct the confidence interval.

Example
Suppose we took a sample of 60 students and asked them whether they had ever sent a text message during a class. 44 of them said yes, and the margin of error was 11 percentage points.

Is this a proportions or means problem?
Point estimate =
Margin of error =
Confidence Interval =

The margin of error will always be a multiple of the standard error. In Section 8.2, we discuss confidence intervals for population proportions, so the standard error will be:

$\sqrt{\dfrac{\hat{p}(1-\hat{p})}{n}}$

Why is p̂ now in the formula and not p, like it was in Chapter 7?
This is because, when we dealt with sampling distribution questions, the population proportion was given to us, so we were able to get the exact standard error then. But in reality, the true proportion will not be known (if it were, we would not need to estimate it!). Hence, the standard error cannot be computed exactly. The next best thing we can do is replace p with p̂ in the equation.

So the margin of error will always be some number times the standard error we see above. The number we multiply the standard error by to get the margin of error depends TOTALLY on the level of confidence.

The general formula for a confidence interval for the population proportion is:

$\hat{p} \pm z\sqrt{\dfrac{\hat{p}(1-\hat{p})}{n}}$

You can see that the margin of error is this “Z” value times the standard error. Later on in this chapter, we will see how to get this Z value, which depends TOTALLY on our level of confidence: how confident we want to be that the population proportion is on our interval.

Summary
For a proportions confidence interval, we have the following:

Point Estimate = $\hat{p}$
Standard Error = $\sqrt{\hat{p}(1-\hat{p})/n}$
Margin of Error = $z\sqrt{\hat{p}(1-\hat{p})/n}$
Confidence Interval = $\hat{p} \pm z\sqrt{\hat{p}(1-\hat{p})/n}$

For now, we will just focus on 95% confidence intervals, as this is the most common type of interval. For 95%, it can be shown that z = 1.96.

For a 95% confidence interval, the margin of error is

$1.96 \times \sqrt{\dfrac{\hat{p}(1-\hat{p})}{n}}$

So the formula for a complete 95% confidence interval, point estimate +/- the margin of error, equals

$\hat{p} \pm 1.96 \times \sqrt{\dfrac{\hat{p}(1-\hat{p})}{n}}$

The lower limit is: $\hat{p} - 1.96 \times \sqrt{\hat{p}(1-\hat{p})/n}$
The upper limit is: $\hat{p} + 1.96 \times \sqrt{\hat{p}(1-\hat{p})/n}$

Think way back to the Empirical Rule and we can see why this “1.96” makes sense. Using the Empirical Rule, we said approximately 95% of the values are within two standard deviations (or standard errors) of the center. Now, we are starting with p̂, and we want to add and subtract something from that value to get an interval, and we want to be 95% confident that the interval contains p, the true population proportion.
So, using the same Empirical Rule logic, if we start with p̂ and add and subtract close to 2 standard errors (1.96 to be exact), it makes sense that we are going to be 95% confident that the value of p will be within that interval.

Example
We asked n = 1154 Americans “Would you be willing to pay $6 per gallon of gas?”. In our random sample, 518 said they would be willing to pay $6 per gallon of gas.

a) Find a 95% confidence interval for the population proportion of Americans willing to pay $6 per gallon of gas.

First, we need our sample proportion.

$\hat{p} = \dfrac{\text{number who said yes}}{\text{sample size}} = \dfrac{518}{1154} = .44887$

Next we need the standard error.

$s.e. = \sqrt{\dfrac{\hat{p}(1-\hat{p})}{n}} = \sqrt{\dfrac{.44887(1-.44887)}{1154}} = .01464$

The 95% confidence interval:

$\hat{p} \pm 1.96 \times \sqrt{\dfrac{\hat{p}(1-\hat{p})}{n}}$

So the interval is .44887 +/- 1.96 × .01464 = .44887 +/- .02870
So the lower limit is .44887 - .02870 = .42017
And the upper limit is .44887 + .02870 = .47757
So our 95% confidence interval is (.42017, .47757)

b) Interpret the interval.

We are 95% confident that the proportion of ALL Americans that are willing to pay $6 per gallon of gas is somewhere between .42017 and .47757, in other words, between 42.017% and 47.757%.

c) EXTRA QUESTION: Does it appear likely that 50% of ALL Americans are willing to pay $6 per gallon of gas?

No. 50%, or .50, is not on our interval, so it does not appear likely that 50% of all Americans are willing to pay $6 per gallon of gas. Our interval runs from .42017 to .47757, so it appears that less than 50% of ALL Americans are willing to pay $6 per gallon of gas.

Consider the summary number line below. The middle region is the range of all plausible values that p could take. So anything outside this region (less than .42017, or more than .47757) is an unlikely value for the population proportion. As you can see, .50 falls in the unlikely region.
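The arithmetic in the gas example can be double-checked with a few lines of Python. This is only a sketch of the same hand calculation, not a replacement for it; the variable names are illustrative, and z = 1.96 is the rounded 95% multiplier quoted in these notes.

```python
import math

# Numbers from the gas example: 518 "yes" answers out of n = 1154.
n = 1154
yes = 518
z = 1.96  # rounded 95% multiplier used in the notes

p_hat = yes / n                                      # point estimate
standard_error = math.sqrt(p_hat * (1 - p_hat) / n)  # estimated standard error
margin_of_error = z * standard_error
lower = p_hat - margin_of_error
upper = p_hat + margin_of_error

# Part (c): is .50 a plausible value, i.e. is it inside the interval?
half_is_plausible = lower <= 0.50 <= upper

print(f"p-hat           = {p_hat:.5f}")
print(f"standard error  = {standard_error:.5f}")
print(f"margin of error = {margin_of_error:.5f}")
print(f"95% CI          = ({lower:.5f}, {upper:.5f})")
print(f".50 inside the interval? {half_is_plausible}")
```

Because the code carries full precision instead of rounding at each step, the last decimal place may differ slightly from the hand-rounded values.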
Assumptions for a Confidence Interval for a Proportion:

For these confidence intervals to be valid, we need to check some requirements, as we did back when we were determining the sampling distribution for the sample proportion. The following three things must be true:

(1) $n\hat{p} \ge 15$
(2) $n(1-\hat{p}) \ge 15$
(3) The sample must be random.

Again, we use p̂ rather than p because p is unknown. The next best thing is to use the point estimate instead.

Check that this is true in the above gas example:

Interpretation of a Confidence Interval

As mentioned in the earlier example, a 95% confidence interval means we are 95% confident that the population proportion lies somewhere inside our interval of values. We can’t pin it down more accurately without having a larger sample size, but we can assert with 95% confidence that it falls somewhere inside the interval we found.

Be careful here: it is tempting to say “the population mean lies in the interval 95% of the time.” This is not the correct way to interpret an interval. The reason is that the population mean, while unknown, is fixed. What changes from sample to sample is the confidence interval. That is, the population mean will always be the same, but when we draw many samples from the population, we will get different confidence intervals. Most of them will contain p or µ, but there will be an unlucky few that by chance miss it.

How many intervals will capture p or µ? That depends on the level of confidence. As an example, suppose we draw 1000 samples and compute an interval for each. Also suppose we are using 95% confidence. Then we can expect about 95% of these 1000 intervals to contain p or µ. That number is

1000 × .95 = 950

Of course, there is no reason to suppose this number is exact; maybe 948 of them will contain it. The 950 is only an approximation. So we could say that roughly 945 to 955 of the intervals are expected to contain p or µ, and the rest are not.
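The long-run idea can be tried out with a small simulation. The sketch below invents a population in which the true proportion is known (P_TRUE = 0.45 and a sample size of 500 are assumptions made up purely for illustration; in practice p is unknown), draws 1000 random samples, builds a 95% interval from each, and counts how many intervals contain the true p.

```python
import math
import random

random.seed(8)  # fixed seed so the run is repeatable

# Hypothetical settings, chosen only for this illustration.
P_TRUE = 0.45        # "true" population proportion (unknown in real life)
SAMPLE_SIZE = 500
NUM_SAMPLES = 1000
Z = 1.96             # 95% multiplier from the notes

covered = 0
for _ in range(NUM_SAMPLES):
    # Draw one random sample and count the "yes" answers.
    yes = sum(1 for _ in range(SAMPLE_SIZE) if random.random() < P_TRUE)
    p_hat = yes / SAMPLE_SIZE
    standard_error = math.sqrt(p_hat * (1 - p_hat) / SAMPLE_SIZE)
    lower = p_hat - Z * standard_error
    upper = p_hat + Z * standard_error
    # The parameter stays fixed; the interval is what changes.
    if lower <= P_TRUE <= upper:
        covered += 1

print(f"{covered} of {NUM_SAMPLES} intervals contain the true proportion")
```

The count should land near 950 but will rarely equal it exactly, which is the point made above: the parameter is fixed, and it is the intervals that vary from sample to sample.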
Now suppose we have 1000 99% intervals. About how many would you expect to contain p or µ?

Here is an illustration of what it means for a population mean to fall inside / outside an interval:

Interpretation Summary of a 95% Confidence Interval

(1) We are 95% confident the true value for the population mean / proportion falls somewhere inside our interval.
(2) Long-run interpretation: if we construct a large number of 95% confidence intervals, approximately 95% of them will contain p or µ (and about 5% will miss it). (cf. Law of Large Numbers)

Example
We take a survey of 884 people at random and ask each of them whether they secretly listen to “bubblegum dance” music. 221 of them confess that they do.

(a) Obtain a point estimate for the population proportion of people who listen to this genre.
(b) Verify that the requirements for constructing a confidence interval about p are satisfied.
(c) Construct a 95% CI for the population proportion of people that enjoy bubblegum dance.
(d) Interpret this interval.

How can we use a Confidence Level Other Than 95%?

So far we have just been creating 95% confidence intervals, so our margin of error has been 1.96 × (standard error). But where does this 1.96 come from? And what if we want something other than a 95% confidence interval?

We can never have a useful 100% confidence interval, because we can never be 100% sure that the population proportion is within the interval if we don’t know it. As an example, a “true” 100% interval would be saying that the population proportion is somewhere between 0 and 1. That is certainly an absolutely correct statement, but it is rubbish. It tells us nothing useful! That’s why we don’t use it.

Here is how we get the 1.96 for a 95% confidence interval:

First, when you think of a 95% confidence interval, think of a normal curve with 95% shaded in the middle like this:

If .95 is in the middle, then what’s the area in each tail?
1 - .95 = .05 and .05/2 = .025, so .025 is in each of the tails. (cf. symmetry)

Now, put .025 as the little tail area to the right in your StatCrunch calculator with mean = 0 and standard deviation = 1, hit Compute and you get 1.96!

This 1.96 is the Z-score that matches up with a 95% confidence interval. This is why it was important for us to find those probabilities involving Z-scores before.

NB: For a 95% confidence interval, we can get away with using the rounded number 1.96. But for the other levels, you have to use at least 5 decimals.

This Z-score value is what we now multiply the standard error by to get the margin of error for a confidence interval, and it will always be a positive Z-score.

Let’s work through it again and see what the Z-score would be for a 90% confidence interval.

First, draw the curve with .90 in the middle and find the area of both tails:

Then put the area of the tail in the StatCrunch Normal Calculator with mean = 0 and standard deviation = 1:

The Z-score you get is 1.64485.

So to get the margin of error for a 90% confidence interval you multiply the standard error by 1.64485. Let’s use this in the following example.

Example
A study of 70 randomly selected people in Athens was conducted to estimate the proportion of Athens residents that have sung karaoke. The study revealed that 42 of the 70 people have sung karaoke.

a) Obtain a point estimate for the population proportion of “karaoke singers” in Athens.
b) Verify that the requirements for constructing a confidence interval about p are satisfied.
c) Construct a 90% confidence interval for the proportion of Athens residents who have sung karaoke.
d) Interpret the confidence interval.

Now, using the same example as above, construct a 99% confidence interval. Let’s see how the interval changes if we increase the confidence level.
We have the point estimate and the standard error, so we just need the new Z-score for this confidence interval:

First, draw the curve with .99 in the middle and find the area of both tails:

Then put the area of the tail in the StatCrunch Normal Calculator with mean = 0 and standard deviation = 1:

The Z-score =

Now create the 99% confidence interval:

Notice that the 99% confidence interval is wider than the 90% confidence interval. In this example, we saw that...

(1) As the level of confidence increases, the margin of error increases and the confidence interval gets wider.
(2) As the level of confidence decreases, the margin of error decreases and the confidence interval gets narrower.

This applies to all confidence intervals, like in the picture below:

Why is this true? With a 95% confidence interval, we want to be 95% confident that the population parameter is on the interval. But with a 99% confidence interval, we want to be even more confident (99% confident) that the population parameter is on the interval. So to be that much more certain the proportion is on the interval, we need a wider interval.

We can also see it mathematically. Note that in the margin of error formula, the standard error is multiplied by z. Thus, if you increase z, the overall margin of error must increase as well:

$z\sqrt{\dfrac{\hat{p}(1-\hat{p})}{n}}$

We have seen what happens when we change the confidence level. But what if we change the sample size? It turns out that the following is true:

(1) As the sample size increases, the margin of error decreases and the confidence interval gets narrower.
(2) As the sample size decreases, the margin of error increases and the confidence interval gets wider.

So the opposite happens when we increase the sample size: the confidence interval gets narrower.

Why is this true? As we increase our sample size, the sample statistic we obtain (whether we are looking for a mean or a proportion) is a better representation of the population.
So as we increase our sample size, our point estimate is a better and better estimate, and we don’t need such a wide confidence interval. That is, our error will be smaller, so we don’t go out quite as far on both sides.

We can also argue why mathematically. Notice that the sample size n is in the denominator of the margin of error. When the denominator gets larger, the overall number gets smaller.

$z\sqrt{\dfrac{\hat{p}(1-\hat{p})}{n}}$

Recap
The following symbols go along with the following terms when calculating the confidence interval for the population proportion:

Term                  Symbol
Point Estimate        $\hat{p}$
Standard Error        $\sqrt{\hat{p}(1-\hat{p})/n}$
Margin of Error       $z\sqrt{\hat{p}(1-\hat{p})/n}$
Confidence Interval   $\hat{p} \pm z\sqrt{\hat{p}(1-\hat{p})/n}$

How can StatCrunch calculate these confidence intervals for us?

Think back to our example where we wanted to get a 90% confidence interval for the population proportion of Athens residents that have sung karaoke, earlier in this section. Well, if enough information is given, then StatCrunch will obtain the interval for us! Go to...

Stat > Proportions > One Sample > With Summary

Here we can type in how many Athens residents sang karaoke in our sample. In our sample, 42 out of 70 did. Put those numbers in just like this and hit Next (don’t choose Calculate just yet):

On the next screen choose “Confidence Interval”, and since we want a 90% confidence interval, change the 0.95 to 0.90:

Hit Calculate, and here is what we get:

It tells us the Sample Proportion is .6, which is the point estimate. It is the same thing we got in part (a). It also tells us the lower limit (.50369) and the upper limit (.69631), the same values we calculated! Notice it also gives us the standard error. The only values it does not give us are the margin of error and the Z-score used in the formula, so we still need to know how to get those by hand.

Now get the 99% Confidence Interval and check it against our answers of (0.44917, 0.75083).
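For readers without StatCrunch at hand, the same karaoke intervals can be sketched in Python. This assumes the standard library's statistics.NormalDist plays the role of the Normal calculator for finding the Z-score; the function name proportion_ci is made up for this illustration and is not part of StatCrunch or of these notes.

```python
import math
from statistics import NormalDist


def proportion_ci(successes, n, confidence):
    """Return (z, margin_of_error, lower, upper) for a proportion interval."""
    p_hat = successes / n
    standard_error = math.sqrt(p_hat * (1 - p_hat) / n)
    # Area in one tail, e.g. (1 - .90) / 2 = .05 for a 90% interval.
    tail_area = (1 - confidence) / 2
    # Positive Z-score that leaves tail_area to its right.
    z = NormalDist(mu=0, sigma=1).inv_cdf(1 - tail_area)
    margin_of_error = z * standard_error
    return z, margin_of_error, p_hat - margin_of_error, p_hat + margin_of_error


# Karaoke example: 42 of 70 sampled Athens residents have sung karaoke.
for level in (0.90, 0.99):
    z, moe, lower, upper = proportion_ci(42, 70, level)
    print(f"{level:.0%}: z = {z:.5f}, margin of error = {moe:.5f}, "
          f"interval = ({lower:.5f}, {upper:.5f})")
```

Running it for both levels also shows the margin of error growing as the confidence level rises from 90% to 99%, which is the pattern described earlier in this section.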
Other Important Facts

These have been pointed out directly or indirectly earlier, so let’s go over them again.

First, the sample mean is always inside the confidence interval. This is true because of the way the interval was constructed: we start with the point estimate, then go left and right the same number of units. So, the sample mean / proportion is always in the center of the interval.

Is the population mean / proportion ever inside the interval? As we’ve seen, maybe it is, maybe it isn’t. There is no way of assuring this.

So the point here is, the sample mean (the statistic) is always inside the interval, at the center, but the population mean (the parameter) is in the interval only sometimes.

Example
We have a 95% confidence interval that is (.20, .40).

Can you find the sample proportion? If so, find it.
Can you find the population proportion? If so, find it.
Find the margin of error.
Now suppose another confidence interval is built from a sample of the same size, and this interval is (.23, .37). Is it more likely a 92% or a 98% interval? Why?

Example
In a sample of size 102, 71% of the subjects answered yes in a survey. The standard error is .034. Find a 93% confidence interval for the population proportion.

Section 8.3 How Can we Construct a Confidence Interval to Estimate a Population Mean?

Recall from Section 8.1 that confidence intervals can be written in the general format:

point estimate ± margin of error

Remember the point estimate is a single number that is our “best guess” for the parameter. What single number is the “best guess” for a population mean if we only have a sample from the population? The sample mean. So the sample mean is the point estimate part of the confidence interval formula.

It is the center of the confidence interval, so now we need to know the margin of error. We need to know what to add and subtract from the point estimate to get the lower and upper limits of our confidence interval.
Just like in Section 8.2, the margin of error will be some number times the standard error. But the formula for the standard error when we are talking about means is:

Standard Error = $\dfrac{s}{\sqrt{n}}$, where s = sample standard deviation.

We saw in Chapter 7 that the formula for the standard error was $\sigma/\sqrt{n}$, but we don’t know anything about the population, so we don’t know σ. The next best thing is to use the standard deviation from our sample, s. Compare this to the proportions situation, where we replaced p with p̂ because p was unknown.

So we are this far into our formula for the confidence interval for the population mean:

$\bar{x} \pm (\text{some number}) \cdot \dfrac{s}{\sqrt{n}}$

All we have left to find is the “some number”. We saw with confidence intervals for the population proportion that this “some number” ended up being a Z-score that corresponded with the level of confidence. For confidence intervals for the population mean, the “some number” still corresponds with the level of confidence, but it comes from a new distribution that we call the T-distribution. So if you look under Stat > Calculators you will see a calculator called “T”.

Before we see how to get these “T values”, let’s talk about the properties of this T-distribution, and how the T-distribution or T-curve is different from the normal distribution or normal curve.

Properties of the T-Distribution

1. The T-distribution is centered at 0 and is symmetric about 0 (cf. standard normal).
2. The total area under the curve is 1 (cf. standard normal).
3. The area to the right of 0 is 0.50 and the area to the left of 0 is 0.50 (cf. standard normal).
4. The T-distribution is different for different values of n, our sample size.
5. The T-distribution is leptokurtic: the area in the tails is a little greater than the area in the tails of the normal distribution. To make up for the extra tail area, the T curve’s peak at 0 sits a little lower than that of the standard normal.
6. As the sample size n increases, the T curve looks more and more like the normal curve.
The diagram and words that follow are not essential to remember; rather, they are merely a result of the fact that a certain statistics instructor loves the three words and their Greek origins.

Leptokurtic => λεπτός (thin) + κυρτός (bulging)
Mesokurtic => µέσος (middle)
Platykurtic => πλατύς (flat)

Since the T-distribution looks different for different values of n, we always have to type in what we call the “degrees of freedom” on the T calculator. The degrees of freedom we have to put in the T calculator = n - 1. The degrees of freedom on the T calculator is abbreviated as “DF”. So DF in StatCrunch = n - 1; that is, the sample size minus one.

Try some different DF values in StatCrunch and see how the T-distribution changes for different sample sizes.

Stat > Calculators > T

Try DF = 5. Then try DF = 500; this one looks more like our normal curve.

So our confidence interval formula for the population mean is:

Lower limit: $\bar{x} - t \cdot \dfrac{s}{\sqrt{n}}$
Upper limit: $\bar{x} + t \cdot \dfrac{s}{\sqrt{n}}$

These intervals are valid when...
1. The sample obtained is random, and...
2. One of the following is true:
   a. We are sampling from a normal population, or...
   b. n > 30 (cf. sampling distribution of sample means)

So we can get the sample mean, sample standard deviation and n value, but we haven’t yet talked about what T value we want from the T Calculator. Getting the T value works just like getting the Z value when we were doing confidence intervals for the population proportion in Section 8.2. The only difference is that the T value depends on BOTH the confidence level and the sample size.

Example
Let’s find the T value for a 95% confidence interval if the sample size we used is n = 32.

First, draw a curve with .95 in the middle and find the area of both tails:

Next, put in the right tail area = .025 in the T Calculator and put DF = 32 - 1 = 31.

Hit Compute and you get T = 2.0395. This is the value we will use in the confidence interval formula.
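To see roughly what the T Calculator is doing, here is a stand-in written with only the Python standard library. It is a sketch, not StatCrunch's actual method: the helper names (t_pdf, t_cdf, t_critical) are made up for this illustration, the area under the T curve is approximated with Simpson's rule, and the T value is located by bisection.

```python
import math
from statistics import NormalDist


def t_pdf(x, df):
    """Height of the T curve with df degrees of freedom at x."""
    log_const = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
                 - 0.5 * math.log(df * math.pi))
    return math.exp(log_const - (df + 1) / 2 * math.log1p(x * x / df))


def t_cdf(x, df, steps=2000):
    """Area to the left of x (Simpson's rule from 0 to |x|, then symmetry)."""
    a = abs(x)
    h = a / steps
    total = t_pdf(0.0, df) + t_pdf(a, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    area = total * h / 3
    return 0.5 + area if x >= 0 else 0.5 - area


def t_critical(confidence, df):
    """Positive T value with `confidence` in the middle of the curve."""
    target = 1 - (1 - confidence) / 2   # area to the left of the T value
    low, high = 0.0, 1.0
    while t_cdf(high, df) < target:     # widen the bracket until it is enough
        high *= 2
    for _ in range(60):                 # bisection
        mid = (low + high) / 2
        if t_cdf(mid, df) < target:
            low = mid
        else:
            high = mid
    return (low + high) / 2


print(f"95%, DF = 31: T = {t_critical(0.95, 31):.4f}")
for df in (5, 500):
    print(f"95%, DF = {df}: T = {t_critical(0.95, df):.4f}")
print(f"95%, normal curve: Z = {NormalDist().inv_cdf(0.975):.4f}")
```

For DF = 31 this should agree with the 2.0395 found in the example, and the DF = 500 value should sit much closer to the normal curve's 1.96 than the DF = 5 value does.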
Example (cf. Homework 8.3 - 8.4)
Find the t-score for a 99% confidence interval for a population mean with 5 observations in our sample.

First, draw a curve with .99 in the middle and find the area of both tails:

Next put in the right tail area = .005 in the T Calculator AND put DF = 5 - 1 = 4.

Hit Compute and you get T =

So now we can construct confidence intervals for population means. Here is a quick summary table of the formulas for means problems:

Term                  Symbol
Point Estimate        $\bar{x}$
Standard Error        $s/\sqrt{n}$
Margin of Error       $t \cdot s/\sqrt{n}$
Confidence Interval   $\bar{x} \pm t \cdot s/\sqrt{n}$

Example
In England, a popular silly contest is to get as many people as possible inside what’s known as a phone box (a phone booth). Below are photos of what a phone box looks like as well as an image of one of these contests in action.

Here is a random sample of 7 such contests. The number of people inside the phone box per contest is recorded. We will assume that the number of successful contestants per contest is normally distributed.

5, 7, 10, 8, 9, 12, 9

It is of interest to build a 95% confidence interval for the average number of people a contest can squeeze inside a phone box. So we need to find the lower and upper limits.

Lower limit: $\bar{x} - t \cdot \dfrac{s}{\sqrt{n}}$
Upper limit: $\bar{x} + t \cdot \dfrac{s}{\sqrt{n}}$

Let’s break it down. n = 7 because we have 7 contests.

How do we get x̄ and s? The easiest way is to list our seven counts in StatCrunch and then go to Stat > Summary Stats > Columns.

Therefore x̄ = 8.57143 and s = 2.22539. Be sure not to confuse standard deviation with standard error!

Now finally, we need to get the T score. Draw a curve with .95 in the middle and find the area of both tails:

Next put in the right tail area = .025 in the T Calculator AND put DF = 7 - 1 = 6.

Hit Compute and you get T = 2.44691.
Now that we have everything we need, we can construct the lower and upper limits of the 95% confidence interval:

$\text{Lower Limit} = \bar{x} - t \cdot \dfrac{s}{\sqrt{n}} = 8.57143 - (2.44691)\left(\dfrac{2.22539}{\sqrt{7}}\right) = 6.51329$

$\text{Upper Limit} = \bar{x} + t \cdot \dfrac{s}{\sqrt{n}} = 8.57143 + (2.44691)\left(\dfrac{2.22539}{\sqrt{7}}\right) = 10.62957$

So we are 95% confident that the population average number of people that can fit inside a phone box is between 6.51329 and 10.62957.

Extra Question
According to our confidence interval, is it likely that the population mean number of successful contestants is 12?

Extra Question
According to our confidence interval, is it possible that the population mean number of successful contestants fitting inside a phone box is 7?

So, like before, any value that’s inside the interval is a likely value for the population mean µ. Any number outside the interval (below the lower limit or above the upper limit) is not.

Using StatCrunch

Whenever we have actual data like in the above phone box example, we can put this data into StatCrunch, which will calculate these intervals for us.

First, put the seven numbers in a column in StatCrunch.

Stat > T Statistics > One Sample > with data

Choose the column you have put the data in and hit “Next”.

Choose “Confidence Interval” and type in 0.95.

Hit Calculate and here are our results:

Except for some slight rounding, we get the same limits:

Lower limit of the confidence interval = 6.51328
Upper limit of the confidence interval = 10.62958

Let’s do an example like this where we have to calculate the limits using the summary statistics and not the actual data.

Example
Another Holy Grail example! :-) We want to build a 90% confidence interval for the average airspeed velocity of an unladen swallow, in miles per hour. Suppose a sample of 16 European swallows is studied, and we are told that x̄ = 24 and s = 2.55. Assume the population is normal.

Let’s do this by hand first.
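One way to check a by-hand answer for the swallow example is the short sketch below. The T value of 1.75305 is not given in these notes; it is the standard table value for a right tail area of .05 with DF = 16 - 1 = 15, and it is typed in here as an assumption. The variable names are illustrative only.

```python
import math

# Summary statistics given in the swallow example.
x_bar = 24.0   # sample mean (miles per hour)
s = 2.55       # sample standard deviation
n = 16         # sample size

# Assumed T value: 90% confidence, DF = n - 1 = 15, right tail area .05.
t_value = 1.75305

standard_error = s / math.sqrt(n)        # s / sqrt(n), not s itself
margin_of_error = t_value * standard_error
lower = x_bar - margin_of_error
upper = x_bar + margin_of_error

print(f"standard error  = {standard_error:.5f}")
print(f"margin of error = {margin_of_error:.5f}")
print(f"90% CI          = ({lower:.5f}, {upper:.5f})")
```

The same four lines of arithmetic apply to any means problem stated with summary statistics; only x̄, s, n and the T value change.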
Stat > T Statistics > One Sample > with summary

Put in our sample mean, sample standard deviation and sample size just like this:

Hit Next, and choose a 90% confidence interval. Hit Calculate and here are the results:

These are the same values we calculated by hand!

Example
Suppose we obtain a sample mean of 2000 and a standard error of 300 from a sample of size 9. We want to build a 92% confidence interval for the population mean. The information is given in a form that StatCrunch cannot use directly (we have the standard error rather than the sample standard deviation), so we need to find the interval by hand. Assume a normal population.

What are the two ways we can decrease the width of this confidence interval?

Another study made three different confidence intervals from the same sample (and thus the sample size is the same). These intervals are:

(1400, 1600)
(1300, 1700)
(1470, 1530)

We know that, in some order, these are 89%, 93%, and 97% confidence intervals. Match up each interval with its most likely confidence level.

What is the point estimate in this new sample?

There is one issue that needs to be addressed, which otherwise can lead to some confusion. Let’s head off that confusion right now. When do we use Z, and when do we use T in the confidence interval equations?

The answer is straightforward:
(1) If it’s a proportions problem, you use Z.
(2) If it’s a means problem, you use T.

So, as mentioned earlier, it’s critical to first identify the problem as means or proportions!

Example
Identify the following scenarios as means or proportions:

We want to estimate the average number of hours of TV viewing per week in a city.
In a sample of size 55, 33 people own a dog.
A random sample of two hundred people showed that seventy-one percent of them have had chicken pox.
We sample ten roller coasters at random and calculate their average maximum G force.

Section 8.4: How Do We Choose the Sample Size for a Study?
Sometimes, before setting up an experiment or survey, we know that we want the margin of error to be a certain amount.

Estimating Sample Size for Proportions

Suppose we are trying to get results for an election, and we are going to compute a sample proportion for the proportion of people who will vote for candidate A. Maybe we know that whatever sample proportion we get, we want it to be within 3% of the true population proportion for ALL voters. So we know we want the margin of error to be equal to 3%.

We can use a formula to tell us what sample size we need to take so that our margin of error will be 3%, or in other words, so we can be sure that whatever sample proportion we get will be within 3% of the true population proportion with a certain level of confidence.

Here is that formula for choosing sample size in estimating a population proportion:

$$n = \frac{\hat{p}(1-\hat{p})\,z^2}{m^2}$$

Here $\hat{p}$ is a guess at the value we think we might get for the sample proportion. This is often a sample proportion from a previous study, if one is available. The rule is, if you have no idea what the sample proportion could be, then set $\hat{p}$ = .50.

The m is the margin of error.

The Z-score is calculated again based on the level of confidence, just like with the confidence intervals. It represents how confident we want to be that the sample proportion we get will be that close to the true population proportion.

Why is this the formula? Recall that for proportions, the formula for margin of error was given by the following:

$$m = z\sqrt{\frac{\hat{p}(1-\hat{p})}{n}}$$

But now we are given the margin of error, and we want to know what n is necessary to achieve such a margin of error. That is, we want to solve for n in the above equation.

$$m = z\sqrt{\frac{\hat{p}(1-\hat{p})}{n}} \;\Rightarrow\; \frac{m}{z} = \sqrt{\frac{\hat{p}(1-\hat{p})}{n}} \;\Rightarrow\; \frac{m^2}{z^2} = \frac{\hat{p}(1-\hat{p})}{n} \;\Rightarrow\; n = \frac{\hat{p}(1-\hat{p})\,z^2}{m^2}$$

Why do we choose $\hat{p}$ = .50 when nothing is known about what the sample proportion could be?
The answer is as follows: both m (margin of error) and z (from confidence level) are specified, and so the only piece that’s unknown is $\hat{p}(1-\hat{p})$. We therefore want to assume the “worst-case scenario”, which is taking a larger sample so that we will get within that margin of error, regardless of what sample proportion we end up obtaining. Thus, the idea is to make $\hat{p}(1-\hat{p})$ as large as possible. It can be shown (see the graph below) that choosing $\hat{p}$ = .50 is the choice that maximizes $\hat{p}(1-\hat{p})$.

[Figure: plot of p-hat*(1 - p-hat) against p-hat for p-hat from 0 to 1. Highest Point Occurs When P-Hat = .50, where p-hat*(1 - p-hat) = 0.25.]

Example
Some people enjoy dunking cookies into their cups of tea before eating the cookie. We are going to find a sample proportion of people that are “dunkers.” How large a sample size should we take to estimate the proportion to within 0.03 with probability .95?

What we are saying here is that we want to take a sample and get a sample proportion. We then want to create a 95% confidence interval around that sample proportion, and we want the margin of error for that confidence interval to equal 0.03.

Sample size formula:

$$n = \frac{\hat{p}(1-\hat{p})\,z^2}{m^2}$$

It is very helpful to first label each number.

$\hat{p}$ =
m =
z =

Draw the curve with .95 in the middle and find the area of both tails:

Now we can use the formula.

$$n = \frac{\hat{p}(1-\hat{p})\,z^2}{m^2} = \frac{.50(1-.50)(1.96)^2}{(.03)^2} = 1067.11111 \Rightarrow 1068$$

Thus, in order to be 95% confident we will get a margin of error that’s no greater than .03, we will need to take a sample size of 1068 people.

You’ll notice we rounded the decimal up, even though the nearest whole number was down. Why? Because if we round down, then we will almost but not quite be within a margin of error of .03. But if we take one more person (the 1068th), we will be. So with these formulas, if you get a decimal answer, always round up, since sample size has to be a whole number.
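The dunking calculation can also be checked with a few lines of Python. This is only a minimal sketch using the standard library, not part of the StatCrunch workflow in these notes; the function names `z_for_confidence` and `sample_size_for_proportion` are made up for illustration. It uses the unrounded z-score (about 1.95996) rather than 1.96, which changes the decimal slightly but not the rounded-up answer.

```python
import math
from statistics import NormalDist


def z_for_confidence(confidence):
    """Two-sided z critical value (hypothetical helper), about 1.96 for 0.95."""
    tail = (1 - confidence) / 2          # area in ONE tail
    return NormalDist().inv_cdf(1 - tail)


def sample_size_for_proportion(margin, confidence, p_hat=0.50):
    """n = p_hat * (1 - p_hat) * z^2 / m^2, always rounded UP.

    p_hat defaults to 0.50, the worst case when nothing is known.
    """
    z = z_for_confidence(confidence)
    n = p_hat * (1 - p_hat) * z ** 2 / margin ** 2
    return math.ceil(n)                  # sample size must be a whole number


# Dunking example: margin of error 0.03, 95% confidence, no prior guess.
n_dunkers = sample_size_for_proportion(margin=0.03, confidence=0.95)
print(n_dunkers)  # 1068
```

Passing a different `p_hat` (for instance, from an earlier study) or a different `confidence` shows how the required sample size responds to each input.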
Example
Consider the previous scenario, only this time a similar study conducted in Edinburgh gave a sample estimate of 78% of people dunking cookies in their tea. Using that study as a guideline, find the new sample size required.

What is the advantage to having a good idea what the sample proportion might be?

Now suppose, all other things being equal, we wanted to be 98% confident of our answer, rather than 95%. Will we need to take a larger or smaller sample size?

Estimating Sample Size with Means

What if we are not dealing with a population proportion example, but a population mean example? That is, we want to know what sample size we need so that the sample mean we get is close enough to the true population mean.

For example, maybe we want to estimate the mean income for an entire company. We want to take a sample of their employees, and get a sample mean of their income. And we want this sample mean income to be within $5000 of the entire company’s mean income with 95% confidence.

We can determine what sample size is needed so that whatever sample mean income we get, it will be within $5000 of the population mean income, and we can be 95% confident of that.

Here is the formula we use to determine sample size for estimating the population mean:

$$n = \frac{\sigma^2 z^2}{m^2}$$

where σ is the provided standard deviation (this will be given), m is the margin of error, and Z is obtained just like before.

Why is this the formula? Recall that for means, the formula for margin of error was given by the following:

$$m = t \cdot \frac{s}{\sqrt{n}}$$

But now we are given the margin of error, and we want to know what n is necessary to achieve such a margin of error. Like before, we can solve for n. Notice that since σ is given, we might as well say σ instead of s.

$$m = t \cdot \frac{\sigma}{\sqrt{n}} \;\Rightarrow\; \sqrt{n} = \frac{t\,\sigma}{m} \;\Rightarrow\; n = \frac{\sigma^2 t^2}{m^2}$$

The problem is that the t-score in the last step above depends on degrees of freedom, which depends on sample size...which is what we are trying to estimate.
Thus, the next best thing we can do to get an estimate is to use z instead of t. Thus, the formula to use is

$$n = \frac{\sigma^2 z^2}{m^2}$$

Example
An estimate is needed of the mean height of women in Ontario, Canada. A 95% confidence interval should have a margin of error of 3 inches. A study ten years ago in this province had a standard deviation of 10 inches.

Let’s label the numbers:

σ =
z =
m =

(a) About how large a sample of women is needed?

(b) About how large a sample of women is needed for a 99% confidence interval to have a margin of error of 3 inches?

(c) All other things being equal, what will happen to the required sample size if we only require a margin of error of 4 inches rather than 3?

| Term | Proportions | Means |
| --- | --- | --- |
| Point Estimate | $\hat{p}$ | $\bar{x}$ |
| Standard Error | $\sqrt{\frac{\hat{p}(1-\hat{p})}{n}}$ | $\frac{s}{\sqrt{n}}$ |
| Margin of Error | $z \cdot \sqrt{\frac{\hat{p}(1-\hat{p})}{n}}$ | $t \cdot \frac{s}{\sqrt{n}}$ |
| Letter for Confidence | Z | T |
| Confidence Interval | $\hat{p} \pm z \cdot \sqrt{\frac{\hat{p}(1-\hat{p})}{n}}$ | $\bar{x} \pm t \cdot \frac{s}{\sqrt{n}}$ |
| Estimating Sample Size | $n = \frac{\hat{p}(1-\hat{p})\,z^2}{m^2}$ | $n = \frac{\sigma^2 z^2}{m^2}$ |

Assumptions (Proportions)
(1) Random Sample
(2) $n\hat{p} \geq 15$
(3) $n(1-\hat{p}) \geq 15$

Assumptions (Means)
(1) Random Sample
(2) Either n > 30, or normal population

Chris O'Neal Test 3 Notes (Chapter 8)
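As a rough illustration of the sample-size formula for means, here is a minimal Python sketch using only the standard library. It revisits the company income scenario (margin of error $5000, 95% confidence). The notes do not give a standard deviation for that scenario, so σ = 20000 below is a purely hypothetical value chosen for illustration, and the function name `sample_size_for_mean` is likewise an assumption.

```python
import math
from statistics import NormalDist


def sample_size_for_mean(sigma, margin, confidence):
    """n = sigma^2 * z^2 / m^2, rounded UP to a whole number.

    Uses z in place of t, because t would depend on the
    unknown sample size through its degrees of freedom.
    """
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)  # two-sided z-score
    return math.ceil((sigma * z / margin) ** 2)


# Income scenario: within $5000 with 95% confidence.
# sigma = 20000 is a HYPOTHETICAL standard deviation, not from the notes.
n_income = sample_size_for_mean(sigma=20000, margin=5000, confidence=0.95)
print(n_income)  # 62
```

Raising the confidence level or shrinking the margin of error in this sketch makes the required sample size grow, which matches the behavior of the formula.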