Chapter 7: Sampling Distributions

Chapter 7 is about two distributions:
1. Sampling Distribution of the Sample Mean
2. Sampling Distribution of the Sample Proportion

These sound complicated, but they really aren’t that bad. Before we get to them, let’s review what we learned in Section 6.2. We learned how to find probabilities like this:

Test scores are normally distributed with a population mean of 82 and a population standard deviation of 10. What is the probability that a student scored higher than an 85?

To do this, we would use the Normal Calculator in StatCrunch. Notice that in this example we are finding the probability for just one student. Soon we will learn how to solve the following problem:

Test scores are normally distributed with a population mean of 82 and a population standard deviation of 10. A random sample of five students is taken. What is the probability that the average test score for these five students is above an 85?

So we don’t just care about finding the probability that one student scored above 85; we want to find the probability that the average score for five students would be above 85.

We will be able to do this using our StatCrunch Normal Calculator, but instead of looking at the distribution of the individual values (like we looked at before), we need the distribution of the possible sample means we could get from samples of a certain size. We just need to create a curve of all the possible sample means we could get.

Example
Let’s say that the population we are interested in is everyone in this class. And let’s say we are interested in the average number of siblings for the class. Suppose I told you that I got everyone’s information, and the average number of siblings for everyone in the class is µ = 1.5. For now, assume that the number of siblings is normally distributed. This would be a population mean because it is the average for the whole class.
Now, what if I took a sample of 5 people, asked them how many siblings they had, and averaged the values for these 5 people? Let’s do it:

Person    # Siblings
1         ____
2         ____
3         ____
4         ____
5         ____

The average from these 5 people would be a sample mean because it is just for these 5 people, not everyone in the class. And the calculated sample mean for this sample is x̄ = _______

Let’s do it one more time for 5 more people:

Person    # Siblings
1         ____
2         ____
3         ____
4         ____
5         ____

x̄ = ______

Sometimes we will get a sample mean above the population mean, sometimes we will get a sample mean below the population mean, and occasionally we might get a sample mean right at the population mean.

Here’s the idea: let’s say I went and did this for every combination of 5 people, and calculated a sample mean for every combination of 5 people. What we are interested in is seeing what the distribution of all these possible sample means would look like. That distribution is what we call the Sampling Distribution of the Sample Mean.

As an illustration, suppose we took 100 total samples, each of size 5, and discovered that the following were reported averages: 1.2, 1.4, 0.8, 1.6, 1.0, 1.6, 1.8, 1.0, 1.2, 1.4. (The remaining sample means are not shown!) Now let’s consider all sample averages we found and plot them on a graph:

[Figure: Sampling Distribution of the Sample Mean of Number of Siblings. Plot of the 100 sample means; horizontal axis from 0.5 to 2.5, vertical axis from 0 to 14.]

This is a graph of all 100 sample means we found. Recall that we assumed that the number of siblings an individual has was normally distributed, and that the population mean was µ = 1.5. What can you say about the above graph? It too seems to have an approximately normal distribution, and furthermore, the center of the graph appears to be 1.5 as well. The above graph approximates the sampling distribution of the sample mean, since it is a plot of the means of the 100 different samples we took.
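The repeated-sampling idea above can be imitated with a short simulation. This is only a sketch under stated assumptions: the population mean 1.5 comes from the notes, but the population standard deviation of 1.0 is a made-up value (the notes never give one), and the names `draw_sample_mean`, `POP_SD`, and so on are hypothetical.

```python
import random
import statistics

random.seed(7)  # fixed seed so the sketch is repeatable

POP_MEAN = 1.5     # population mean from the notes
POP_SD = 1.0       # ASSUMPTION: the notes do not give a population standard deviation
SAMPLE_SIZE = 5    # each sample has 5 people
NUM_SAMPLES = 100  # we take 100 samples, as in the illustration


def draw_sample_mean(n):
    """Draw n values from the assumed normal population and return their mean."""
    sample = [random.gauss(POP_MEAN, POP_SD) for _ in range(n)]
    return statistics.mean(sample)


# One sample mean per sample: this list plays the role of the plotted graph.
sample_means = [draw_sample_mean(SAMPLE_SIZE) for _ in range(NUM_SAMPLES)]

center = statistics.mean(sample_means)
spread = statistics.stdev(sample_means)

print(f"mean of the {NUM_SAMPLES} sample means: {center:.2f}")
print(f"standard deviation of the sample means: {spread:.2f}")
```

With these assumed inputs, the center of the simulated sample means should land near 1.5, and their spread should be noticeably smaller than the spread of individual values.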
As we’ll see soon, it turns out the sampling distribution will be approximately normal under certain conditions, with the same mean as the population.

To find probabilities involving these sample means, like the probability that a sample of five people will have an average of more than 2 siblings per person, we will have to use the StatCrunch Normal Calculator. So we will need three things:
1. What is the mean, or in other words, the overall average of all these possible sample means?
2. What is the standard deviation, or in other words, the spread of the values for all these possible sample means?
3. What is the shape of the distribution of these sample means? To use the Normal Calculator, they need to be normally distributed.

We will talk about these three things, and how to calculate these probabilities, soon.

7.1-7.2: Sampling Distribution of the Sample Mean

The general idea behind obtaining the sampling distribution of the sample mean is:
1. Obtain a simple random sample of size n.
2. Compute the sample mean, x̄.
3. Assuming that we are sampling from a finite population, repeat steps 1 and 2 until all simple random samples of size n have been obtained.

This is easy to do with a finite population with very few values. Let’s look at an example:

Example: Draw all possible samples of size 2 from the population {2,4,6,8}. Construct the sampling distribution of the sample mean. What is the probability that we would get a sample mean of 5 from a sample of size 2 from this population?

However, most of the time we will not have all the values from a population, because most populations we look at are very large. So to find probabilities involving sample means, like we mentioned before, we need to know what the distribution looks like. We need to know the mean, the standard deviation, and the shape. First, recall from earlier chapters that the population mean is µ, while the population standard deviation is σ.

1. Mean
The average or mean of all the possible sample means we could get will ALWAYS be equal to the overall population mean. In other words, the mean of the sampling distribution is µ. Check that this is true in the {2,4,6,8} example above.

2. Standard Deviation
When we are talking about the spread, or standard deviation, of a sampling distribution, we call it the standard error. Many students get confused by this term, but just think of the standard error as a type of standard deviation. It measures the spread of a sample statistic like the sample mean; that is, it places a value on the spread of all the possible sample mean values.

Definition
The standard error is literally the standard deviation of the sampling distribution of the sample mean. When we are talking about the sampling distribution of the sample mean, the standard error = σ/√n.

Why? Individual values in a population are going to be all over the place. Some will be high, some will be low, and we calculate their standard deviation to be σ. However, sample means are not going to be as spread out. Yes, we might have some high values and we might have some low values in a sample, but for the most part these values are going to average out close to the mean. So the standard error, or spread, is going to be smaller than σ; it is actually σ/√n.

3. Shape
To find the probabilities of sample means that we are looking for, we need to make sure that the distribution of these sample means is bell-shaped, or normal. To check for this, we need at least one of the following statements to be true. They can both be true, but we only require one:
1. If the population is normally distributed, then the sampling distribution of the sample mean is normally distributed as well, regardless of sample size.
2. If we are using a large enough sample size (usually we say n greater than 30), the sampling distribution of the sample mean is approximately normal, regardless of the distribution of the population. (This is the Central Limit Theorem.)

Summary (Sample Mean)
Consider a population of values with mean µ and standard deviation σ. Now suppose we select a random sample of size n and compute x̄. Then the following are true:
(1) The mean of the sampling distribution is µ.
(2) The standard deviation of the sampling distribution (which is the standard error) is σ/√n.
(3) The shape of the sampling distribution will be normal if at least one of the following is true:
(a) The population we sampled from is normal, or...
(b) The sample size is large: n > 30 (Central Limit Theorem)
If both of these are false, then we can make no conclusion about the sampling distribution’s shape.

Based on all these things we have talked about, we can now find probabilities involving sample means in StatCrunch. We just have to make the following changes:
1. Make sure the sampling distribution of the sample mean is normally distributed. We do this by checking if the population we are sampling from is normal OR if we are using n > 30.
2. When we want to find probabilities/areas under the curve involving sample means, we will still put in µ for the Mean in StatCrunch, but we will now put in σ/√n for the standard deviation. Other than that, everything else will be the same.

We use this new standard deviation (called the standard error) because sample means are not as spread out as individual data values, so the new standard deviation is σ/√n rather than just σ.

Example
Suppose a single value is selected from a normal population with mean µ = 5 and standard deviation σ = 1. Use StatCrunch to find the probability that the value is greater than 5.5. (This is exactly what we did in Chapter 6.)

Now suppose a sample of size 25 is selected from this population.
Use StatCrunch to find the probability that the sample mean for these 25 values is greater than 5.5. What do we need to change to do this? (Notice that the standard deviation we use in our StatCrunch calculator is now the standard error.)

Example
Suppose a simple random sample of size n = 36 is obtained from a population with µ = 30 and σ = 12.
a) Describe the shape of the sampling distribution of the sample mean. Why is it this shape?
b) What is the probability that the sample mean is greater than 34?
c) What is the probability that the sample mean is less than 28?

Example
Suppose the scores on Test 1 have a mean µ = 82 and standard deviation σ = 10. Suppose we take a sample of n = 25 students.
a) If we took many samples of 25 students, and computed a sample mean test score for each sample, what would the standard deviation (spread) for these sample mean test scores be? What do we call this?
b) If we wanted to calculate probabilities involving these sample means, what must be true regarding the population?
c) Assuming that the condition in (b) is met, what is the probability that the sample mean of test scores for a sample of n = 25 randomly selected students is higher than 83?

Example
The weight of a carton of strawberries is skewed right with mean 14 ounces and standard deviation 1.5 ounces. Suppose we take a random sample of 16 strawberry cartons. Can we use the normal calculator to find the probability that the sample average of the weights of these 16 cartons will be greater than 16 ounces? Why or why not?

Example
Consider a population that is normal with mean 50 and standard deviation 4. Also suppose we select a sample of size n = 625. Which graph below represents the population? Which graph below represents the sampling distribution of the sample mean?

7.3: How can we make predictions about a population?

We are looking at three different distributions:
1. Population Distribution: the entire distribution from which we take the sample.
2. Sample (Data) Distribution: the distribution of the sample data for a particular given sample. The shape of the sample tends to mirror the population. For instance, if the population is normal, then one sample drawn from it will look roughly normal as well. On the other hand, if a distribution is skewed left, one sample from it will tend to be skewed left also.
3. Sampling Distribution: the probability distribution of a sample statistic, such as a sample mean. It is a distribution of all the possible values for the sample statistic. Its shape will be approximately normal under the conditions previously mentioned.

Example
Consider our good friends from Monty Python and the Holy Grail...The Knights Who Say “Ni!” The distribution of the number of times in one minute that a knight says “Ni!” is skewed to the right with population mean µ = 5.2 and population standard deviation σ = 3.0.

These values are not known to King Arthur, who randomly chooses various minutes and counts how many times in each selected minute a knight says “Ni!” For a random sample of 36 different minutes, he gets a mean of 4.6 and a standard deviation of 3.2.
a) What are the mean, standard deviation, and shape of the population distribution?
b) What are the mean, standard deviation, and shape of King Arthur’s sample?
c) What are the mean, standard error, and shape of the sampling distribution of x̄?
d) Now suppose King Arthur experiences a burst of courage and decides to select 90 minutes at random. He records how many times in each minute a knight says “Ni!”. So now n = 90. What are the new mean and standard error of the sampling distribution?

Two important points:
1) When we increase sample size, the mean of the sampling distribution does not change.
2) However, as n increases, standard error decreases.
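The second point can be checked with a few lines of arithmetic. The sketch below plugs the knights’ population standard deviation σ = 3.0 into the formula σ/√n for the two sample sizes in the example; the function name `standard_error_of_mean` is just an illustrative choice.

```python
import math


def standard_error_of_mean(sigma, n):
    """Standard error of the sample mean: sigma divided by the square root of n."""
    return sigma / math.sqrt(n)


SIGMA = 3.0  # population standard deviation from the knights example

se_36 = standard_error_of_mean(SIGMA, 36)
se_90 = standard_error_of_mean(SIGMA, 90)

# The mean of the sampling distribution stays at mu = 5.2 for both sample sizes;
# only the spread changes.
print(f"n = 36: standard error = {se_36:.3f}")
print(f"n = 90: standard error = {se_90:.3f}")
```

Going from n = 36 to n = 90 shrinks the standard error from 0.5 to about 0.316, while the center of the sampling distribution does not move.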
RECAP: Notation and Terminology
The following table summarizes the symbols we have used for each of the three types of distributions.

Term                 Population    Sample    Sampling Distribution
Mean                 μ             x̄         μ
Standard Deviation   σ             s         σ/√n

Sample standard deviation (s) measures the spread of data values in just one sample. Standard error is a type of standard deviation that measures the spread of the possible sample statistic values. For example, standard error measures how spread out the possible sample means are from different samples.

7.1: SAMPLING DISTRIBUTION OF THE SAMPLE PROPORTION

There is one other Sampling Distribution we are interested in, the Sampling Distribution of the Sample Proportion. Instead of looking at an average from a sample, maybe we want the proportion of a sample that fits a certain category.

Example
Let’s say again that the population we are interested in is everyone in this class. And let’s say we are interested in the proportion of women in the class. Imagine that we already got everyone’s information, and the proportion of women in the class is p = 0.55. This would be a population proportion because it is the proportion for the whole class, and we use the letter “p” to represent this population proportion.

Now suppose we took a sample of ten people, found their gender, and found the proportion of women for these 10 people.

#1 ___  #2 ___  #3 ___  #4 ___  #5 ___  #6 ___  #7 ___  #8 ___  #9 ___  #10 ___

The proportion from these 10 people would be a sample proportion because it is just for these 10 people, not everyone in the class.

And the calculated sample proportion for this sample is p̂ = _______. This is the notation we use to denote a sample proportion. It is pronounced “p-hat”.

Let’s do it one more time for 10 more people:

#1 ___  #2 ___  #3 ___  #4 ___  #5 ___  #6 ___  #7 ___  #8 ___  #9 ___  #10 ___

p̂ = _______.
Sometimes we will get a sample proportion above the population proportion, sometimes we will get a sample proportion below the population proportion, and occasionally we might get a sample proportion right at the population proportion.

Here’s the idea: let’s say I went and did this for every combination of 10 people, and calculated a sample proportion for every combination of 10 people. What we are interested in is seeing what the distribution of all these possible sample proportions would look like. That distribution is what we call the Sampling Distribution of the Sample Proportion.

Example
I might know that for all STAT 2000 students, the proportion of the population that got an A on Test 1 is .45. This is the population proportion. I can also take samples from this population and get sample proportions. I might take a sample of 30 students and find that 12 out of 30 got an A, which would give me a sample proportion of 12/30 = 0.40. Or I might take another sample of 30 students and find that in that sample 15 out of the 30 students got an A, giving me a sample proportion of 15/30 = 0.50. The population proportion will always be .45, but the sample proportion changes from sample to sample.

Example
Select all possible samples of size 2 from the population {1,2,3,4}. Calculate the proportion of even numbers in each sample.

Notation
Just like we have a population mean, µ, we have a Population Proportion: p

Example
In the dataset {1,2,3,4}, calculate the population proportion for the proportion of even numbers in the dataset.

Notation
Just like we have a sample mean, x̄, we have a Sample Proportion: p̂ = x/n, where x is the number of individuals in the sample with the specified characteristic, and n is the sample size.
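The {1,2,3,4} exercise is small enough to enumerate by machine. The sketch below assumes samples are drawn without replacement and that order does not matter (the notes do not say which convention to use), and the helper name `proportion_even` is hypothetical.

```python
from itertools import combinations

population = [1, 2, 3, 4]


def proportion_even(values):
    """Return x / n, where x counts the even numbers and n is the number of values."""
    x = sum(1 for v in values if v % 2 == 0)
    n = len(values)
    return x / n


# Population proportion of even numbers.
population_p = proportion_even(population)

# ASSUMPTION: samples of size 2 are drawn without replacement, order ignored.
samples = list(combinations(population, 2))
p_hats = {sample: proportion_even(sample) for sample in samples}

for sample, p_hat in p_hats.items():
    print(f"sample {sample}: p-hat = {p_hat}")

# Average of all possible sample proportions.
mean_p_hat = sum(p_hats.values()) / len(p_hats)
print(f"population proportion = {population_p}, mean of p-hats = {mean_p_hat}")
```

Under these assumptions there are six possible samples, and the average of their sample proportions matches the population proportion.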
Example
For the sample {2,3}, which we found in the earlier example, our x would be 1 because we have one even number, and our n would be 2 because our sample size is 2. So our p̂ would be 1/2 or 0.5.

The Sampling Distribution of the Sample Proportion
Just like the previous sampling distribution of the sample mean, this new one also has a center, a spread, and an approximately normal shape under certain conditions. It can be shown that the following properties hold:
(1) The mean is equal to the population proportion: p (not a sample proportion)
(2) The standard deviation of the sampling distribution (also known as the standard error) is given by √(p(1 − p)/n).
(3) The sampling distribution of p̂ will be approximately normal when n is large. How large? We require both of these properties to be true:
(a) n*p > 15, and
(b) n*(1 - p) > 15

Be careful about the standard error. We have two possible expressions now: σ/√n and √(p(1 − p)/n). The difference is that σ/√n is used for the sampling distribution of the sample mean, while √(p(1 − p)/n) is used for the sampling distribution of the sample proportion. Therefore it is critical that we identify the problem as a means or proportions problem first; otherwise we run the risk of using the wrong formulas!

Example
Consider a very large population of adults where approximately 45% of the adults enjoy playing DDR (Dance Dance Revolution). Suppose samples of size 275 are selected from this population, and the value of p̂ is recorded for each sample.
a) Will the sampling distribution of p̂ be approximately normal?
b) If so, what are the mean and standard error for this distribution? In other words, find the center and spread.
c) Using the Empirical Rule, about 68% of sample proportion values will be between _____ and _____.
d) Using the Empirical Rule, almost all sample proportion values will be between _____ and _____.
e) What is the probability of getting a sample proportion of .50 or higher (where 50% or more of the sample enjoy playing DDR) from a random sample of 275 people?

To answer this question, we have already established that the sampling distribution is approximately normal, so we can use the normal calculator. We put in the mean (p = .45) and the standard error (s.e. = .03), then find the probability above .50.

f) Would a sample proportion value of 0.60 be unusual? What would obtaining a sample proportion of 0.60 say about our population proportion? (Similar to HW!)

We can answer this question in two ways. First, we can actually find the probability just like we did in question (e), only this time above .60. The probability of getting a sample proportion of .60 or higher is a mere .00000028665 (StatCrunch displays it in scientific notation), so this is a highly unusual event. In other words, if the population proportion really were .45 and the standard error really were .03, it would be very unusual to observe a sample proportion as extreme as .60.

Now suppose we actually did obtain p̂ = .60. Then one of two statements must be true:
(1) Either we have observed an extremely rare event, or...
(2) The assumption that the population proportion p = .45 most likely is not correct after all. This is the more likely explanation. In other words, we would have strong evidence that the population proportion is not .45, as originally thought.

The second way to answer this question is to use our Empirical Rule estimates. Recall that almost all the sample proportions will lie between .36 and .54 (three standard errors on either side of .45). That means that a sample proportion computed from a sample will almost certainly be some number between .36 and .54. Therefore it would be very unusual to observe anything less than .36 or higher than .54. The value of interest, .60, lies above our “cutoff” of .54, so it would be highly unusual to observe it.
So, we would reach the same conclusion: the assumption that the population proportion is .45 most likely is incorrect after all.

Summary (Sample Proportion)
Consider a population in which the population proportion is p. Now suppose we select a random sample of size n and compute p̂. Then the following are true:
(1) The mean of the sampling distribution is p.
(2) The standard deviation of the sampling distribution (which is the standard error) is √(p(1 − p)/n).
(3) The shape of the sampling distribution will be normal if both of the following are true:
(a) n*p > 15
(b) n*(1 - p) > 15
If either of these is false, then we can make no conclusion about the sampling distribution’s shape.

RECAP: Notation and Terminology
The following table summarizes the symbols we have used for proportions and the sampling distribution.

Term             Population    Sample    Sampling Distribution
Mean             p             p̂         p
Standard Error   ---           ---       √(p(1 − p)/n)

Finally, here is an overall summary table comparing both sampling distributions. Let me stress again how important it is to first identify the problem as a means or proportions problem!

                         Mean Situation                  Proportion Situation
Population Parameter     μ                               p
Sample Statistic         x̄                               p̂
Mean of Sampling Dist.   μ                               p
Standard Error           σ/√n                            √(p(1 − p)/n)
Normal if...             (1) Population is normal, or... (1) n*p > 15, and...
                         (2) n > 30 (CLT)                (2) n*(1 - p) > 15

Chris O'Neal Test 2 Notes (Chapter 7)
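The two columns of the summary table can be sketched as two small helper functions, followed by a stand-in for the StatCrunch Normal Calculator. This is an illustration only: the function names are hypothetical, and Python’s `statistics.NormalDist` is swapped in for StatCrunch, so results may differ slightly from StatCrunch output in the last decimal places. The inputs are taken from examples in these notes (n = 36, µ = 30, σ = 12; the DDR example with p = .45, n = 275; and the strawberry cartons).

```python
import math
from statistics import NormalDist


def mean_sampling_dist(mu, sigma, n, population_is_normal=False):
    """Return (mean, standard error, normal?) for the sample mean."""
    standard_error = sigma / math.sqrt(n)
    is_normal = population_is_normal or n > 30  # normal population OR large n (CLT)
    return mu, standard_error, is_normal


def proportion_sampling_dist(p, n):
    """Return (mean, standard error, normal?) for the sample proportion."""
    standard_error = math.sqrt(p * (1 - p) / n)
    is_normal = n * p > 15 and n * (1 - p) > 15  # both conditions are required
    return p, standard_error, is_normal


def prob_above(center, standard_error, cutoff):
    """Area to the right of cutoff under a normal curve (the calculator step)."""
    return 1 - NormalDist(center, standard_error).cdf(cutoff)


# Means problem: n = 36, mu = 30, sigma = 12, probability the sample mean exceeds 34.
mean_center, mean_se, mean_ok = mean_sampling_dist(30, 12, 36)
prob_mean_above_34 = prob_above(mean_center, mean_se, 34)

# Proportions problem: p = .45, n = 275, probability p-hat is .50 or higher.
prop_center, prop_se, prop_ok = proportion_sampling_dist(0.45, 275)
prob_prop_above_50 = prob_above(prop_center, prop_se, 0.50)

# Strawberry cartons: skewed population with n = 16, so neither condition holds.
_, _, straw_ok = mean_sampling_dist(14, 1.5, 16, population_is_normal=False)

print(f"means: s.e. = {mean_se:.2f}, normal? {mean_ok}, P = {prob_mean_above_34:.4f}")
print(f"proportions: s.e. = {prop_se:.2f}, normal? {prop_ok}, P = {prob_prop_above_50:.4f}")
print(f"strawberries: normal calculator appropriate? {straw_ok}")
```

Keeping the two situations in separate functions mirrors the advice in the notes: decide first whether the problem is about means or proportions, then use that column’s standard error and normality check.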