Holly McWilliams
Statistics 201
24 November 2009
Zaretski

Statistics Project 2

1)
a. There are millions of college students throughout the United States and the entire world, so our few statistics classes cannot represent college students as a whole.
b. UT, although smaller than the entire college population, still has a huge number of students, and our statistics classes are only a few of many UT classes, so they are not a good representation of UT students as a whole.
c. A good number of students did not complete the survey. There could be a bias among those who did complete it, so their answers would not be representative of everyone in the statistics classes.

2) Confidence interval for a proportion:
I chose to analyze what year my fellow students are in school, focusing on the proportion of juniors in Statistics 201. The proportion is 41.35%.

A) 5. What year are you?

Frequencies
Level        Count   Prob
Freshman        11   0.01903
Sophomore      251   0.43426
Junior         239   0.41349
Senior          65   0.11246
Other           12   0.02076
Total          578   1.00000

B) My student ID ends in 2, so my sample size is 47. I am 90% confident that the true proportion of Statistics 201 students who are juniors falls between 29% and 52%.

Frequencies
Level        Count   Prob
Sophomore       20   0.42553
Junior          19   0.40426
Senior           7   0.14894
Other            1   0.02128
Total           47   1.00000

90% Confidence Intervals
Level        Count   Prob      Lower CI   Upper CI   1-Alpha
Sophomore       20   0.42553   0.314162   0.545008   0.900
Junior          19   0.40426   0.294854   0.524079
Senior           7   0.14894   0.082813   0.253277
Other            1   0.02128   0.004761   0.089907
Total           47

C) This shows that the success/failure condition holds, so the confidence interval above is appropriate.
np = (47)(.40) = 18.8 > 10
nq = (47)(.60) = 28.2 > 10

D) The true proportion of juniors from part A, .4135, is between .29 and .52, so yes, the true p is in the 90% confidence interval.
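
As a rough check on the 90% interval for juniors in part 2, the sketch below recomputes it from the sample counts (19 juniors out of 47). The function names are my own, and the choice of formulas is an assumption: the first is the textbook interval, p-hat plus or minus z* times the standard error, and the second is the score (Wilson) interval, which I am only guessing is what the software reports because its limits land close to the output above.

```python
from math import sqrt
from statistics import NormalDist


def wald_interval(successes, n, confidence=0.90):
    """Textbook interval: p_hat +/- z* * sqrt(p_hat * q_hat / n)."""
    p_hat = successes / n
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    margin = z * sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - margin, p_hat + margin


def wilson_interval(successes, n, confidence=0.90):
    """Score (Wilson) interval; its center is pulled slightly toward 0.5."""
    p_hat = successes / n
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    denom = 1 + z**2 / n
    center = (p_hat + z**2 / (2 * n)) / denom
    margin = z * sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2)) / denom
    return center - margin, center + margin


def success_failure_ok(successes, n, minimum=10):
    """Success/failure condition: at least `minimum` successes and failures."""
    return successes >= minimum and (n - successes) >= minimum


# Sample counts from part B: 19 juniors in a sample of 47.
juniors, n = 19, 47
print("Wald:  ", wald_interval(juniors, n))
print("Wilson:", wilson_interval(juniors, n))
print("Success/failure condition holds:", success_failure_ok(juniors, n))
```

Both intervals round to about 29% to 52%, so the conclusion in part B does not depend on which formula is used.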
3) Confidence interval for a mean:

A) 18 Days per Week Consume Alcohol

Moments
Mean              1.5670725
Std Dev           1.5058685
Std Err Mean      0.0625818
upper 95% Mean    1.6899879
lower 95% Mean    1.4441572
N                 579

The average number of days per week alcohol is consumed by all respondents is 1.567.

B) n = 45 + 2 = 47

33. $ Spent on haircut

Moments
Mean              29.170213
Std Dev           30.194528
Std Err Mean      4.4043246
upper 95% Mean    38.035658
lower 95% Mean    20.304767
N                 47

The histogram is unimodal and skewed to the right, with a few outliers; the biggest one is at 150. The outliers make the right skew even more prominent. The distribution is not symmetric. To be nearly normal, a distribution must be unimodal and symmetric. Because the histogram is significantly skewed to the right, it is not nearly normal.

33. $ Spent on haircut

Mean              26.543478
Std Dev           24.504609
Std Err Mean      3.6130074
upper 95% Mean    33.820449
lower 95% Mean    19.266508
N                 46

I made another histogram without the outlier at 150. The mean has gone down from 29 to about 27. The new histogram is unimodal but still skewed to the right and not symmetric.

33. $ Spent on haircut

Mean              23.204545
Std Dev           19.131831
Std Err Mean      2.884232
upper 95% Mean    29.021154
lower 95% Mean    17.387937
N                 44

I took out two more outliers of 100, and the distribution is beginning to look more symmetric and unimodal. It is closer to nearly normal.

33. $ Spent on Haircut

Mean              21.883721
Std Dev           17.209091
Std Err Mean      2.6243618
upper 95% Mean    27.179898
lower 95% Mean    16.587544
N                 43

I took out one more outlier, and the distribution is looking even more symmetric and unimodal. If I keep taking out outliers, the mean will keep getting smaller. The distribution now looks nearly normal with a bit of skewness to the right.

95% interval: the interval goes from 16.59 to 27.18 and is centered on the sample mean of 21.88. So we can be 95% confident that the average amount of money people in our statistics class spend on a haircut is between 16.59 and 27.18 dollars.
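
The arithmetic behind the haircut interval can be sketched from the summary statistics alone. This is only an illustration: the helper names are hypothetical, and the critical value 2.018 is an assumed two-sided 95% value for 42 degrees of freedom read from a t table, not something taken from the output.

```python
from math import sqrt


def standard_error(std_dev, n):
    """Standard error of the mean: s / sqrt(n)."""
    return std_dev / sqrt(n)


def t_interval(mean, std_dev, n, t_star):
    """One-sample t interval: mean +/- t* * s / sqrt(n)."""
    margin = t_star * standard_error(std_dev, n)
    return mean - margin, mean + margin


def mean_after_dropping(mean, n, dropped):
    """Mean of the remaining values after removing the listed values."""
    total = mean * n
    return (total - sum(dropped)) / (n - len(dropped))


# Assumed: two-sided 95% critical value for 42 degrees of freedom (t table).
T_STAR_42_DF = 2.018

# Summary statistics for the final sample (N = 43) from the output above.
se = standard_error(17.209091, 43)
low, high = t_interval(21.883721, 17.209091, 43, T_STAR_42_DF)
print("Std Err Mean:", round(se, 4))
print("95% interval:", round(low, 2), "to", round(high, 2))

# Dropping the 150 outlier from the original 47 values lowers the mean.
print("Mean without 150:", round(mean_after_dropping(29.170213, 47, [150]), 4))
```

The last line shows why removing large values keeps pulling the mean down: each removed value is well above the current average.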
4) Hypothesis test regarding the difference in means for independent samples.

Distributions 02 Gender=Female
11 Your GPA

Mean              3.2715625
Std Dev           0.6576896
Std Err Mean      0.1162642
upper 95% Mean    3.5086849
lower 95% Mean    3.0344401
N                 32

Distributions 02 Gender=Male
11 Your GPA

Mean              2.9704167
Std Dev           0.7917316
Std Err Mean      0.1616115
upper 95% Mean    3.3047356
lower 95% Mean    2.6360977
N                 24

These histograms are a bit skewed, the female to the left and the male to the right, but they are both unimodal and quite close to symmetric. The female histogram is the more nearly normal of the two. The nearly normal condition holds for moderate sample sizes (between 15 and 40) as long as the data are unimodal and reasonably symmetric, and both samples (n = 32 and n = 24) fall in that range.

B. Hypothesis test: I am curious whether males have a lower GPA than females.
Ho (null hypothesis): μ(males) - μ(females) = 0
Ha (alternative hypothesis): μ(males) - μ(females) < 0

Oneway Analysis of 11 Your GPA By 02 Gender
T Test
Male-Female
Assuming unequal variances

Difference      -0.30115    t Ratio       -1.51263
Std Err Dif      0.19909    DF            44.18649
Upper CL Dif     0.10004    Prob > |t|     0.1375
Lower CL Dif    -0.70233    Prob > t       0.9313
Confidence       0.95       Prob < t       0.0687

The male-female difference in mean GPA is -0.30115, suggesting that the female GPA may be a bit higher on average than the male GPA. The standard error of the difference is 0.19909, so the observed difference is only about 1.5 standard errors below zero (t = -1.51). The p-value for the test is .0687, which is higher than the alpha of .05, so we fail to reject the null hypothesis that there is no difference in mean GPA. There is not enough evidence to conclude that males have a lower mean GPA than females. This does not prove that the null hypothesis is true.

C. A Type II error is made when the null hypothesis is not rejected even though it is false. One way to decrease the frequency of this type of error is to collect more data or evidence.

5) Contingency Analysis of 10 Parents married? By 06 Born in TN?

Mosaic Plot

Contingency Table 06 Born in TN? By 10 Parents married?
                               10 Parents married?
06 Born in TN?                 No          Yes         Total
No       Count                 65          161         226
         Expected              65.5751     160.425
         Cell Chi^2            0.0050      0.0021
Yes      Count                 103         250         353
         Expected              102.425     250.575
         Cell Chi^2            0.0032      0.0013
Total                          168         411         579

Tests
N      DF    -LogLike      RSquare (U)
579    1     0.00583144    0.0000

Test                ChiSquare    Prob>ChiSq
Likelihood Ratio    0.012        0.9140
Pearson             0.012        0.9140

Fisher's Exact Test
           Prob      Alternative Hypothesis
Left       0.4955    Prob(10 Parents married?=Yes) is greater for 06 Born in TN?=No than Yes
Right      0.5789    Prob(10 Parents married?=Yes) is greater for 06 Born in TN?=Yes than No
2-Tail     0.9255    Prob(10 Parents married?=Yes) is different across 06 Born in TN?

B. Cell 1,1 has the largest chi^2 contribution, .0050, where parents married to each other = no and born in TN = no. A large value of chi^2 means the observed counts deviate a lot from the hypothesized model, which gives a small P-value, so we would reject the null hypothesis. Here, however, every cell contribution is very small.

C. The null hypothesis: parents being married to each other and being born in TN are independent.
The alternative hypothesis: parents being married to each other and being born in TN are dependent.
The P-value = .9140
Conclusion: the p-value is greater than the alpha value of .05, so we fail to reject the null hypothesis. We do not have evidence of an association between parents still being married and being born in TN.

D. The test does not provide evidence of a relationship between the two variables, because we failed to reject the null hypothesis. It does not prove that the variables are independent.

E. A Type I error is made if we reject the null hypothesis when it is true. When you choose a level alpha, you are setting the probability of a Type I error to alpha.

6) Bivariate Fit of 25 Fastest Speed Achieved Driving By 04 Desired Weight (Lbs.)

Linear Fit
25 Fastest Speed Achieved Driving = 52.484027 + 0.333443*04 Desired Weight (Lbs.)

Summary of Fit
RSquare                       0.425222
RSquare Adj                   0.412159
Root Mean Square Error        11.83808
Mean of Response              101.0435
Observations (or Sum Wgts)    46

Analysis of Variance
Source      DF    Sum of Squares    Mean Square    F Ratio
Model        1         4561.747        4561.75     32.5513
Error       44         6166.166         140.14     Prob > F
C. Total    45        10727.913                    <.0001

Parameter Estimates
Term                        Estimate     Std Error    t Ratio    Prob>|t|
Intercept                   52.484027    8.688303     6.04       <.0001
04 Desired Weight (Lbs.)    0.333443     0.058444     5.71       <.0001

R^2 = .425222, so the model accounts for about 42.5% of the variation in fastest speed.

B. Parameter Estimates
Term                        Lower 95%    Upper 95%
Intercept                   34.973904    69.994151
04 Desired Weight (Lbs.)    0.2156576    0.4512284

We are 95% confident that the interval from .216 to .451 contains the true slope B1. The estimated slope, b1 = .333, is inside the interval. Because the interval does not contain zero, there is evidence of a relationship between the fastest speed achieved and desired weight. The interval also does not contain 1, so I can reject a null hypothesis that the slope equals 1.

C. Hypothesis test:
Ho: B1 = 1
Ha: B1 ≠ 1
A slope of one does not fall inside the interval of .216 to .451, so we reject the Ho that says B1 = 1. Since the interval also excludes zero, there is evidence of an association between the fastest speed achieved by students in the Statistics 201 class and desired weight.

D. The data are spread between weights of 110 and about 190 pounds, with an outlier at 230 pounds. This could influence the R^2 value and the correlation, and that could make the confidence interval less accurate. The scatter is not random.

E. If I found a statistically significant relationship between the variables, it would not mean that wishing you were heavier causes you to drive faster. Correlation or association does not mean causation. One variable does not cause another just because there is an association. Gender can influence both how fast people drive and how much they weigh, since males typically weigh more than females.

Bivariate Fit of 25 Fastest Speed Achieved Driving By 03 Weight (Lbs.) 02 Gender=Female

Linear Fit
25 Fastest Speed Achieved Driving = 65.967208 + 0.2098797*03 Weight (Lbs.)
Summary of Fit
RSquare                       0.147685
RSquare Adj                   0.116118
Root Mean Square Error        8.966274
Mean of Response              94.7931
Observations (or Sum Wgts)    29

Analysis of Variance
Source      DF    Sum of Squares    Mean Square    F Ratio
Model        1         376.1190        376.119     4.6784
Error       27        2170.6397         80.394     Prob > F
C. Total    28        2546.7586                    0.0396

Parameter Estimates
Term                Estimate     Std Error    t Ratio    Prob>|t|
Intercept           65.967208    13.4306      4.91       <.0001
03 Weight (Lbs.)    0.2098797    0.097033     2.16       0.0396

R^2 = .147685, so the model accounts for about 14.8% of the variation in fastest speed for females.

b. Parameter Estimates
Term                Lower 95%    Upper 95%
Intercept           38.409897    93.524519
03 Weight (Lbs.)    0.0107843    0.4089752

I am 95% confident that the interval from .01 to .41 contains the true slope B1. The estimated slope, b1 = .21, is inside the interval.

Bivariate Fit of 25 Fastest Speed Achieved Driving By 03 Weight (Lbs.) 02 Gender=Male

Linear Fit
25 Fastest Speed Achieved Driving = 59.918312 + 0.2958295*03 Weight (Lbs.)

Summary of Fit
RSquare                       0.079533
RSquare Adj                   0.018168
Root Mean Square Error        17.71756
Mean of Response              111.7059
Observations (or Sum Wgts)    17

Analysis of Variance
Source      DF    Sum of Squares    Mean Square    F Ratio
Model        1         406.8526        406.853     1.2961
Error       15        4708.6768        313.912     Prob > F
C. Total    16        5115.5294                    0.2728

Parameter Estimates
Term                Estimate     Std Error    t Ratio    Prob>|t|
Intercept           59.918312    45.69197     1.31       0.2095
03 Weight (Lbs.)    0.2958295    0.259852     1.14       0.2728

R^2 = .079533, so the model accounts for about 8% of the variation in fastest speed for males.

b. Parameter Estimates
Term                Lower 95%    Upper 95%
Intercept           -37.47183    157.30845
03 Weight (Lbs.)    -0.258033    0.8496919

We are 95% confident that the interval from -.26 to .85 contains the true slope B1. The estimated slope, b1 = .29583, is inside the interval. Because this interval contains zero (and the p-value of .2728 is above .05), there is not enough evidence of a relationship between the two variables for males.

New Residuals:

Female: The residual plot for females indicates that the data are generally distributed between 120 and 155 pounds, with four outliers at 105, 165, 175, and 185. This could possibly influence the accuracy of my results.

Males: The residual plot for males shows that weight is spread between 155 and 195 pounds, with two outliers at 145 and 215 pounds. These outliers could cause my R^2 value and confidence interval to be less accurate.
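
As a rough check on the by-gender slope intervals above, the sketch below rebuilds each 95% interval from the slope estimate and its standard error in the Parameter Estimates tables. The function names are hypothetical, and the critical values (2.0518 for 27 degrees of freedom, 2.1314 for 15) are assumed values read from a t table rather than numbers from the output.

```python
def slope_interval(estimate, std_error, t_star):
    """Confidence interval for a regression slope: b1 +/- t* * SE(b1)."""
    margin = t_star * std_error
    return estimate - margin, estimate + margin


def t_ratio(estimate, std_error, null_value=0.0):
    """Test statistic for Ho: slope = null_value."""
    return (estimate - null_value) / std_error


def excludes(interval, value=0.0):
    """True when `value` falls outside the interval."""
    low, high = interval
    return value < low or value > high


# Assumed t-table values: two-sided 95% critical values with n - 2 df
# (29 - 2 = 27 for females, 17 - 2 = 15 for males).
T_STAR = {"Female": 2.0518, "Male": 2.1314}

# (slope estimate, standard error) from the Parameter Estimates tables.
FITS = {"Female": (0.2098797, 0.097033), "Male": (0.2958295, 0.259852)}

for group, (b1, se) in FITS.items():
    ci = slope_interval(b1, se, T_STAR[group])
    print(group,
          "t ratio:", round(t_ratio(b1, se), 2),
          "interval:", [round(limit, 3) for limit in ci],
          "excludes zero:", excludes(ci))
```

An interval that excludes zero goes with a p-value below .05 for the slope, which is the case for the female fit but not for the male fit.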