Project #2: Statistical Analysis Report (Statistics 201)

Submitted To:
Professor Jamie Paul
University of Tennessee

Report Prepared By:
Miranda Ceane
Undergraduate Student, Business Department
University of Tennessee: Knoxville, TN 37916

October 10, 2016

Executive Summary

This report summarizes the analysis of the survey given to us by Professor Jamie Paul. Of the 1015 original survey responses, 980 were used in this data set. Missing observations were removed, as were entire questions that were answered incorrectly, were badly worded, or contained other biases. The purpose of this report is to document and graphically show the survey’s data and the relationships between specific variables.

The technology used for this report is JMP, statistical software that makes data analysis quick and accurate if used correctly. I also used Microsoft Excel to create the pivot charts, which cannot be done in JMP. Together, these two programs produced all of my graphs, data charts, and statistical summaries.

Section 1 of this report focuses on the size of the random sample taken from the original data set. This random sample is used throughout Sections 2 and 3 to answer the questions.

Section 2 focuses on pivot charts and tables and on interpreting data from them. In this section we see what kind of effect outliers and skewed data have on the entire data set and its graph. While I did this project on my Mac, I had to create the pivot charts and tables on a separate computer that had the Windows version of Excel.

Section 3 pertains to decision trees, histograms, and the interpretation of RSquare and correlation coefficients in terms of the relationship between variables.
Section 1

My student ID number is 000414991, so I took a random sample of 891 from the file “Project 2 JMP Data File.”

Section 2

Parts A through E

Below is a pivot table of the random sample data that displays the average high school GPA (Q7), the students’ distance from campus (Q5), and whether the students were born in Tennessee (Q3). The average high school GPA is a numerical value and has been averaged and rounded to three decimal places. Using the pivot table, I also created a pivot chart that includes all three variables (Q3, Q5, and Q7).

Part F

This pivot chart shows only some correlations, but a few conclusions can be drawn from it. First, it appears that students who were born in Tennessee have a higher average GPA than those who were not. Second, while GPAs remain high overall, those who live in the Fort generally have lower GPAs than any of the other driving categories, with the exception of those who have a 10-30 minute drive to campus and were not born in Tennessee. The chart also shows that the students who have a 30+ minute drive have the highest GPAs in the entire sample.

Below is another pivot chart that shows how the graph differs when the values are changed to a count rather than an average.

Part G and H

This pivot chart shows that the students with a 30+ minute drive to campus are poorly represented in this sample. There are so few of them that their information skews the rest of the data and can lead to false conclusions about the data as a whole. Other than that, although I expected living distance from campus to affect GPA, the data shows that there is not a strong relationship between the variables.

Below is a pivot chart that has been altered to disregard the data of the 21 students who have a 30+ minute commute to campus.

Part I

Section 3

Part A

Below is a histogram of a quantitative variable I am interested in: “Q17-Fluent Languages”.
The x-axis shows how many languages an individual may know, and the y-axis shows how many people in the survey speak that number of languages.

Part B

Below is a decision tree with three different splits from the variable “Q17-Fluent Languages”.

Part C

The relationship between the variables is positive. This can be seen by looking at the graph, which is not linear but is clearly increasing along the y-axis. The RSquare value indicates the strength of the relationship. The RSquare value for this decision tree is small, at only 0.085, so the splits explain about 8.5% of the variability in “Q17-Fluent Languages”. This means that the relationship between the variable “Q17-Fluent Languages” and the split variables is weak but positive.

Part D

Part E

The value of RSquare is 0.021741, which translates to 2.17%. This value measures the strength of the association by indicating how close the data is to the regression line. The higher the RSquare value, the more of the variability in the data is explained. In this case, 2.17% of the variability in the data is explained.

Part F

The data does not show a strong association: R is 0.147 (rounded to three decimal places). R is the correlation coefficient; its magnitude is the square root of RSquare, and its sign matches the sign of the slope. The closer R is to 1 or -1, the stronger the linear correlation. Strictly speaking, “statistical significance” is judged from a p-value, which also depends on the sample size, so R alone measures strength rather than significance.

Part G

The direction of the association is positive, as also stated in Part C. The sign of the slope determines the direction of the association. The slope in the bivariate fit chart is 0.0295092, which is a small but positive number.

Working with this survey’s data set has been very interesting. I hope these charts and conclusions aid you in your statistical analyses. If you have any questions, feel free to contact me.

Regards,
Miranda Ceane
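
Supplementary note on Section 3, Parts E through G: the arithmetic linking RSquare, R, and the slope can be checked with a short Python sketch. The two numbers are the ones reported in Parts E and G. The function name and the use of Python instead of JMP are assumptions made for illustration only; this is not how the original analysis was run.

```python
import math


def correlation_from_rsquare(r_square: float, slope: float) -> float:
    """Return R for a simple linear fit.

    The magnitude of R is the square root of RSquare, and its sign
    is taken from the slope of the fitted line.
    (Hypothetical helper name, not part of JMP.)
    """
    if not 0.0 <= r_square <= 1.0:
        raise ValueError("RSquare must lie between 0 and 1")
    return math.copysign(math.sqrt(r_square), slope)


R_SQUARE = 0.021741  # RSquare reported in Part E
SLOPE = 0.0295092    # slope of the bivariate fit reported in Part G

r = correlation_from_rsquare(R_SQUARE, SLOPE)

print(f"Variability explained: {R_SQUARE * 100:.2f}%")
print(f"Correlation coefficient R: {r:.3f}")
```

Because the slope is positive, R keeps a positive sign, and its rounded value matches the 0.147 quoted in Part F.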