Cautions about Correlation and Regression

regression- look at the scatter of the data points about the regression line
the least-squares line makes the sum of squares of the vertical distances from the points to the regression line as small as possible
these distances represent the "left-over" variation in the response after fitting the regression line
the distances are known as residuals

Residuals

A residual is the difference between an observed value of the response variable and the value predicted by the regression line.

residual = observed y - predicted y
residual = y - ŷ

Recall the fat gain versus NEA increase data
least-squares regression line: fat gain = 3.505 - (0.00344 * NEA increase)

one subject: NEA increase = 135 calories, fat gain = 2.7 kg
predicted gain: ŷ = 3.505 - (0.00344 * 135) = 3.04 kg
observed gain: y = 2.7 kg
residual = observed y - predicted y
residual = 2.7 kg - 3.04 kg = -0.34 kg

residuals for 16 data points:
0.37  -0.7   0.1   -0.34  0.19  0.61  -0.26  -0.98
1.64  -0.18  -0.23  0.54  -0.54  -1.11  0.93  -0.03

To assess the fit of a regression line you could:
• look at the vertical deviations of the data points from the regression line
• look at a residual plot (easier to study)

Residual Plots

A residual plot is a scatterplot of the regression residuals against the explanatory variable.
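Before looking at residual plots in detail, the residual arithmetic above can be checked with a short sketch. The Python below is only an illustration and is not part of the original notes: the function name predicted_fat_gain and the variable names are assumptions made here, while the coefficients, the subject's values, and the 16 residuals are copied from the notes. Because the listed residuals are rounded to two decimals, their mean comes out close to zero rather than exactly zero.

```python
# Coefficients of the least-squares line quoted in the notes:
# fat gain = 3.505 - 0.00344 * NEA increase
INTERCEPT = 3.505
SLOPE = -0.00344


def predicted_fat_gain(nea_increase):
    """Predicted fat gain (kg) for a given NEA increase (calories).

    Hypothetical helper name, introduced only for this illustration.
    """
    return INTERCEPT + SLOPE * nea_increase


# The single subject worked through in the notes.
nea_increase = 135      # calories
observed_gain = 2.7     # kg

predicted_gain = predicted_fat_gain(nea_increase)
residual = observed_gain - predicted_gain   # observed y - predicted y

# The 16 residuals listed in the notes (rounded to two decimals).
residuals = [0.37, -0.7, 0.1, -0.34, 0.19, 0.61, -0.26, -0.98,
             1.64, -0.18, -0.23, 0.54, -0.54, -1.11, 0.93, -0.03]
mean_residual = sum(residuals) / len(residuals)

print(f"predicted gain: {predicted_gain:.2f} kg")
print(f"residual:       {residual:.2f} kg")
print(f"mean of listed residuals: {mean_residual:.4f}")
```

The predicted gain and residual should agree with the hand calculation (3.04 kg and -0.34 kg), and the mean of the listed residuals should be very small, which is what a least-squares fit with an intercept produces up to rounding.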
• the mean of the residuals of a least-squares regression is always zero
  the line (residual = 0) in the residual plot corresponds to the fitted regression line (Figure 2.20)
  the residual plot magnifies the deviations from the line to make the patterns easier to see
• if the regression line catches the overall pattern of the data, there should be no pattern in the residuals (irregular scatter, randomly distributed above and below zero)
  residuals in Figure 2.20 have this irregular scatter
• if the residual plot does not show an irregular horizontal scatter, the regression line fails to capture the overall pattern of the relationship between y and x

#1) linear relationship between x and y (regression line captures the overall pattern; irregular horizontal scatter in the residual plot, as in Figure 2.20)

#2) scatterplot has a fanned-out pattern (the residual plot fans out; the regression line does not capture the fact that the vertical scatter increases as x gets bigger) (Figure 2.3 and Figure 2.21)
field measurements versus lab measurements for defects in an oil pipeline
residuals are more spread out above and below the (residual = 0) line as we move to the right
field measurements are more variable as the true defect depth increases: more vertical scatter as x gets bigger
the regression line doesn't catch the important fact that the variability of field measurements increases with defect depth (easy to see in the residual plot)

#3) scatterplot has a curved pattern (the regression line does not capture the fact that the data follow a curve)

Outliers and influential observations

diabetics must manage their blood sugar levels carefully
FPG- daily measure
HbA- one-time measurement made at a regular medical check-up
expect a positive association between the variables
Figure 2.22- regression of FPG (y) on HbA (x)
correlation r = 0.4819
the correlation is surprisingly low, possibly because of outliers
Subject 15- dangerously high FPG (large residual)
Subject 18- large HbA (x) value, but close to the regression line
outlier- an observation that lies outside the overall pattern of the other observations
an observation is influential for a statistical calculation if removing it would markedly change the result of the calculation
to assess influence on the regression line, run the regression both with and without the suspicious observation

Subject 15 weakens the linear pattern
drop Subject 15: r increases from 0.4819 to 0.5684 (with Subject 15 included, the correlation is lower)
Subject 18 extends the linear pattern
drop Subject 18: r drops from 0.4819 to 0.3837 (with Subject 18 included, the correlation is higher)
omitting Subject 18 moves the regression line up
omitting Subject 15 moves the regression line down
the influence is not very large in either case
the influence of Subject 15 is larger in this sense:
the regression line without Subject 15 and the regression line with all data are "far" apart
the regression line without Subject 18 and the regression line with all data are "close" together

Beware the lurking variable

Limitations of correlation and regression:
• correlation measures only linear association; fitting a straight line makes sense only when the overall pattern is linear
• extrapolation (using a fitted model far outside the range of the data that we used to fit it) often produces unreliable predictions
• correlation and least-squares regression are not resistant
• the relationship between two variables can be understood only by taking other variables into account

lurking variable- a variable that is not among the explanatory or response variables in a study and yet may influence the interpretation of relationships among those variables

EX 1- strong correlation between how much math minority students took in HS and their later success in college
"math is the gatekeeper for success in college"? maybe. lurking variable?

EX 2- private health spending and goods imported (Figure 2.2), r = 0.9749
lurking variable?
"nonsense correlation"- no causal relationship

Association does not imply causation

An association between an explanatory variable x and a response variable y, even if it is very strong, is not by itself good evidence that changes in x actually cause changes in y.

Lurking variables can hide the true relationship between x and y

EX- a study of housing and health in a city in England
x- index of overcrowding (higher #- more overcrowding)
y- index of lack of indoor toilets (higher #- fewer indoor toilets)
expect a high correlation (both are measures of inadequate housing), but r is only 0.08
some wards are dominated by public housing (always have indoor toilets)- high x, low y
some wards lack public housing (lack of indoor toilets)- high x, high y
Because the relationship between x and y differed in the two types of wards, analyzing all wards together obscured the nature of the relationship. (Figure 2.26)

The restricted-range problem

If both x and y respond to the same underlying unmeasured variables, x may help us predict y even though x has no direct influence on y.

How well do SAT scores and HS grades predict college grades?

Correlation r between explanatory variables and college GPA:
  SAT scores               r = 0.36
  HS grades                r = 0.42
  SAT scores + HS grades   r = 0.52

0.52² = 0.27. SAT scores plus HS grades explain only 27% of the variation in college GPAs among college students.

What is going on?
• Princeton students- high SAT scores and high HS GPA
• generic state school- middle range of SAT scores and HS GPA
both sets of students receive the full spread of grades
Princeton- elite HS students get As, Bs, and Cs
Princeton- if average HS students were admitted, we suspect they would primarily receive Cs
state school- average HS students get As, Bs, and Cs
state school- if elite HS students attended, we suspect they would primarily receive As

If all types of HS students attended all types of colleges, there would be a much stronger relationship between HS performance and college performance.
Restricted-range problem- the data do not contain information on the full range of both the explanatory and response variables. When data suffer from restricted range, r and r² are lower than they would be if the full range could be observed.
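The effect of a restricted range on r can be illustrated with a small simulation. This is a hedged sketch using made-up data, not the SAT or GPA data discussed above: the slope 0.8, the noise level 0.6, the cutoff of 1.0, the sample size, the seed, and the helper name pearson_r are all assumptions chosen for illustration. It generates x and y with a fairly strong linear relationship, then recomputes r using only the cases with high x, loosely mimicking a selective college that admits only high scorers.

```python
import math
import random


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences.

    Hypothetical helper written out by hand for this illustration.
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


random.seed(0)  # arbitrary seed so the run is repeatable

# Synthetic data (assumed, not from the notes): y is linear in x plus noise.
xs = [random.gauss(0, 1) for _ in range(5000)]
ys = [0.8 * x + random.gauss(0, 0.6) for x in xs]

# Full range: every simulated case is included.
r_full = pearson_r(xs, ys)

# Restricted range: keep only cases with x above an assumed cutoff of 1.0.
restricted = [(x, y) for x, y in zip(xs, ys) if x > 1.0]
r_restricted = pearson_r([x for x, _ in restricted],
                         [y for _, y in restricted])

print(f"full range:       r = {r_full:.2f}, r^2 = {r_full ** 2:.2f}")
print(f"restricted range: r = {r_restricted:.2f}, r^2 = {r_restricted ** 2:.2f}")
```

With these assumed settings the restricted-range r should come out clearly lower than the full-range r, even though the underlying relationship between x and y is the same in both cases.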