Qihua Fan 24 March 2010 Armagan STAT 121 Chapter 6: Exercises 2, 3, 4, 6, 13 Exercise 2 The two x-variables are examined separately. RISK In the scatterplot of studentized residuals of the model PROFIT~RISK+RD, there are no apparent patterns visible. H0: The relationship is linear. Ha: The relationship is not linear. > lm1=lm(PROFIT~RISK,data=d2) > lm2=lm(PROFIT~factor(RISK),data=d2) > anova(lm1,lm2) Analysis of Variance Table Model 1: PROFIT ~ RISK Model 2: PROFIT ~ factor(RISK) Res.Df RSS Df Sum of Sq F Pr(>F) 1 16 180072 2 8 51287 8 128786 2.5111 0.1072 Conclusion: Fail to reject H0 for RISK. The linearity assumption has not been violated for the RISK variable. RD In the scatterplot of the studentized residuals vs. RD for the model PROFIT~RISK+RD, there appears to be a curvilinear relationship between the studentized residuals vs. RD, suggesting that the linear model does not account for all of the variation in PROFIT. Since the pure error lack-of-fit test is only applicable for x-variables with repeated values, this test is not applicable to RD. Correct for the observed curvilinear relationship by adding the term RD^2. > d=read.csv("RD6.txt") > RD.CENTERED=d[,3]/(mean(d[,3])) > d2=cbind(d,RD.CENTERED) > lm3=lm(PROFIT~RISK+RD.CENTERED+I(RD.CENTERED^2),data=d2) > summary(lm3) Call: lm(formula = PROFIT ~ RISK + RD.CENTERED + I(RD.CENTERED^2), data = d2) Residuals: Min 1Q Median 3Q Max -7.35656 -2.00760 -0.07875 2.73912 5.17087 Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) -245.3696 14.8111 -16.567 1.36e-10 *** RISK 23.2492 0.9884 23.522 1.18e-12 *** RD.CENTERED 97.6313 22.3672 4.365 0.000647 *** I(RD.CENTERED^2) 162.7524 10.6740 15.248 4.10e-10 *** --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 Residual standard error: 3.538 on 14 degrees of freedom Multiple R-squared: 0.9995, Adjusted R-squared: 0.9994 F-statistic: 9708 on 3 and 14 DF, p-value: < 2.2e-16 Check to see if this model is an improvement over the original. For the model PROFIT~RISK+RD.CENTERED+RD.CENTERED^2, there no longer appears to be a curvilinear relationship between the studentized residuals and RD.CENTERED. H0: The variable RD.CENTERED^2 significantly contributes to explaining the variations observed in PROFIT. Ha: The variable RD.CENTERED^2 does not significantly contribute to explaining the variations observed in PROFIT. > lm3=lm(PROFIT~RISK+RD.CENTERED+I(RD.CENTERED^2),data=d2) > lm4=lm(PROFIT~RD.CENTERED+RISK,data=d2) > anova(lm3,lm4) Analysis of Variance Table Model 1: PROFIT ~ RISK + RD.CENTERED + I(RD.CENTERED^2) Model 2: PROFIT ~ RD.CENTERED + RISK Res.Df RSS Df Sum of Sq F Pr(>F) 1 14 175.24 2 15 3085.39 -1 -2910.2 232.49 4.097e-10 *** Conclusion: The model PROFIT~RISK+RD.CENTERED+RD.CENTERED^2 appears to be an improvement over the original model. Addition of the RD.CENTERED term is significant in accounting for the variations observed in PROFIT, and appears to have taken care of the curvilinear relationship previously observed in the residual plot; the linearity assumption is no longer violated. Exercise 3 Create a matrix [current, prev]. > prev=d3[1:347,] > current=d3[2:348,] >d3.new= cbind(current,prev) > d3.new<-data.frame(d3.new) Fit a linear model of current~prev. > lm1=lm(current~prev,data=d3.new) > summary(lm1) Call: lm(formula = current ~ prev, data = d3.new) Residuals: Min 1Q Median 3Q Max -478.437 -13.663 -3.824 17.329 394.293 Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 8.806181 6.236058 1.412 0.159 prev 0.998917 0.003629 275.268 <2e-16 *** --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 Residual standard error: 85.48 on 345 degrees of freedom Multiple R-squared: 0.9955, Adjusted R-squared: 0.9955 F-statistic: 7.577e+04 on 1 and 345 DF, p-value: < 2.2e-16 Plot fitted values vs. studentized residuals for lm1. > plot(fitted.values(lm1),rstudent(lm1),xlab="Fitted Values",ylab="Studentized Residuals",main="Fitted Values vs. Studentized Residuals for lm1") Test for nonconstant variance of residuals. H0: The variance of the disturbances is constant. Ha: The variance of the disturbances is not constant. > library(car) > ncv.test(lm1) Non-constant Variance Score Test Variance formula: ~ fitted.values Chisquare = 511.0989 Df = 1 p = 0 Conclusion: the constant variance assumption has been violated. Correct the violation of the constant variance assumption by taking the log of y-values. > lm1=lm(I(log(CURRENT/PREV))~I(1/PREV),data=d1) > plot(fitted.values(lm1),rstudent(lm1)) > ncv.test(lm1) Non-constant Variance Score Test Variance formula: ~ fitted.values Chisquare = 3.131124 Df = 1 p = 0.07681076 > Yes, according to the non-constant variance test, after dividing through the original model by xi and taking log(y*), the non-constant variance problem has been eliminated. Exercise 4 Create a matrix [MONTH, PETROIMP] > MONTH=1:84 > PETROIMP=d4[,1] > d4.new=cbind(MONTH,PETROIMP) > d4.new<-data.frame(d4.new) Create a regression model, PETROIMP~MONTH > plot(d4.new[,1],d4.new[,2],xlab="MONTH",ylab="PETROIMP") > lm1=lm(PETROIMP~MONTH,data=d4.new) > summary(lm1) Call: lm(formula = PETROIMP ~ MONTH, data = d4.new) Residuals: Min 1Q Median 3Q Max -82.7958 -7.6434 0.3294 12.6561 38.2647 Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 176.02846 4.38247 40.17 <2e-16 *** MONTH 1.02460 0.08957 11.44 <2e-16 *** --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 Residual standard error: 19.9 on 82 degrees of freedom Multiple R-squared: 0.6148, Adjusted R-squared: 0.6101 F-statistic: 130.9 on 1 and 82 DF, p-value: < 2.2e-16 > abline(lm1) Check for outliers and influential points first by plotting studentized residuals vs. fitted values. > plot(cooks.distance(lm1),type=”h”) According to the plot of Cook’s distances, observation #19 appears to be an unusual or influential point. In 1992, it was around the time of the first Gulf War and the oil crisis. The unusually low petroleum import may have been due to these factors. Exercise 6 According to the histogram plots of studentized residuals and Cook’s distances, observations #8 and #10, Minnesota Twins and Oakland Athletics, respectively, appear to be winning more games than expected given the size of their payroll. Exercise 13 > lm1=lm(TIME~EXPER+NUMBER,data=d13) > summary(lm1) Call: lm(formula = TIME ~ EXPER + NUMBER, data = d13) Residuals: Min 1Q Median 3Q Max -100.071 -59.758 -7.368 47.828 120.619 Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) -179.288 125.580 -1.428 0.165 EXPER 10.189 13.119 0.777 0.444 NUMBER 32.969 1.436 22.962 <2e-16 *** --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 Residual standard error: 68.41 on 27 degrees of freedom Multiple R-squared: 0.9514, Adjusted R-squared: 0.9478 F-statistic: 264.3 on 2 and 27 DF, p-value: < 2.2e-16 Linearity assumption H0: The relationship is linear. Ha: The relationship is not linear. > lm.a=lm(TIME~EXPER,data=d13) > lm.b=lm(TIME~factor(EXPER),data=d13) > anova(lm.a,lm.b) Analysis of Variance Table Model 1: TIME ~ EXPER Model 2: TIME ~ factor(EXPER) Res.Df RSS Df Sum of Sq F Pr(>F) 1 28 2594001 2 25 2450637 3 143364 0.4875 0.694 Conclusion: Fail to reject H0. The linearity assumption has not been violated here. H0: The relationship is linear. Ha: The relationship is not linear. > lm.a=lm(TIME~NUMBER,data=d13) > lm.b=lm(TIME~factor(NUMBER),data=d13) > anova(lm.a,lm.b) Analysis of Variance Table Model 1: TIME ~ NUMBER Model 2: TIME ~ factor(NUMBER) Res.Df RSS Df Sum of Sq F Pr(>F) 1 28 129188 2 5 364 23 128824 76.937 6.497e-05 *** Conclusion: Reject H0. The linearity assumption has been violated here. Center the variable “NUMBER.” > lm2=lm(TIME~EXPER+NUMBER_CENTER+I(NUMBER_CENTER^2),data=d13) > summary(lm2) Call: lm(formula = TIME ~ EXPER + NUMBER_CENTER + I(NUMBER_CENTER^2), data = d13) Residuals: Min 1Q Median 3Q Max -29.0858 -7.7640 -0.4909 5.9471 35.1318 Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 65.7292 29.7670 2.208 0.03626 * EXPER 0.3712 2.9339 0.127 0.90030 NUMBER_CENTER 60.7731 20.4422 2.973 0.00628 ** I(NUMBER_CENTER^2) 230.5117 10.0542 22.927 < 2e-16 *** --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 Residual standard error: 15.14 on 26 degrees of freedom Multiple R-squared: 0.9977, Adjusted R-squared: 0.9974 F-statistic: 3776 on 3 and 26 DF, p-value: < 2.2e-16 H0: The variable NUMBER_CENTER^2 does not significantly to contribute accounting for the variations observed in TIME. Ha: The variable NUMBER_CENTER^2 significantly contributes to accounting for the variations observed in TIME. > anova(lm1,lm2) Analysis of Variance Table Model 1: TIME ~ EXPER + NUMBER_CENTER Model 2: TIME ~ EXPER + NUMBER_CENTER + I(NUMBER_CENTER^2) Res.Df RSS Df Sum of Sq F Pr(>F) 1 27 126365 2 26 5956 1 120409 525.65 < 2.2e-16 *** --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 Conclusion: Reject H0. Constant variance assumption H0: The variance for the residuals is constant. Ha: The variance for the residuals is not constant. > library(car) > ncv.test(lm2) Non-constant Variance Score Test Variance formula: ~ fitted.values Chisquare = 0.1636715 Df = 1 p = 0.6857985 Conclusion: Fail to reject H0. The variance for the residuals is constant. Normal distribution of disturbances H0: The distribution for the residuals is normal. Ha: The distribution for the residuals is not normal. > shapiro.test(lm2$residuals) Shapiro-Wilk normality test data: lm2$residuals W = 0.9592, p-value = 0.2958 Conclusion: Fail to reject H0. The distribution for the residuals is normal. Influential observations > plot(dffits(lm2),type="h") Autocorrelations H0: The disturbances are independent. Ha: The disturbances are not independent. > durbin.watson(lm2) lag Autocorrelation D-W Statistic p-value 1 -0.177793978 2.334587 0.588 After checking for and correcting violations to the linearity assumption, nonconstant variance assumption, normality assumption, autocorrelations, and influential observations, the model TIME ~ EXPER + NUMBER_CENTER + I(NUMBER_CENTER^2) appears to be the best model.