Linear Regression

Reference: Text pages 440-456

Introduction

Curve fitting is very important to engineers, who often collect vast quantities of data in order to define more fundamental relationships for use in engineering design. We often fit curves to data to describe the general trend in the data. Least squares regression is the most common technique of curve fitting.

Least Squares Regression

Consider the data shown in the figure below. These data were collected through laboratory trials, and each point has some error associated with it. As an engineer, you would like to fit a line through these data to describe the relationship mathematically. Least squares is the most common technique for computing the coefficients of the best-fit line.

[Figure: scatter plot of the laboratory data (x from 0 to 10, y from 0 to 12) with a best-fit line]

What we desire is an equation that gives the best approximation of each y value over the range of x and y values. Thus, the equation describing the true value of each y point is

    y = a_0 + a_1 x + e                                                  (Eq. 1)

where
    y   = true value of y
    a_0 = intercept of the line
    a_1 = slope of the line
    e   = residual error between the linear model and the true y value

The best-fit line is the one that minimizes the squared error (i.e., e^2) between each pair of predicted and measured values of y. Mathematically, we can define the sum of the squared residual errors, S_r, as

    S_r = \sum_{i=1}^{n} e_i^2 = \sum_{i=1}^{n} (y_{i,measured} - y_{i,model})^2        (Eq. 2)

where n is the number of data points. Since y_{i,model} = a_0 + a_1 x_i, we can substitute this into the previous equation to get

    S_r = \sum_{i=1}^{n} (y_i - a_0 - a_1 x_i)^2                         (Eq. 3)

How do you determine the coefficients a_0 and a_1 for the best-fit line? We take the partial derivative of S_r with respect to each coefficient, set the partial derivative to zero, and solve for the coefficients. Taking the partial derivative of S_r with respect to a_0 gives

    \partial S_r / \partial a_0 = \sum_{i=1}^{n} 2 (y_i - a_0 - a_1 x_i) (-1)        (Eq. 4)

    \partial S_r / \partial a_0 = -2 \sum_{i=1}^{n} (y_i - a_0 - a_1 x_i)            (Eq. 5)

Now, taking the partial derivative of S_r with respect to a_1 gives

    \partial S_r / \partial a_1 = \sum_{i=1}^{n} 2 (y_i - a_0 - a_1 x_i) (-x_i)      (Eq. 6)
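The quantity being minimized (Eq. 3) can be coded directly. The sketch below is an illustration only; the function name and the choice of plain Python lists are my own, not part of the handout:

```python
def sum_squared_residuals(x, y, a0, a1):
    """Eq. 3: S_r = sum of (y_i - a0 - a1*x_i)^2 for a candidate line."""
    return sum((yi - a0 - a1 * xi) ** 2 for xi, yi in zip(x, y))

# The best-fit pair (a0, a1) is the one that makes this sum as small as possible.
# For data lying exactly on y = 2x, the line a0 = 0, a1 = 2 gives S_r = 0.
print(sum_squared_residuals([1, 2, 3], [2, 4, 6], 0, 2))  # 0
```

Evaluating S_r for a few candidate lines by hand is a good way to see why the partial derivatives in Eq. 4-6 are needed: trial and error cannot guarantee the minimum.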
    \partial S_r / \partial a_1 = -2 \sum_{i=1}^{n} (y_i - a_0 - a_1 x_i) x_i        (Eq. 7)

Now, setting the partial derivatives in Eq. 5 and Eq. 7 to zero gives

    0 = -2 \sum_{i=1}^{n} (y_i - a_0 - a_1 x_i)                          (Eq. 8)

    0 = -2 \sum_{i=1}^{n} (y_i - a_0 - a_1 x_i) x_i                      (Eq. 9)

Rearranging Eq. 8 and 9 gives

    n a_0 + (\sum_{i=1}^{n} x_i) a_1 = \sum_{i=1}^{n} y_i                (Eq. 10)

    (\sum_{i=1}^{n} x_i) a_0 + (\sum_{i=1}^{n} x_i^2) a_1 = \sum_{i=1}^{n} x_i y_i        (Eq. 11)

This gives us two equations in two unknowns. Note that we know the x and y values, but we do not know the coefficients a_0 and a_1. We can solve Eq. 10 and 11 explicitly for the coefficients a_1 and a_0:

    a_1 = [n \sum_{i=1}^{n} x_i y_i - \sum_{i=1}^{n} x_i \sum_{i=1}^{n} y_i] / [n \sum_{i=1}^{n} x_i^2 - (\sum_{i=1}^{n} x_i)^2]        (Eq. 12)

    a_0 = \bar{y} - a_1 \bar{x}                                          (Eq. 13)

Example: Compute the coefficients of the best-fit line for the data shown below.

    x_i         y_i         x_i * y_i     x_i^2
    1           0.5         0.5           1
    2           2.5         5             4
    3           2           6             9
    4           4           16            16
    5           3.5         17.5          25
    6           6           36            36
    7           5.5         38.5          49
    -------------------------------------------
    Sum = 28    Sum = 24    Sum = 119.5   Sum = 140
    Avg = 4     Avg = 3.43

Solution: First, compute columns 3 and 4 and the sum of each column. Next, plug the appropriate values into Eq. 12 and 13 to compute a_1 and a_0:

    a_1 = [7(119.5) - 28(24)] / [7(140) - (28)^2] = 0.8392857

    a_0 = \bar{y} - a_1 \bar{x} = 3.42857 - 0.8392857(4) = 0.07142857

Thus, the equation of the best-fit line is

    y = 0.07142857 + 0.8392857 x
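The solution above can be checked by coding Eq. 12 and 13 directly. This is a minimal sketch, assuming the example data as given; the function name is my own:

```python
def linear_fit(x, y):
    """Least-squares slope and intercept from the normal equations (Eq. 12 and 13)."""
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    sum_xy = sum(xi * yi for xi, yi in zip(x, y))
    sum_x2 = sum(xi ** 2 for xi in x)
    a1 = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)  # Eq. 12
    a0 = sum_y / n - a1 * sum_x / n                                # Eq. 13
    return a0, a1

# Data from the worked example above
x = [1, 2, 3, 4, 5, 6, 7]
y = [0.5, 2.5, 2, 4, 3.5, 6, 5.5]
a0, a1 = linear_fit(x, y)
print(a0, a1)  # a0 ≈ 0.07142857, a1 ≈ 0.8392857
```

The computed intercept and slope match the hand calculation, which is a useful sanity check on the column sums in the table.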
Quantification of Error of Linear Regression

The sum of squares of the residuals, S_r, is defined in Eq. 14. This is the error between the predicted and measured y values, squared and summed over all data points. If the model fits the data perfectly, S_r is zero. If the model fits the data poorly, S_r is very high.

    S_r = \sum_{i=1}^{n} (y_i - a_0 - a_1 x_i)^2                         (Eq. 14)

where
    n   = number of data points
    y_i = measured y values
    x_i = measured x values
    a_0 = intercept of best-fit line
    a_1 = slope of best-fit line

The standard deviation of the best-fit regression line can be computed by

    s_{y/x} = \sqrt{ S_r / (n - 2) }                                     (Eq. 15)

The standard deviation of the regression line can be interpreted in the same way as the standard deviation of a population. Thus, for approximately normally distributed residuals (the empirical rule):

A. Approximately 68% of the data points will fall within ±1 standard deviation of the line.
B. Approximately 95% of the data points will fall within ±2 standard deviations of the line.
C. Essentially all of the measurements will fall within ±3 standard deviations of the line.

The total sum of squares of the error between the data and the mean of the data is given by

    S_t = \sum_{i=1}^{n} (y_i - \bar{y})^2                               (Eq. 16)

S_t describes the total error (squared) you would have if you represented the relationship between x and y using the mean value of y rather than a regression line. The quantity (S_t - S_r) is therefore the improvement, or reduction in error, gained by describing the relationship of y to x with a regression line rather than with just the mean value of y.

The coefficient of determination, r^2, is another way to characterize the error between predicted and measured y values. It can be computed by

    r^2 = (S_t - S_r) / S_t                                              (Eq. 17)

The coefficient of determination can be interpreted as the fraction of the variation in the measured y values (about their mean) that is explained by the best-fit regression line. The r^2 value ranges from 0 to 1.0. A value of 0 means that none of the variation is explained by the regression, while a value of 1.0 means that 100% of the variation is explained by the line; in that case the line passes through every data point.
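Eq. 14-17 can be applied to the worked example to see how well the fitted line does. This is a sketch under the assumption that the coefficients from the example are used; the function name and rounded expected values are mine:

```python
import math

def regression_error_stats(x, y, a0, a1):
    """S_r (Eq. 14), standard error s_{y/x} (Eq. 15), S_t (Eq. 16), r^2 (Eq. 17)."""
    n = len(x)
    y_mean = sum(y) / n
    s_r = sum((yi - a0 - a1 * xi) ** 2 for xi, yi in zip(x, y))  # Eq. 14
    s_t = sum((yi - y_mean) ** 2 for yi in y)                    # Eq. 16
    s_yx = math.sqrt(s_r / (n - 2))                              # Eq. 15
    r2 = (s_t - s_r) / s_t                                       # Eq. 17
    return s_r, s_yx, s_t, r2

# Data and coefficients from the worked example
x = [1, 2, 3, 4, 5, 6, 7]
y = [0.5, 2.5, 2, 4, 3.5, 6, 5.5]
s_r, s_yx, s_t, r2 = regression_error_stats(x, y, 0.07142857, 0.8392857)
# S_r ≈ 2.9911, s_{y/x} ≈ 0.7735, S_t ≈ 22.7143, r^2 ≈ 0.868
```

An r^2 of about 0.87 says that roughly 87% of the variation in y about its mean is explained by the line, with the remaining variation left in the residuals.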