# Data Project Writeup

## Statistics 121 with Armagan at Duke University

Qihua F.


### Predicting Ozone Levels in Los Angeles in 1976

STA 121 (Armagan), Qihua Fan, 5/4/2010

Ozone is a secondary pollutant, formed through interactions of various chemical and meteorological factors. The relationships between ozone and various precursor variables are locale-specific; relationships observed in one area should not be readily extrapolated to another. A multiple regression model was created to predict the daily maximum one-hour-average ozone reading in Los Angeles from data collected on various meteorological variables throughout the year 1976. Local relative humidity, local air temperature, month of the year, the pressure gradient, wind speed, and visibility were found to contribute significantly to the model. The final regression model predicts the daily maximum one-hour-average ozone reading in the Los Angeles basin area in 1976 fairly well: the 95% prediction intervals successfully capture the actual ozone level readings in the validation data set.

#### Introduction

Ozone is a secondary pollutant, formed through interactions of various factors, including sunlight intensity, spectral distribution, atmospheric mixing and processing on cloud and aerosol particles, the concentrations of ozone precursors in the air, and the chemical reaction rates of the precursors. There are several layers of atmosphere; the one of most interest to ozone studies is the troposphere, the major layer of atmosphere lying closest to the earth’s surface. The lower sublayer of the troposphere is called the planetary boundary layer, or PBL. Various meteorological processes affect ozone concentration in the PBL. Inhibition of vertical atmospheric mixing and light winds allow pollutants to accumulate in the PBL over urban areas.
Physical obstacles, such as mountain barriers, can inhibit horizontal atmospheric mixing, resulting in a higher frequency and duration of days with high ozone concentrations in basin-like terrain, such as Los Angeles. Increased wind speed aids in the dispersal of pollutants. For example, in southern urban areas such as Houston, TX and Atlanta, GA, ozone concentrations have been observed to decrease with increasing wind speed. In addition, the local rate of ozone formation depends on temperature, the availability of ultraviolet radiation capable of initiating ozone formation reactions, and the concentrations of ozone precursors; high temperatures and sunlight enhance the photochemical activity of ozone precursors.

However, the relationships between ozone and various precursor variables are locale-specific; relationships observed in one area should not be readily extrapolated to another. For example, ozone concentrations were observed to increase with temperature in Baltimore, MD, but no correlation was observed between ozone levels and air temperature in Phoenix, AZ.

The purpose of this research is to generate a multiple regression model that predicts the daily maximum one-hour-average ozone reading in Los Angeles, given data collected on various relevant variables throughout the year 1976. Previous studies suggest that ozone concentration is strongly related to meteorological variables including temperature, wind speed, relative humidity, and cloud cover (visibility). Previous investigations have also shown that ozone levels vary through the seasons. It was hypothesized that the variables contributing significantly to the prediction of daily ozone readings in Los Angeles would include month of the year, local temperature, wind speed, relative humidity, and visibility.

#### Description of Ozone Data and Meteorological Data

All statistical analyses were performed in the R environment.
The data on Los Angeles ozone pollution in 1976 were obtained from the “mlbench” package as the data frame “Ozone”, which contains 366 observations (one observation corresponds to one day) on 13 variables. A significance level of α = 0.05 was used throughout.

Dependent variable: Ozone concentration. The dependent variable is a continuous variable, measuring the daily maximum one-hour-average ozone reading in Los Angeles, given as ozone mixing ratios in parts per million (ppm).

#### Predictor Variables

Month of the Year. This is a categorical predictor variable, coded as 1=January, 2=February… 12=December. Ozone levels have been shown through previous studies to vary seasonally. Ozone precursors such as carbon monoxide and nitrogen dioxide in the urban areas of southern California tend to accumulate when trapped over the city by inversion layers, especially during the colder seasons. Lower wind speeds reduce the dilution and dispersal of pollutants, while low temperatures reduce vertical atmospheric mixing and cause near-surface inversions to be stronger and last longer. Ozone itself follows the opposite seasonal pattern: during the summer months, a greater availability of sunlight is conducive to ozone production.

Day of the Month. This original variable was removed from the data frame (see below).

Day of the Week. This is a categorical predictor variable, coded as 1=Monday, 2=Tuesday… 7=Sunday. Ozone precursors such as carbon monoxide and nitrogen dioxide in the cities of southern California are mainly produced by vehicular sources. Traffic conditions and vehicle use are more likely to exhibit a pattern over the span of a week than over a month. For this reason, day of the month was not considered as a predictor variable.

VDHT. The 500 millibar pressure height (m) measured at Vandenberg AFB is a continuous predictor variable.
The 500 millibar (mb) pressure height gives the distance from sea level to approximately the middle of the atmosphere, at which height the pressure is 500 mb. A low 500-mb height indicates low pressure at earth’s surface, whereas a high 500-mb height indicates high pressure at earth’s surface. The 500-mb pressure height also gives information about atmospheric temperature and surface weather. Generally, areas with high pressure have clearer, sunnier weather than areas with low pressure, which tend to have cloudier, stormier weather.

WDSP. Wind speed (mph) at Los Angeles International Airport (LAX) is a continuous predictor variable. Bloomfield, Royle, Steinberg, and Yang (1996), who characterized the relationship between ozone concentration and various meteorological variables, suggested that ozone is related to wind speed through a simple nonlinear function.

HMDT. Percent humidity (%) at LAX is a continuous predictor variable. Graphical results from Bloomfield, Royle, Steinberg, and Yang (1996) suggest that ozone concentration is linearly related to relative humidity. Previous studies observed that low relative humidity was accompanied by relatively high ozone mixing ratios.

SBTP. Temperature, in degrees Fahrenheit, at Sandburg Air Force Base, CA is a continuous predictor variable. Graphical results from Bloomfield, Royle, Steinberg, and Yang (1996) suggest that ozone concentration is most strongly related to the local air temperature through a polynomial function.

EMTP. Temperature, in degrees Fahrenheit, at El Monte, CA, is a continuous predictor variable.

IBTP. Inversion base temperature at LAX, in degrees Fahrenheit, is a continuous predictor variable. An inversion describes a deviation from the normal change of a meteorological property with increasing altitude. An inversion layer is a thin layer of atmosphere in which the temperature decrease with increasing altitude is much less than normal.

IBHT. Inversion base height (ft.)
at LAX is a continuous predictor variable. The inversion base height is the height from the earth’s surface to the bottom of the inversion cap. This distance is also called the mixing depth because it is the estimated height to which pollutants released from earth’s surface mix.

DGPG. The pressure gradient (mm Hg) from LAX to Daggett, CA, is a continuous predictor variable. The pressure-gradient force has both vertical and horizontal components. The vertical component is generally in balance with gravitational forces. The horizontal differences in pressure are the result of thermal contrasts or features of physical terrain, such as a mountain barrier. The horizontal component drives air movement from areas of high pressure towards areas of lower pressure, though other forces usually prevent the air from moving directly across isobars. The pressure gradient force is also inversely proportional to the air density, a relationship essential in understanding the behavior of upper winds.

VSTY. Visibility (miles) measured at LAX is a continuous predictor variable, giving the distance at which an object or light can be clearly discerned. Visibility was shown to improve the linear fit of the model by Bloomfield, Royle, Steinberg, and Yang (1996).

#### Modeling Ozone Concentrations

The data frame Ozone was cleaned by removing all observations containing missing entries using the function remove.NA {agce}. This left 203 observations in total. A random subset of 180 observations was used as the training set; the remaining 23 observations were used as the validation set, or hold-out sample. In addition to the initial 12 variables, quadratic terms were created for the continuous predictor variables, and all two-way interactions except those between categorical variables were considered in the variable selection process. The multiple regression model was developed using only the observations in the training data set.
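
The cleaning and splitting steps described above can be sketched in base R. This is only an illustration under stated assumptions: the data frame `ozone` is a synthetic stand-in with made-up values (with the package installed, the real data could be loaded with `data("Ozone", package = "mlbench")`), `na.omit` is swapped in for the helper named in the text, and the seed is arbitrary.

```r
# Synthetic stand-in for the 366-row ozone data frame (hypothetical values).
set.seed(121)
n <- 366
ozone <- data.frame(
  Y     = rgamma(n, shape = 3, scale = 4),
  MONTH = rep(1:12, length.out = n),
  HMDT  = runif(n, 20, 90),
  EMTP  = runif(n, 30, 95),
  WDSP  = rpois(n, 5)
)
# Knock out some entries at random so there is something to clean.
ozone$EMTP[sample(n, 100)] <- NA
ozone$HMDT[sample(n, 40)]  <- NA

# Keep complete observations only (base R substitute for the helper in the text).
clean <- na.omit(ozone)

# Add quadratic terms for the continuous predictors.
clean$EMTP.SQR <- clean$EMTP^2
clean$WDSP.SQR <- clean$WDSP^2

# Random split: hold out 23 rows for validation and train on the rest.
n_valid    <- 23
valid_rows <- sample(nrow(clean), n_valid)
Validation <- clean[valid_rows, ]
Sample     <- clean[-valid_rows, ]
```

Because the missing entries here are generated at random, the number of complete rows will not match the 203 reported in the text; only the size of the hold-out sample is fixed.
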
Variables were selected using Akaike’s information criterion (AIC). The null model included no predictor variables. The full model included all quadratic terms created for the continuous predictor variables, all two-way interactions except those between categorical variables, and the initial unmanipulated variables. Because the meteorological variables are related to one another, the forward direction was chosen in the stepwise selection process to achieve a simpler model. The model resulting from this step had an AIC score of 480.07 and is given below:

`lm(formula = Y ~ SANDBURG.SQR + HMDT + EMTP.SQR + factor(MONTH) + DGPG.SQR + WDSP.SQR + VSTY + INV.TEMP, data = Sample)`

Redundant variables were then removed from the model produced by the variable selection algorithm. Because EMTP.SQR already accounts for the local air temperature, and in order to reduce the generalized variance-inflation factor (GVIF) score, SANDBURG.SQR was removed from the regression model. The GVIF for SANDBURG.SQR was 12.31. INV.TEMP had a GVIF score of 8.94 and a p-value of 0.2122 for the significance of its coefficient estimate, and was also removed from the model. The importance of the variables and the significance of their contribution to the predictive power of the model were assessed using nested F tests. Neither the removal of SANDBURG.SQR (p=0.9914) nor the removal of INV.TEMP (p=0.2090) significantly reduced the predictive power of the model.

After examining the plots of residuals vs. each predictor variable in the model, DGPG.SQR and WDSP.SQR were transformed to achieve more randomly scattered residual plots. The variable DGPG.SQR was raised to the 1/3 power; WDSP.SQR was raised to the 1/5 power.
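
As a rough illustration of the selection and pruning steps just described, the sketch below runs forward AIC selection with `step()`, a nested F test with `anova()`, and a hand-computed variance-inflation factor. Everything about the data is an assumption: `Sample` here is synthetic, `NOISE` is a hypothetical irrelevant predictor, and the candidate set is far smaller than the one used in the writeup. For a single-coefficient term the GVIF reduces to the ordinary VIF, which is what is computed here.

```r
# Synthetic training data (hypothetical values, not the 1976 measurements).
set.seed(1976)
n <- 180
Sample <- data.frame(
  HMDT  = runif(n, 20, 90),
  EMTP  = runif(n, 30, 95),
  VSTY  = runif(n, 0, 300),
  NOISE = rnorm(n)            # hypothetical predictor unrelated to Y
)
Sample$EMTP.SQR <- Sample$EMTP^2
Sample$Y <- 2 + 0.02 * Sample$HMDT + 0.001 * Sample$EMTP.SQR + rnorm(n)

# Forward stepwise selection by AIC, starting from the intercept-only model.
null_model <- lm(Y ~ 1, data = Sample)
selected <- step(
  null_model,
  scope     = list(lower = ~ 1,
                   upper = ~ HMDT + EMTP + EMTP.SQR + VSTY + NOISE),
  direction = "forward",
  trace     = 0
)

# Nested F test: does dropping NOISE significantly reduce the fit?
full    <- lm(Y ~ HMDT + EMTP.SQR + NOISE, data = Sample)
reduced <- lm(Y ~ HMDT + EMTP.SQR, data = Sample)
f_test  <- anova(reduced, full)
p_value <- f_test[["Pr(>F)"]][2]

# Variance-inflation factor for EMTP.SQR in the full model:
# regress it on the other predictors and use 1 / (1 - R^2).
r2_emtp  <- summary(lm(EMTP.SQR ~ HMDT + NOISE, data = Sample))$r.squared
vif_emtp <- 1 / (1 - r2_emtp)
```

A large `p_value` in the nested test is read the same way as in the text: the dropped term does not significantly contribute to the predictive power of the model.
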
The model resulting from these transformations is given below:

`lm(formula = Y ~ HMDT + EMTP.SQR + factor(MONTH) + I(DGPG.SQR^(1/3)) + I(WDSP.SQR^(1/5)) + VSTY, data = Sample)`

This model then underwent regression diagnostics to test for violations of the linear regression assumptions.

- The linearity assumption was tested using the pure error lack of fit test; the linearity assumption was not violated.
- The constant variance of the residuals was tested using the function ncv.test {car}. The test for constant variance yielded a Chi-square value of 29.35, p<0.0001. The dependent variable Y was transformed to Y^(1/2) to achieve constant variance in the residuals (Chi-square = 3.48, p=0.062).
- The normality of the residuals was tested using the Shapiro-Wilk normality test, through the function shapiro.test {stats}. The Shapiro-Wilk test yielded a W statistic of 0.9951, p=0.8171; the normality assumption was not violated.
- The independence of the residuals was tested using the Durbin-Watson test for autocorrelation. The lag-1 test for autocorrelation yielded a DW statistic of 2.065, p=0.676; the independence assumption for the residuals was not violated.

The model was examined for leverage points and outliers by calculating Cook’s Distance for each observation. The cutoff value for Cook’s Distance is given as F(α=0.05, 6, 175)=2.15. There were no observed outliers or leverage points (see Figure 1).

The final model for predicting the daily maximum one-hour-average ozone reading in Los Angeles has an AIC score of 273.2754. The final model has a multiple R-squared value of 0.8439, indicating that 84.39% of the variation in the dependent variable is explained by the model. The model also has an adjusted R-squared value of 0.8286.
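
Some of the diagnostics above can be approximated with base R alone. The sketch below is an assumption-laden illustration rather than the analysis itself: the data are synthetic, the model has only two predictors, the Durbin-Watson statistic is computed by hand without its p-value, and the constant-variance test is left out because it relies on the car package. The Cook's Distance cutoff reuses the degrees of freedom quoted in the text.

```r
# Synthetic training data (hypothetical values).
set.seed(2010)
n <- 180
Sample <- data.frame(HMDT = runif(n, 20, 90), EMTP = runif(n, 30, 95))
Sample$Y <- (1 + 0.01 * Sample$HMDT + 0.0006 * Sample$EMTP^2 +
               rnorm(n, sd = 0.5))^2

# Fit on the square-root scale, mirroring the transformed response.
fit <- lm(I(Y^(1/2)) ~ HMDT + I(EMTP^2), data = Sample)
res <- residuals(fit)

# Normality of the residuals: Shapiro-Wilk test from the stats package.
sw <- shapiro.test(res)

# Lag-1 Durbin-Watson statistic computed directly from the residuals;
# values near 2 suggest no first-order autocorrelation.
dw_stat <- sum(diff(res)^2) / sum(res^2)

# Cook's Distance for each observation, compared with the F critical
# value quoted in the writeup, F(alpha = 0.05, 6, 175).
cooks   <- cooks.distance(fit)
cutoff  <- qf(0.95, df1 = 6, df2 = 175)
flagged <- which(cooks > cutoff)
```

An empty `flagged` vector corresponds to the conclusion that no observation exceeds the cutoff.
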
The equation of the final model is given:

`lm(formula = I(Y^(1/2)) ~ HMDT + EMTP.SQR + factor(MONTH) + I(DGPG.SQR^(1/3)) + I(WDSP.SQR^(1/5)) + VSTY, data = Sample)`

A summary of the final regression model is given below in Table 1.

Residuals:

| Min | 1Q | Median | 3Q | Max |
|---|---|---|---|---|
| -1.38889 | -0.36666 | 0.03539 | 0.33191 | 1.18923 |

Coefficients:

| Term | Estimate | Std. Error | t value | Pr(>\|t\|) | Signif. |
|---|---|---|---|---|---|
| (Intercept) | 1.236e+00 | 3.616e-01 | 3.417 | 0.000800 | `***` |
| HMDT | 7.941e-03 | 2.748e-03 | 2.889 | 0.004387 | `**` |
| EMTP.SQR | 6.092e-04 | 4.638e-05 | 13.134 | < 2e-16 | `***` |
| factor(MONTH)2 | 5.688e-01 | 2.085e-01 | 2.728 | 0.007062 | `**` |
| factor(MONTH)3 | 8.058e-01 | 1.830e-01 | 4.402 | 1.93e-05 | `***` |
| factor(MONTH)4 | 1.094e+00 | 1.843e-01 | 5.934 | 1.72e-08 | `***` |
| factor(MONTH)5 | 9.280e-01 | 2.319e-01 | 4.002 | 9.53e-05 | `***` |
| factor(MONTH)6 | 7.396e-01 | 2.098e-01 | 3.524 | 0.000551 | `***` |
| factor(MONTH)7 | 8.092e-01 | 2.531e-01 | 3.196 | 0.001671 | `**` |
| factor(MONTH)8 | 7.143e-01 | 2.253e-01 | 3.170 | 0.001818 | `**` |
| factor(MONTH)9 | 1.543e-01 | 2.299e-01 | 0.671 | 0.503216 | |
| factor(MONTH)10 | 2.461e-01 | 1.979e-01 | 1.244 | 0.215407 | |
| factor(MONTH)11 | -1.889e-01 | 1.895e-01 | -0.997 | 0.320211 | |
| factor(MONTH)12 | -3.508e-01 | 1.832e-01 | -1.915 | 0.057242 | `.` |
| I(DGPG.SQR^(1/3)) | -5.297e-02 | 9.538e-03 | -5.553 | 1.12e-07 | `***` |
| I(WDSP.SQR^(1/5)) | -1.861e-01 | 9.797e-02 | -1.900 | 0.059207 | `.` |
| VSTY | -1.170e-03 | 5.841e-04 | -2.004 | 0.046780 | `*` |

Signif. codes: 0 `***` 0.001 `**` 0.01 `*` 0.05 `.` 0.1 ` ` 1

Residual standard error: 0.4915 on 163 degrees of freedom. Multiple R-squared: 0.8439, Adjusted R-squared: 0.8286. F-statistic: 55.09 on 16 and 163 DF, p-value: < 2.2e-16.

Figure 1. Histogram plot of calculated Cook’s Distance for each observation in the training data set against each observation’s respective index. The critical value for Cook’s Distance is calculated as F(α=0.05, 6, 175)=2.15. There are no observed outliers or leverage points.

Using the multiple regression model given above to predict the dependent variable values, the resulting regression model had an F-statistic of 4.503 on 15 and 7 degrees of freedom, with a p-value of 0.026.
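
One way to obtain 95% prediction intervals for a hold-out sample when the response was modeled on the square-root scale is sketched below. It is not the writeup's code: the data are synthetic, the model is reduced to two predictors, and squaring the interval bounds to return to the original scale is an assumption about how the back-transformation could be done.

```r
# Synthetic data with a 180/23 training/validation split (hypothetical values).
set.seed(54)
n <- 203
dat <- data.frame(HMDT = runif(n, 20, 90), EMTP = runif(n, 30, 95))
dat$Y <- (1 + 0.01 * dat$HMDT + 0.0006 * dat$EMTP^2 + rnorm(n, sd = 0.5))^2
valid_rows <- sample(n, 23)
Sample     <- dat[-valid_rows, ]
Validation <- dat[valid_rows, ]

# Fit on the square-root scale using the training rows only.
fit <- lm(I(Y^(1/2)) ~ HMDT + I(EMTP^2), data = Sample)

# 95% prediction intervals for the validation rows, still on the sqrt scale.
pred_sqrt <- predict(fit, newdata = Validation,
                     interval = "prediction", level = 0.95)

# Back-transform to the original scale; the lower bound is clamped at
# zero first so that squaring keeps the bounds in order.
pred <- cbind(
  fit = pred_sqrt[, "fit"]^2,
  lwr = pmax(pred_sqrt[, "lwr"], 0)^2,
  upr = pred_sqrt[, "upr"]^2
)

# Share of validation observations falling inside their interval.
covered  <- Validation$Y >= pred[, "lwr"] & Validation$Y <= pred[, "upr"]
coverage <- mean(covered)
```

A `coverage` of 1 would correspond to every observed value in the validation set lying inside its interval.
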
The significant F-statistic indicates that the model is able to predict the dependent variable responses fairly accurately. The prediction intervals capture all of the actual dependent variable values, as shown in Figure 2.

Figure 2. Plot of the observed daily maximum one-hour-average ozone readings (OZONE.CONC) from the validation data set against their respective index numbers (INDEX). The dashed lines show the 95% prediction interval estimates on the predicted values.

#### Discussion

The final multiple regression model for predicting the daily maximum one-hour-average ozone reading in Los Angeles based on data collected throughout the year 1976 is of the form:

$$Y^{1/2} \sim \mathrm{HMDT} + \mathrm{EMTP}^{2} + \mathrm{factor}(\mathrm{MONTH}) + \mathrm{DGPG}^{2/3} + \mathrm{WDSP}^{2/5} + \mathrm{VSTY}$$

No two-way interactions between the predictor variables significantly contributed to the predictive power of the regression model. Because it was necessary to perform power transformations on the dependent variable and some of the independent variables, it is difficult to interpret the coefficient estimates in terms of the relationship between ozone levels and various meteorological factors.

Although the predictor variables HMDT and EMTP.SQR had coefficient estimates close to 0 (0.0079 and 0.0006, respectively; see Table 1), both were statistically significant (p = 0.0044 and p < 0.0001, respectively) because their standard errors were minuscule. The positive coefficient estimate for HMDT appears to be inconsistent with the finding of Bloomfield, Royle, Steinberg, and Yang (1996) that ozone level is inversely correlated with relative humidity, but this apparent inconsistency may be explained by regional differences. The study by Bloomfield, Royle, Steinberg, and Yang (1996) examined the effect of meteorological factors on ozone levels in the Chicago area, whereas the purpose of this project is to predict ozone levels from local conditions, including relative humidity, in the Los Angeles basin area.
It was earlier noted that relationships observed between ozone precursor factors and ozone levels in one area should not be readily extrapolated to another area.

The ozone levels showed a gradual but significant increase through the warmer months (March through August), indicated by positive coefficient estimates and significant p-values for those estimates. This result is consistent with the previously observed seasonal pattern in ozone levels due to the availability of sunlight.

The negative coefficient estimates for pressure gradient and wind speed indicate that increased ozone levels are associated with decreased pressure gradients and wind speeds. This is consistent with the idea that light winds and decreased air movement resulting from a decreased pressure gradient allow pollutants in the air to accumulate. According to Bloomfield, Royle, Steinberg, and Yang (1996), wind speed is related to ozone levels through a simple nonlinear function, $1 / (1 + \text{Wind Speed}^{v})$, where $v$ is a nonlinear least squares fitted value. Such a transformation might have improved the predictive power of the final regression model, but a reciprocal transformation of the predictor variable WDSP was not possible due to the presence of “0” values among the measurements.

Another limitation of this model is that it did not take into account the ozone level from the previous day. According to a subregional scale study by Feister and Balzer (1991) of surface ozone levels at five stations in the German Democratic Republic over the period 1972-1987, the most important variable in predicting the ozone concentration was the ozone concentration from the previous day. However, due to the necessity of removing observations with missing items, a continuous time series examination was not possible using this data set.
Also, given the relatedness of the various meteorological factors, it was surprising that no interaction terms between the variables contributed significantly to the predictive power of the regression model.

Overall, the final regression model predicts the daily maximum one-hour-average ozone reading in the Los Angeles basin area in 1976 fairly well: the 95% prediction intervals successfully capture the actual ozone level readings in the validation data set. In conclusion, local relative humidity, local air temperature, month of the year, pressure gradient, wind speed, and visibility significantly contributed to constructing a model predicting the daily maximum one-hour-average ozone reading in Los Angeles.

#### References

- Bloomfield, P., J. A. Royle, L. J. Steinberg, and Q. Yang. 1996. Accounting for meteorological effects in measuring urban ozone levels and trends. Atmospheric Environment 30 (17): 3067-77.
- Breiman, L., and J. H. Friedman. 1985. Estimating optimal transformations for multiple regression and correlation. Journal of the American Statistical Association: 580-98.
- Feister, U., and K. Balzer. 1991. Surface ozone and meteorological predictors on a subregional scale. Atmospheric Environment. Part A. General Topics 25 (9): 1781-90.
- Kinney, P. L., and H. Özkaynak. 1991. Associations of daily mortality and air pollution in Los Angeles County. Environmental Research 54 (2): 99-120.
- Linn, W. S., Y. Szlachcic, H. Gong Jr., P. L. Kinney, and K. T. Berhane. 2000. Air pollution and daily hospital admissions in metropolitan Los Angeles. Environmental Health Perspectives 108 (5): 427.
- Ritz, B., F. Yu, S. Fruin, G. Chapa, G. M. Shaw, and J. A. Harris. 2002. Ambient air pollution and risk of birth defects in southern California. American Journal of Epidemiology 155 (1): 17.
- US Environmental Protection Agency (EPA). 1993. Air quality criteria for ozone and related photochemical oxidants. EPA/600/AP-93/004a-c. Washington, DC: U.S. Office of Research and Development, EPA.