Georgia Hamilburg
SOCU 3301
March 4, 2010

Data & Methods

Data

The data I used in my research project come from the General Social Survey (GSS) from 1972 to 2004. The GSS tracks the statuses, behaviors, and opinions of the United States population. Through research on the structure and development of American society, the GSS is able to monitor social change within the United States and compare it to other countries. The total number of respondents surveyed in the GSS from 1972 to 2004 is 46,510. Of these, my project studied 11,486 respondents.

When the survey was first fielded in 1972, the researchers used a modified probability sample with block-level quotas, and a full probability design was introduced in 1975. The block-level quotas were based on sex, age, and employment status, and quota sampling of this kind may introduce bias. The design introduced in 1975 combined full probability and block quota samples, which provides data for methodological comparisons and helps attribute differences over time to either shifts in sample design or changes in response patterns. Since 1980, the GSS has used a stratified random sampling technique.

In the first decade of the GSS, the sampling frame was based mostly on population size. It was a probability sample of identifiable households selected from Standard Metropolitan Statistical Areas (SMSAs) and Primary Sampling Units (PSUs). In 1980, the survey's frame included households from fewer PSUs but was divided into more subgroups based on controlled variables. Households were then chosen at random from within the groups into which they had been placed. The methods used from 1980 onward allowed for more statistical precision.

The GSS also accounted for sampling errors. The known sampling errors include the oversampling of blacks in 1972, the underrepresentation of males in all of the full probability samples, and the underrepresentation of 18-year-olds in general. All other sampling errors were corrected by weighting the sample. My study uses unweighted data.
Methods

Variables

The variables I am using from the GSS 1972 to 2004 are drunk, age, gender, and social class.

The first variable is drunk. It comes from the survey question, "Do you sometimes drink more than you should?" The GSS coded the responses as 1 for yes, 2 for no, 0 for not applicable, 8 for don't know, and 9 for no answer. For my study, I recoded this variable so that 1 remains yes and 0 stands for no, and I recoded all other values as missing data.

The second variable is age. It comes from the question of how old the respondent was at the time of the interview. The GSS assigned the values 18 through 88 to those ages, the value 89 to respondents 89 or older, 98 to don't know, and 00 to no answer. I recoded age into age groups. My study included the categories 18-20, 21-25, 26-30, 31-35, 36-40, 41-45, 46-50, 51-55, 56-60, and 61 & up. I recoded the remaining values (don't know and no answer) as missing data.

The third variable is gender. The literal question for this variable was the respondent's sex. The GSS coded this variable into two categories, with 1 for male and 2 for female. I recoded it so that male keeps the value 1 and female takes the value 0.

The final variable is social class. This variable of subjective class identification comes from the question, "If you were asked to use one of four names for your social class, which would you say you belong in: the lower class, the working class, the middle class, or the upper class?" The GSS coded the categories as follows: 1 for lower class, 2 for working class, 3 for middle class, 4 for upper class, 5 for no class, 0 for not applicable, 8 for don't know, and 9 for no answer. I recoded this variable into class categories. My study included the categories lower class, working class, middle class, and upper class. I recoded the other categories in the original survey as missing data.
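The recoding rules described above can be sketched in a few lines of Python. This is only an illustration of the logic, not the procedure actually run for the study: the function names are hypothetical, the code values are the ones listed in the text, and missing data are represented here as None.

```python
from typing import Optional

# Lower and upper bounds of the age groups below "61 & up" (from the text).
AGE_GROUPS = [(18, 20), (21, 25), (26, 30), (31, 35), (36, 40),
              (41, 45), (46, 50), (51, 55), (56, 60)]

# GSS class codes kept in the study; 5, 0, 8 and 9 become missing.
CLASS_LABELS = {1: "lower", 2: "working", 3: "middle", 4: "upper"}


def recode_drunk(value: int) -> Optional[int]:
    """1 (yes) stays 1, 2 (no) becomes 0, everything else is missing."""
    return {1: 1, 2: 0}.get(value)


def recode_gender(value: int) -> Optional[int]:
    """1 (male) stays 1, 2 (female) becomes 0."""
    return {1: 1, 2: 0}.get(value)


def recode_age(value: int) -> Optional[str]:
    """Map an age code to its age group; codes outside 18-89 are missing."""
    if value < 18 or value > 89:
        return None
    for low, high in AGE_GROUPS:
        if low <= value <= high:
            return f"{low}-{high}"
    return "61 & up"  # ages 61 through 89 (89 means 89 or older)


def recode_class(value: int) -> Optional[str]:
    """Keep the four named classes; all other codes are missing."""
    return CLASS_LABELS.get(value)
```

For example, recode_age(19) falls in the 18-20 group, while a code of 98 (don't know) is treated as missing.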
Descriptive Analysis

The descriptive analysis in this paper consists of frequency distributions and cross tabulations. Frequency distributions were run on the dependent variable drunk and on the independent variables age, gender, and social class. These frequencies were compared to reliable outside data to confirm generalizability. Cross tabulations were run for each of the independent variables (age, gender, and social class) against the dependent variable drunk. These cross tabulations (drunk versus age, drunk versus gender, and drunk versus social class) were evaluated using the appropriate significance tests and measures of association: lambda, tau, and Pearson's r.

Regression Analysis

For my study, I will use logistic regression because my dependent variable is dichotomous. It has only two categories: either the person does sometimes drink more than they should, or they do not. The models presented are:

Y(drunk) = A + B(age 18-20)
Y(drunk) = A + B(age 21-25)
Y(drunk) = A + B(age 26-30)
Y(drunk) = A + B(age 36-40)
Y(drunk) = A + B(age 41-45)
Y(drunk) = A + B(age 46-50)
Y(drunk) = A + B(age 51-55)
Y(drunk) = A + B(age 56-60)
Y(drunk) = A + B(gender)
Y(drunk) = A + B(lower class)
Y(drunk) = A + B(working class)
Y(drunk) = A + B(upper class)
Y(drunk) = A + B1(age 18-20) + B2(age 21-25) + B3(age 26-30) + B4(age 31-35) + B5(age 36-40) + B6(age 41-45) + B7(age 46-50) + B8(age 51-55) + B9(age 56-60) + B10(gender) + B11(lower class) + B12(working class) + B13(upper class)

For each model, I will analyze the coefficient, the significance of the coefficient, and the logged odds. The coefficient is the estimated effect of the independent variable on the dependent variable. In the full model, each age and social class coefficient is measured relative to the reference category, which is the category left out of the model (61 & up for age and middle class for social class). The significance test indicates whether the relationship found in the sample is strong enough to be generalized to the population. Together, these results show how strong the relationship between the two variables is and whether it holds beyond the sample.
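As an illustration of how the categories enter the full model, the sketch below turns one respondent's recoded values into the thirteen 0/1 predictors. It assumes that the omitted categories (61 & up and middle class) serve as the reference categories, which is inferred from their absence in the full model; the function and label names are hypothetical.

```python
# Age groups that get their own dummy variable; "61 & up" is left out
# and is assumed here to be the reference category.
AGE_DUMMIES = ["18-20", "21-25", "26-30", "31-35", "36-40",
               "41-45", "46-50", "51-55", "56-60"]

# Class categories that get a dummy; "middle" is left out (assumed reference).
CLASS_DUMMIES = ["lower", "working", "upper"]


def design_row(age_group: str, male: int, social_class: str) -> list:
    """Return the 13 predictors of the full model for one respondent."""
    row = [1 if age_group == group else 0 for group in AGE_DUMMIES]
    row.append(male)  # 1 for male, 0 for female
    row.extend(1 if social_class == name else 0 for name in CLASS_DUMMIES)
    return row


# A 21-25 year old working class man has exactly three predictors set to 1.
example = design_row("21-25", 1, "working")

# A middle class woman aged 61 & up is all zeros, so her predicted
# logged odds equal the constant A alone.
baseline = design_row("61 & up", 0, "middle")
```

Because the reference respondent is all zeros, each coefficient in the full model is read as a difference from that baseline.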
The logged odds describe the direction and strength of the relationship between each independent variable and the dependent variable in the logistic regression. Goodness of fit will be assessed using pseudo R-squared, the log-likelihood, and chi-squared difference tests.
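For a model with a single 0/1 predictor, such as Y(drunk) = A + B(gender), these quantities can be computed directly from a two-by-two table, because the fitted model reproduces the proportion of yes answers in each group. The sketch below shows the arithmetic. The counts are made up for illustration and are not GSS results, the function names are hypothetical, and McFadden's version of pseudo R-squared is assumed, since the text does not say which version is used.

```python
import math


def log_likelihood(yes: int, no: int) -> float:
    """Log-likelihood of a group when its own proportion of yes is fitted."""
    p = yes / (yes + no)
    return yes * math.log(p) + no * math.log(1 - p)


def fit_single_dummy(yes1: int, no1: int, yes0: int, no0: int) -> dict:
    """Logistic regression of a yes/no outcome on one 0/1 predictor.

    yes1/no1 are counts for the group coded 1, yes0/no0 for the group
    coded 0. All four counts must be greater than zero.
    """
    a = math.log(yes0 / no0)       # logged odds of yes in the 0 group
    b = math.log(yes1 / no1) - a   # difference in logged odds (coefficient B)
    ll_model = log_likelihood(yes1, no1) + log_likelihood(yes0, no0)
    ll_null = log_likelihood(yes1 + yes0, no1 + no0)  # constant-only model
    return {
        "A": a,
        "B": b,
        "odds_ratio": math.exp(b),
        "log_likelihood": ll_model,
        "pseudo_r2": 1 - ll_model / ll_null,    # McFadden (assumed)
        "chi2_diff": 2 * (ll_model - ll_null),  # chi-squared difference, 1 df
    }


# Made-up counts: 40 of 100 men and 20 of 100 women answer yes.
result = fit_single_dummy(yes1=40, no1=60, yes0=20, no0=80)
```

A positive B means the group coded 1 has higher logged odds of answering yes than the group coded 0, and the chi-squared difference compares the model against one with only the constant A.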