Measurement: Reliability, Validity, Scales
A. Kent Van Cleave, Jr., Ph.D.
Major sources: http://clem.mscd.edu/~davisj/prm2/exper2.html; http://clem.mscd.edu/~davisj/prm2/exper3.html; http://psych.athabascau.ca/html/Validity/concept.shtml
The information below is very closely adapted (copy, paste, slight edit) from these sources.

Overview: Reliability; Internal Validity; External Validity; Threats to Validity; Scales: Construction, Reliability & Validity, Testing and Evaluation

Reliability
Are the results of the experiment / scale repeatable? If the experiment were done the same way again / if the scale were administered again, would it produce the same results? Reliability is a prerequisite for establishing the validity of the experiment / scale.
Types of reliability: http://www.socialresearchmethods.net/kb/reltypes.php

Internal Validity
Internal validity -- defending against sources of bias arising in the research design. Accuracy or truth-value: does the research design lead to true statements? Did the independent variable cause the effects in the dependent variable? In experimental research, this usually means eliminating alternative hypotheses.

External Validity
Generalizability: can the results be applied in another setting or to another population of research participants?

Threats to Validity
Fundamental equation: Observed = True + Error
The concern of the researcher is to reduce error, and to control for or statistically account for the error that remains. Error may lead to either a Type I error (rejecting a true null hypothesis) or a Type II error (failing to reject a false null hypothesis).
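The fundamental equation above can be illustrated with a small simulation. This is a sketch with made-up numbers (a "true" score distribution of mean 50, SD 10, and measurement error with SD 5, all hypothetical): when error is uncorrelated with the true score, the observed-score variance is approximately the sum of true-score and error variance, which is why reducing error is the researcher's central concern.

```python
import random

random.seed(0)

# Fundamental equation: Observed = True + Error.
# Simulate one "true" score per participant plus random measurement error.
true_scores = [random.gauss(50, 10) for _ in range(5000)]
errors = [random.gauss(0, 5) for _ in range(5000)]
observed = [t + e for t, e in zip(true_scores, errors)]

def variance(xs):
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

# Error adds variance: Var(Observed) ~ Var(True) + Var(Error),
# so noisier measurement spreads observed scores out.
print(round(variance(true_scores)), round(variance(errors)), round(variance(observed)))
```

The extra spread contributed by error is what raises the risk of Type I and Type II errors when testing hypotheses on the observed scores.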
These are threats to internal validity.

Threats to Validity
Decision          | H0 True      | H0 False
Reject H0         | Type I error | Correct
Do not reject H0  | Correct      | Type II error
(Columns give the true state of the null hypothesis.)

Threats to Validity: Subject Effect / Selection Effect
Results influenced by systematic differences in the research participants ("subjects") assigned to different conditions or treatments. Letting people choose to be part of a program or treatment, and using others who did not choose to be part of the program as a control group, is a problem: "self-selected" groups are usually different from groups made up of people who do not choose to be in a treatment group.
Common solution: matching or random assignment to groups.

Threats to Validity: History Effect
Results influenced by events outside the experiment. Example: one group of research participants is measured at several points in time. Some event that is not part of the research, occurring at the same time as the treatment, could affect the results. (Tylenol scare)

Threats to Validity: Maturation Effect
Results influenced by changes within subjects over time, e.g., growth, warm-up, fatigue, learning to learn. This is a problem in research that measures a dependent variable over a period of time, and especially in research with repeated exposures to the independent variable.
Common solution: a control group which is measured over the same period of time but does not receive the treatment.

Threats to Validity: Testing Effect, Reactivity
Results influenced by the data-gathering procedures. Example: being influenced by the test, or learning from one test administration to the next.
Common solution: use a control group which is also measured, but without the treatment, or with an alternative form of treatment.

Threats to Validity: Instrumentation
Results influenced by an aberration in the measuring tools, either a mechanical instrument or a test.
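The contrast between self-selection and random assignment can be sketched in a few lines. This is a toy simulation, not a real study: "motivation" stands in for any pre-existing participant difference, and the numbers are invented. Self-selection produces groups that differ sharply on it, while random assignment yields groups that are close to equivalent.

```python
import random

random.seed(1)

# Hypothetical participant pool; "motivation" is a pre-existing difference
# that self-selection would confound with the treatment.
pool = [{"id": i, "motivation": random.gauss(0, 1)} for i in range(200)]

# Self-selection: highly motivated people opt into the treatment.
self_selected = [p for p in pool if p["motivation"] > 0]
self_control = [p for p in pool if p["motivation"] <= 0]

# Random assignment: shuffle the pool, then split it in half.
random.shuffle(pool)
treat, control = pool[:100], pool[100:]

def mean_motivation(group):
    return sum(p["motivation"] for p in group) / len(group)

# Group difference under self-selection vs. random assignment.
print(round(mean_motivation(self_selected) - mean_motivation(self_control), 2))
print(round(mean_motivation(treat) - mean_motivation(control), 2))
```

The first difference is large by construction; the second hovers near zero, which is exactly why random assignment is the common solution to the selection effect.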
Example: the dependent variable (participants' mental health) may be measured by a poor test.
Common solution: select or develop a better measure.

Threats to Validity: Regression Artifact, Regression to the Mean
Results influenced by extreme scores moving toward the mean over time. Example: if a group is made up of those with the worst mental health scores (say, the most anxious or the most depressed), over time they are likely to improve without therapy. This improvement may be mistakenly attributed to the therapy.
Common solution: use a control group which has similar characteristics (mental health scores) but which does not receive the new therapy.

Threats to Validity: Attrition or Mortality Effect
When subjects drop out of an experiment, it can bias the results, especially when more subjects drop out of one treatment condition than another. This leads to a kind of subject effect, because the subjects in the different groups are no longer equivalent.
Common solution: if attrition looks like a problem, find out why participants dropped out. This can sometimes give important clues about the study.

Threats to Validity: Selection-Maturation Interaction
Suppose subjects in two comparison groups differ with respect to the independent variable and a subject-related variable such as age. Suppose also that the dependent variable is measured twice for each group, once at Time A and later at Time B, and that the independent variable is introduced in the interim. If the change in scores on the dependent measure from Time A to Time B differs between the two groups, the discrepancy may be due to the independent variable or to distinctive, naturally occurring developmental processes in the two age categories that make up the two comparison groups.

Threats to Validity: Experimenter Expectancy Effect, Experimenter Bias
Results influenced by the experimenter's actions or expectations. Studies have shown that researchers tend to find the results they are looking for -- a kind of self-fulfilling prophecy.
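The regression artifact described above can be demonstrated directly. In this sketch (simulated data, hypothetical anxiety scores with mean 50), each person has a stable true level and each measurement adds noise; selecting the worst scorers at Time A guarantees an apparent "improvement" at Time B even though no treatment occurs.

```python
import random

random.seed(2)

# Each person has a stable "true" anxiety level; each measurement adds noise.
true_anxiety = [random.gauss(50, 10) for _ in range(10000)]
time_a = [t + random.gauss(0, 8) for t in true_anxiety]
time_b = [t + random.gauss(0, 8) for t in true_anxiety]

# Select the worst 10% at Time A -- exactly what a therapy study might do.
cutoff = sorted(time_a)[-1000]
worst = [i for i, s in enumerate(time_a) if s >= cutoff]

mean_a = sum(time_a[i] for i in worst) / len(worst)
mean_b = sum(time_b[i] for i in worst) / len(worst)

# With no treatment at all, the extreme group's mean drops back toward 50.
print(round(mean_a, 1), round(mean_b, 1))
```

The drop from Time A to Time B is pure regression to the mean, which is why a control group selected by the same extreme-score criterion is the standard safeguard.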
Effects range from overt cheating to subtle influences on data collection and interactions with research participants. (Clever Hans) Experimenters are not always aware of the extent of these influences.
Example: if the researcher is the one to assess the research participants' mental health (the dependent variable), he or she may distort the assessments in the direction of the research hypothesis.
Common solution: use independent judges (inter-rater reliability), or more objective measurements of the dependent variable.

Threats to Validity: Demand Characteristics, Hawthorne Effect
Named after a famous series of experiments at the Hawthorne Works, a Western Electric manufacturing plant near Chicago, Illinois. Researchers selected a group of factory workers and changed various conditions such as lighting to see what would increase performance. Any change increased performance, suggesting that the research participants were responding to the general expectation that they would perform better, and to the social dynamics of being observed closely.
Results are influenced by subjects' expectations of desired behavior in the research setting, or by the social psychology of the experiment (response bias). Called "demand" characteristics because participants may perceive a demand to behave, or to report on themselves, in a certain way.
Example: the researcher communicates his or her expectations to the research participants, which in turn influences their responses. If the researcher is measuring depression, research participants may report less depression regardless of their feelings, because they think that is what is expected of them.
Common solution: blind and double-blind designs help avoid these problems. Also, using a control group which is measured the same way (getting some of the same influences) but without the treatment.
Threats to Validity: Halo Effect
The researcher forms expectations about certain subjects based on some subject characteristic. E.g., an outgoing, sociable subject is rated as being more intelligent or as having higher values. Example: judges rating participants may ascribe better mental health based on other characteristics.
Common solutions: random assignment, blind judges, more objective measures.

Threats to Validity: Non-specific Factors
Alternative hypotheses may arise in a particular experiment. Example: in psychotherapy research, the specific intervention may not cause the benefits; rather, the therapeutic relationship may lead to them. The independent variable, the new therapy, is not causing the benefits. Instead, the relationship factor, which is confounded with the independent variable, is causing the effects.
Common solution: a control group which is exposed to the same history but not the new form of psychotherapy.

Threats to Validity
We can only rarely control for all the threats to internal validity. It is important that we acknowledge them when reporting our results.

Scales
A scale is a group of items designed to assess some psychological variable as a quantitative attribute. Usually, we use several items to measure slightly different aspects of the same thing. The overall concern with psychological tests is whether they measure what they are supposed to measure. With any pencil-and-paper, computer, or other form of self-report scale where the responses to items will be aggregated to measure some construct, scale reliability and validity become concerns.

Scales: Three Sources of Error
1. Characteristics of the test itself: sampling error; effectiveness of distractors; difficulty of items relative to test-taker ability.
2. Characteristics of the test taker: test takers may make careless errors, misinterpret test instructions, forget test instructions, inadvertently omit test sections, or misread test items.
3. Scoring factors: on constructed-response items, sources of error include the clarity of the scoring rubrics and the clarity of what is expected of the test taker. Rater errors: raters may be inconsistent or change their criteria while scoring, and are subject to biases such as the halo effect, stereotyping, perception differences, leniency/stringency error, and scale shrinkage.

Scale Reliability
Internal consistency reliability focuses on the degree to which the individual items are correlated with each other; this is often called homogeneity. Formulas for this intercorrelation include Cronbach's alpha, the Kuder-Richardson Formula 20 (KR-20), and the Kuder-Richardson Formula 21 (KR-21). Most testing programs report Cronbach's alpha, which is functionally equivalent to KR-20 for dichotomous items.
Test-retest reliability: how well results from one administration of the test relate to results from another administration of the same test at a later time. Repeated testing carries the testing-effect threat (learning from the first administration), so an alternate-forms method may be used instead. A related index is split-half reliability, in which the scores for the two halves of a single test are correlated with each other; Cronbach's alpha is essentially the mean of all possible split-half coefficients.

Scale Validity
Construct validity: the ability of a test to measure the psychological construct, such as depression, that it was designed to measure. One way this can be assessed is through the test's convergent or divergent validity, which refers to whether a test gives results similar to other tests of the same construct and different from tests of different constructs.
Content validity: the ability of a test to sample adequately the broad range of elements that compose a particular construct.
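Cronbach's alpha itself is straightforward to compute from item scores. The sketch below uses simulated data for a hypothetical 4-item scale in which each item reflects a shared construct plus item-specific noise; the formula is the standard one, alpha = k/(k-1) * (1 - sum of item variances / variance of totals).

```python
import random

random.seed(3)

# Simulated 4-item scale: each item = shared construct + item-specific noise.
n = 500
construct = [random.gauss(0, 1) for _ in range(n)]
items = [[c + random.gauss(0, 1) for c in construct] for _ in range(4)]

def variance(xs):
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

def cronbach_alpha(items):
    # items: one list of scores per item, aligned by respondent.
    k = len(items)
    totals = [sum(resp) for resp in zip(*items)]  # total score per respondent
    item_var = sum(variance(it) for it in items)
    return k / (k - 1) * (1 - item_var / variance(totals))

print(round(cronbach_alpha(items), 2))
```

With these simulation parameters the items share half their variance, so alpha lands around .80, a level conventionally treated as acceptable internal consistency.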
Criterion-related validity: the ability of a test to predict someone's performance on something. For example, before actually using a test to predict whether someone will be successful at a particular job, you first determine whether persons already doing well at that job (the criterion measure) also score high on your proposed test.
Exercise: http://psych.athabascau.ca/html/Validity/frames.html

Steps in Scale Construction
1. Define the construct. Go to the literature if available (deductive approach). Is an already-validated scale available? Gather and analyze qualitative data (inductive approach). Often the construct will have several dimensions; these should be defined as well. Sometimes subject matter experts will be used to sort the concepts of the construct into dimensions.
2. Generate items. Ask questions that sample the construct domain and all of its dimensions. You will need a subscale for each dimension you choose to measure. Ask enough questions to adequately sample the domain and each dimension, and use a Darwinian approach: items that do not load on your construct will be eliminated. Limit the questions to just enough to get the job done; too many items turn off the respondent and may artificially inflate the reliability of the scale. Use subject matter experts to sort the items back to their dimensions, and retain the items that 80-85% of the experts sort where you intend them to go.
3. Make the scale construction decisions. What level of data is involved (nominal, ordinal, interval, or ratio)? What will the results be used for? Which scaling technique? How many scale divisions or categories should be used (1 to 10; 1 to 7; -3 to +3)? Should a response be forced or left optional? Should there be an odd or even number of divisions? (Odd gives a neutral center value; even forces respondents to take a non-neutral position.) What should the nature and descriptiveness of the scale labels be?
What should the physical form or layout of the scale be (graphic, simple linear, vertical, horizontal)?
4. Test / validate the scale. Scales should be tested for reliability, generalizability, and validity. Internal validation checks the relation between the individual measures included in the scale and the composite scale itself. External validation checks the relation between the composite scale and other indicators of the variable -- indicators not included in the scale.
Administer the scale to participants; for every item in the scale, about ten respondents will be needed for a pilot test. Factor analyze the results and see whether the items fall into the intended dimensions. A .40 loading on the factor is a good rule of thumb; items that do not load at .40 or above are discarded. Factor analyze two samples to determine scale stability (test-retest reliability).
5. Evaluate the scale. Calculate coefficient alpha (Cronbach's alpha). Alpha provides an estimate of the internal consistency of the test; it does not indicate the stability or consistency of the test over time, nor its stability or consistency across test forms, which are estimated using the equivalent-forms reliability strategy.
Cronbach's alpha is appropriately applied to norm-referenced tests and norm-referenced decisions (e.g., admissions and placement decisions), but not to criterion-referenced tests and criterion-referenced decisions (e.g., diagnostic and achievement decisions). Tests that have normally distributed scores are more likely to have high Cronbach's alpha reliability estimates than tests with positively or negatively skewed distributions, and Cronbach's alpha will be higher for longer tests than for shorter tests.
Evaluation is also concerned with scale validity.
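The claim above that alpha is higher for longer tests can be made concrete with the classical Spearman-Brown prophecy formula, which predicts the reliability of a lengthened test under the assumption that the added items are parallel to the existing ones. The reliability values below are hypothetical.

```python
# Spearman-Brown prophecy: predicted reliability when a test is lengthened
# by a factor k, given its current reliability r (assumes parallel items).
def spearman_brown(r, k):
    return k * r / (1 + (k - 1) * r)

# Doubling a test whose alpha is .70 pushes predicted reliability to .82.
print(round(spearman_brown(0.70, 2), 2))  # -> 0.82
```

This is also why "too many items may artificially inflate the reliability of the scale": length alone raises alpha, independent of item quality.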
Criterion-related validity: to establish that the scale predicts something, we can test it against subjects who do or will exhibit a range of values on the criterion. If the scale can discriminate between people with a high level of the attribute and those with a low level, then criterion-related validity has been established. We can use a concurrent validation strategy or one that is truly predictive. (Discuss with respect to employment testing.)
Discriminant validity: items from our scale should intercorrelate more highly with each other than with items designed to measure other constructs.
Convergent validity: at the same time, we look for high correlations between our scale and measures of the same or related constructs.
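The convergent/discriminant pattern can be checked with plain correlations. This sketch uses simulated data for hypothetical measures (a depression scale, a second depression scale, and an unrelated extraversion scale): a valid scale should correlate strongly with the former and near zero with the latter.

```python
import random

random.seed(4)

n = 400
depression = [random.gauss(0, 1) for _ in range(n)]
extraversion = [random.gauss(0, 1) for _ in range(n)]  # unrelated construct

# Each scale = its underlying construct + measurement noise.
our_scale = [d + random.gauss(0, 0.5) for d in depression]
other_dep_scale = [d + random.gauss(0, 0.5) for d in depression]
extraversion_scale = [e + random.gauss(0, 0.5) for e in extraversion]

def pearson(xs, ys):
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5

convergent = pearson(our_scale, other_dep_scale)     # same construct: high
discriminant = pearson(our_scale, extraversion_scale)  # different construct: near 0
print(round(convergent, 2), round(discriminant, 2))
```

A large gap between the two correlations is the evidence pattern sought in convergent/discriminant validation (the logic behind multitrait-multimethod matrices).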