-represent quantity of an attribute numerically
-also used to measure psychological characteristics such as IQ test scores, personality traits, interests
-a set of numbers whose properties model the empirical properties of the objects to which the numbers are assigned
-the process of setting rules for assigning numbers in measurement
-to represent varying amounts of some trait, attribute, or characteristic
-no best way to assign numbers for all types of traits, attributes, or characteristics but there may be an optimal method for the construct you want to measure
-define when objects fall into the same or different categories with regard to an attribute
-examples: types of objects, college majors, sex, personality types
-procedures for organizing, summarizing, and describing quantitative information (e.g. test scores)
-pictorial (e.g. histogram, bar graph)
-measures of central tendency
-measures of variability (or dispersion)
-methods for making inferences about a population of objects based on information from a sample from that population
-contrast with descriptive statistics
-examples: correlation and regression; chi-square test of association; t-test and ANOVA
-other terms for variability: spread and dispersion
-each term refers to difference among scores within a sample or population
-three common types:
-range
-deviation scores
-variance and standard deviation
-a symmetrical, mathematically defined frequency distribution curve
-highest at the center (most frequent scores are at the mean) and tapering on both sides
-asymptotic towards the axis
-mean, median and mode are equal
-area under the curve is divided in terms of the standard deviation units and can aid in the interpretation of test scores
-distributions can be characterized by the extent to which they are asymmetrical or "skewed"
-describes the steepness of a distribution in its center
-a measure of how data are peaked or flat
-T-scores represent one transformation of z which overcomes the disadvantage of working with negative scores
-t-score = (z score x 10) + 50
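The conversion above can be sketched in a couple of lines of Python (the z-scores here are illustrative):

```python
def t_score(z):
    """Convert a z-score to a T-score (mean 50, SD 10), avoiding negative values."""
    return z * 10 + 50

# a few illustrative conversions
print(t_score(-1.5), t_score(0), t_score(2.0))
```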
-an expression of the percentage of people whose score on a test or measure falls below a particular score
-a disadvantage: real diff's b/w raw scores may be minimized near the ends of the distribution and exaggerated in the middle of the distribution
-NRT
-interpretation is based on an individual's relative standing in some known group
-percentiles
-CRT
-interpretation is based on measuring an individual's skill level in relation to a clearly specified standard
-can be used when more than one predictor variable is available
-multiple regression takes into account the correlation b/w each of the predictor scores and what is being predicted
-also taken into account are the correlations among the predictors
Y = a + b_{1}X_{1} + b_{2}X_{2}
Coefficient of determination
-accurate interpretation of correlation coefficients requires another statistic, the coefficient of determination
-calculated by squaring the correlation coefficient (r^{2})
-tells us how much variance in one variable is accounted for by the variance in the other
-predicting values based on knowledge of scores on other variables is a practical use of correlation
-simple linear regression: 1 predictor (x), 1 criterion (y; continuous)
-multiple regression: more than 1 predictor, 1 criterion (continuous)
-logistic regression is used when the variable being predicted is dichotomous (ex. gender)
-every increase of one unit in X will result in an increase of b units in Y
-y = predicted score on Y
-a = y-intercept
-b = slope or regression coefficient
-X = score on the predictor
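A minimal sketch of fitting the intercept a and slope b by least squares and then predicting a new Y (the data are hypothetical):

```python
from statistics import mean

# hypothetical predictor (X) and criterion (Y) scores
X = [1, 2, 3, 4, 5]
Y = [2, 4, 5, 4, 5]

x_bar, y_bar = mean(X), mean(Y)

# slope b = sum((x - x_bar)(y - y_bar)) / sum((x - x_bar)^2)
num = sum((x - x_bar) * (y - y_bar) for x, y in zip(X, Y))
den = sum((x - x_bar) ** 2 for x in X)
b = num / den
a = y_bar - b * x_bar   # the regression line passes through (x_bar, y_bar)

predicted = a + b * 6   # predicted Y for a new X of 6
print(round(a, 2), round(b, 2), round(predicted, 2))
```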
-numerical values obtained by statistical methods that describe reliability
-affected by the number of items
-reliability generally increases with the number of items
-will generally have a range from 0 to 1
-again, % total variance attributable to "true variance" (true variance/ total)
-R is an index of the theoretical reliability of a test
-R = ratio b/w variance of true score to variance of observed score
-R = (sigma)^{2}_{T}/(sigma)^{2}_{X}
-where (sigma)^{2}_{x}=true variance plus error variance
-parallel forms: two different versions of test that measure the same construct (each form has the same mean and variance)
-alternate forms: two different versions of a test that measure the same construct (tests do not meet the equal means and variances criterion)
-coefficient of equivalence is calculated by correlating the two forms of the test
-items range from weaker to stronger expressions of the variable being measured
-arranged so that agreement with stronger statements implies agreement with milder statements as well
-produces ordinal data
-the item reliability index is the product of the item-score standard deviation and the correlation b/w the item score and the total test score
-provides an indication of the test's internal consistency; the higher the index, the higher the consistency
-sources of error affect which reliability estimate is important
-each coefficient is affected by different sources of error
-goal: use the reliability measure that best addresses the sources of error associated with a test
-anything that occurs during the administration of the test that could affect performance
-environmental factors: temp., lighting, noise, how comfortable the chair is, etc.
-test-taker factors: mood, alertness, errors in entering answers, etc.
-examiner factors: physical appearance, demeanor, nonverbal cues, etc.
-subjectivity of scoring is a source of error variance
-more likely to be a problem with: non-objective personality tests; essay tests; behavioral observations
-the same test is administered twice to the same group with a time interval b/w administrations
-coefficient of stability is calculated by correlating the two sets of test results
-there are multiple sources of error that impact the coefficient of stability:
-stability of the construct
-time/maturation
-practice effects
-fatigue effects
-there are multiple sources of error that can impact the coefficient of equivalence:
-motivation and fatigue
-events that happen b/w the two administrations
-item selection will also produce error
-represents the degree of agreement (consistency) b/w multiple scorers (or judges, raters, observers, etc.)
-calculated with Pearson's r (or Spearman's rho, depending on the scale)
-training procedures and standardized scoring criteria increase consistency
-a measure of consistency within the test: how well do all of the items "hang together" or correlate with each other?
-homogeneity: the degree to which all items measure the same construct
1. split-half (with Spearman-Brown)
2. Kuder-Richardson (KR-20 & KR-21)
3. Cronbach's alpha
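Coefficient alpha can be sketched directly from its definition, alpha = (k/(k-1)) * (1 - sum of item variances / variance of total scores), using hypothetical item scores:

```python
from statistics import pvariance

def cronbach_alpha(items):
    """items: one list of scores per item, all answered by the same respondents."""
    k = len(items)
    item_vars = sum(pvariance(item) for item in items)
    totals = [sum(scores) for scores in zip(*items)]   # each person's total score
    return (k / (k - 1)) * (1 - item_vars / pvariance(totals))

# hypothetical 3-item scale answered by 5 people
item1 = [3, 4, 3, 5, 4]
item2 = [2, 4, 4, 5, 3]
item3 = [3, 5, 4, 5, 4]
print(round(cronbach_alpha([item1, item2, item3]), 3))
```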
-SEM is an estimate of measurement precision
-high reliability = small SEM; for a given standard deviation of scores, the SEM shrinks as reliability rises
-in practice, the SEM is most frequently used in the interpretation of an individual's test scores
-another statistic, the standard error of the difference (sigma_{diff}) is better when making comparisons b/w scores
-scores b/w people, tests, or two scores from the same person over time
-does the test look like it measures what it is supposed to measure?
-has to do more with the judgments of the test TAKER, not the user
-not a statistical issue
-criterion: the standard against which a test or a test score is evaluated
-a good criterion is generally:
-relevant
-uncontaminated
-something that can be measured reliably
-validity coefficient: typically a correlation coefficient between scores on the test and some criterion measure (r_{xy})
-Pearson's r is the usual measure, but may need to use other types of correlation coefficients depending on the data scale
-can also use "expectancy tables" for categorical criterion
-construct: unobservable underlying trait hypothesized to describe or explain behavior
-construct validity is the process of determining the appropriateness of inferences about the construct drawn from test scores
-formulate and test hypotheses derived from theories about the nature of the construct
-a miss wherein the test predicted that the test taker did possess the characteristic when that person did not
-guessing is only an issue for tests where a "correct answer" exists
-not an issue when measuring attitudes
-faking can be an issue with attitudes
-faking good: positive self-presentation
-faking bad: malingering or trying to create a less favorable impression
-random responding
-known as item-endorsement index in other contexts
-the proportion of the total number of test takers who got the item right
-ideal average difficulty (p_{i}) is halfway b/w the chance success rate and 1.0
-a simple correlation b/w the score on an item and the total score
-advantages: can test statistical significance of the correlation; can interpret the % of variability the item accounts for (r^{2}_{it})
-symbolized by d: compares proportion of high scorers getting item "correct" and proportion of low scores getting item "correct"
-higher positive values indicate item passed by more examinees in the upper group, while negative values indicate more from lower group passed the item
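Both item statistics can be sketched on a tiny hypothetical response matrix (1 = correct, 0 = incorrect):

```python
# hypothetical scored responses, one row per test taker
responses = [
    [1, 1, 0],
    [1, 1, 1],
    [1, 0, 0],
    [0, 0, 1],
]

n = len(responses)
n_items = len(responses[0])

# item difficulty p: proportion of all test takers answering the item correctly
p = [sum(person[i] for person in responses) / n for i in range(n_items)]

# discrimination d: p in the top half (by total score) minus p in the bottom half
ranked = sorted(responses, key=sum, reverse=True)
top, bottom = ranked[: n // 2], ranked[n // 2 :]
d = [
    sum(person[i] for person in top) / len(top)
    - sum(person[i] for person in bottom) / len(bottom)
    for i in range(n_items)
]
print(p, d)
```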
· "The most adequate conceptualization of a person's behavior in all its detail" (McClelland, 1951)
· "Consistent behavior patterns and intrapersonal processes originating within the individual" (Burger, 1997)
· “an individual’s unique constellation of psychological traits and states” (C&S)
-the relatively distinctive and stable patterns of behavior that characterize an individual and their reactions to the environment
-3 common components: focus on individual diff's; the individual diff's are relatively stable; usually refer to intrapersonal processes of emotions, motivations and cognitive processes
· any relatively enduring characteristic of an individual that distinguishes that person from another
o Example: extraversion, introversion, openness, conscientiousness
· a temporary, or transient presentation of some personality trait or disposition
o Examples: anxious, calm, fearful, embarrassed, happy, sad etc.
· types are unique sets of traits and states that are similar in pattern to an identified category of personality within a taxonomy of personalities
· Not all typologies are based on psychological theories with an empirical basis
· temperament of the blood, season of spring and the element of air
· associated with functioning of the liver (blood) makes a person optimistic and cheerful
· associated with the spleen, easily angered, bad tempered and controlling
· associated with the gall bladder; perfectionistic, depressive
·phlegm, winter, water
associated with the lungs, calm and unemotional
Six (modern) approaches to Personality
· Unconscious minds are largely responsible for important differences in behavior styles
· people can be described along a continuum of various personality characteristics
· Inherited predispositions and physiological processes explain individual differences
· keys to individual differences are degree of personal responsibility and self acceptance
· consistent differences are the result of conditioning and expectations
· differences are the result of the way people process information and explain differences in behavior
How are Personality assessments used?
· Employment matching
· Adjustment issues for decisions about military service
· Academic opportunities
· Employment mobility
· Diagnoses, or degree of impact from some trauma
· Inform treatments
· Research and validation of theory
Assessment Methods
· Interviews
· Self report to written questions
· Card sorts (q-sort)
· Responses to ambiguous stimuli
· Interviews or responses of friends, family, spouse, teacher, coworkers, etc.
· Case histories
· Ratings by judges or experts
· Paper and Pencil or computer aided
o Choose a response from options that represent various characteristics of personality
· Procedures for scoring require little judgment
· Allows for implementation of a variety of validity indices
o can be answered and scored quickly (scored reliably)
o Breadth of content
o Can be administered to groups or individuals
-can be answered quickly
-can be administered by computer
-can be scored quickly and reliably
-can be administered in groups or individually
-procedures for scoring require little interpretation
-allows for implementation of a variety of validity indices
o Assuming honesty and capacity/insight to answer questions accurately.
-categorical labels or integers, no meaningful middle grounds between categories
-e.g. 1-single, 2-married
-scales, or levels, of measurement help determine what statistical analyses are appropriate
-enable test users to make accurate score interpretations
-four levels:
-nominal
-ordinal
-interval
-ratio
**NOIR**
-nominal (AKA naming) level
-lowest level of measurement
-ordering is not important, only the label attached to designate a mutually exclusive and exhaustive category
-examples:
-medical diagnoses
-gender
-political party affiliation
-values imply nothing about magnitude of differences between one level to the next
-the numbers do not indicate units of measurement
-no absolute zero point; the statistical operations appropriate for ordinal scales are limited
-because of equal intervals between values, some mathematical operations are meaningfully appropriate:
-addition and subtraction, statistical tests based on mean scores and/or variance
-the difference between the highest and lowest scores
-sensitive to outliers
-the average of the sum of the squared deviations of each score from the mean
s^{2} = Σ(X_{i} - X̄)^{2} / (n - 1)
-the average deviation of each score from the mean
-the standard deviation is the square root of the variance
-expressed in the same units of measurement as the original scores
Positive skew
-only a few extremely high scores and many low scores
-tail goes to the right
Negative skew
-only a few extremely low scores and many high ones
-tail goes to the left
-z-scores can be used to calculate percentiles when raw scores have a normal distribution
-when used in conjunction w/ a Z-table, the z-score reveals the area of the normal distribution below the score in question
-a Z-table indicates the proportion of the total number of scores that fall below a given z-score (or within a range of z-scores)
-strong negative: (r=-.7)
-moderate/weak negative: (r = -.4)
-a way of interpreting test scores by comparing an individual's results to the scores of a group of test takers
-interpretation is relative
-percentiles
-"top 5%"
-the group of people whose performance on a particular test is analyzed for reference in evaluating the performance of individual test-takers
-a normative sample must be representative or typical of the intended population of interest
-diff's need to be proportionately represented in the sample
-e.g. gender, race/ethnicity
-sampling individuals from subgroups in the population in the same proportion as the population they are part of
-provides greater precision than a simple random sample of the same size
-can guard against the "unrepresentative" sample
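A sketch of proportional stratified sampling (the population and its stratum labels are made up):

```python
import random

random.seed(0)

# hypothetical population of 100 people: 60% "F", 40% "M"
population = [("p%03d" % i, "F" if i < 60 else "M") for i in range(100)]

def stratified_sample(pop, key, n):
    """Draw n people, keeping each stratum's share the same as in the population."""
    strata = {}
    for person in pop:
        strata.setdefault(key(person), []).append(person)
    sample = []
    for members in strata.values():
        share = round(n * len(members) / len(pop))   # proportional allocation
        sample.extend(random.sample(members, share))
    return sample

sample = stratified_sample(population, key=lambda person: person[1], n=10)
print(sum(1 for _, g in sample if g == "F"), sum(1 for _, g in sample if g == "M"))
```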
-each individual from the population has an equal chance of being included in the sample
-true random sampling is very rare in practice (time & $, ethics, self selection)
-contrast w/ random assignment (random assignment of participants in the selected sample to different experimental conditions)
-a sample that is convenient or available for use
-ISU psychology participant pool
-average performance of different samples of test-takers at various ages/grades
-scores often used as evaluative standards for one's performance on a test (e.g. below average, average)
-concept of "mental age"
-can be interpreted as the mean of all possible split-half correlations, corrected by the Spearman-Brown formula
-Ranges from 0 to 1 with values closer to 1 indicating greater reliability
-"generally acceptable" values are .70-.90
-coefficient alpha above .90 may be "too high;" indicating redundancy
-twins raised apart support genetic component in intelligence based on degree of similarity
-though not as similar as if raised together
-children adopted from mothers with higher IQs tend to have higher IQs, irrespective of adoption family's SES
-though those in higher SES have higher IQs
-central tendency measures are used to describe the typical response seen in a sample of observations
-variability measures are used to describe how much fluctuation in scores there are in a sample of observations
-we need both to interpret a person's score
-both variance and standard deviation reflect the variability of scores about the mean of the group
-typical distance of a score from the mean
-strong relation: r=.7 or higher
-moderate relation: r=.4 or lower
-standard error of the estimate (SE)
-indicates magnitude of errors in estimation
-higher correlations produce smaller SE
-lower correlations produce larger SE
-most widely used psychological test in the world
-developed by Hathaway and McKinley in the late 1930s and early 1940s
-University of Minnesota hospital and persons w/in the community
-originally designed to assist w/ the diagnosis of different psychiatric disorders
-at one time was popular for use in employment screening
MMPI-2
-items revised, removed, replaced
-norm: 1,138 males and 1,462 females b/w 18 & 80 from several regions and diverse communities w/in the US
-increased attention to "non-pathological" interpretation
-checklist: list of behaviors, thoughts, events, etc.: each is marked if it is present and/or on intensity
-can be filled out by individual or an evaluator
-rating scale: evaluator provides a score to indicate relative standing on a list of characteristics
-clinicians sometimes want info from clients that is best obtained through psychological tests
-many purposes
-general diagnostic decisions
-ID positive and negative personality traits/states/types
-working with "stuckness"
-info from tests is more "scientifically reliable" than the info from a clinical interview
-the semantic distinction is blurred
-assessment requires greater education, training and skills than simply administering a test
-educational
-clinical
-counseling
-geriatric
-business
-military
-other settings
-intelligence: e.g. Wechsler adult intelligence scale (WAIS), Stanford-Binet intelligence scales (SBIS)
-personality: e.g. MN multiphasic personality inventory-2 (MMPI-2)
-interview
-portfolio
-case history data
-behavioral observation
-role-play tests
-computer as tools
-by what (and how) they measure
-content
-format
-administration procedures (e.g. computer-assisted vs. paper/pencil)
-scoring and interpretation procedures
-psychometric quality
-reliability: does the test produce consistent measurement results?
-validity: does the test measure effectively what it purports to measure?
-adequate norms: was the test developed using samples similar to the people taking the test?
-test catalogues
-test manuals
-reference volumes
-journal articles
-online databases: e.g. PsycINFO and PsycARTICLES
-scaling involves quantifying and assigning a value
-classification involves categorization
-ordinal
-but, most of the time we treat psychological measures as being on interval scales
-ease of statistical manipulation
-works well in practice
-most frequently observed score
-only measure of central tendency that can be used with Nominal data
-the average of a set of scores
-found by summing all values and then dividing that sum by the total number of observed values
-requires interval or ratio data
-sensitive to every score in the sample, and may be inappropriate with skewed data/outliers
-point which divides the group in half so that 50% of scores fall above it and 50% fall below it
-better measure of central tendency than the mean when the data are skewed because it is unaffected by extreme scores
-data must be in an interval or ratio scale for the computation of means and other parametric statistics to be valid
-therefore, if data are measured on an ordinal scale, the median but not the mean can serve as a measure of central tendency
s = √[Σ(X - X̄)^{2} / (n - 1)]
σ = √[Σ(X - μ)^{2} / N]
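The two formulas differ only in the divisor (n - 1 vs. N); Python's statistics module implements both (the scores are hypothetical):

```python
from statistics import stdev, pstdev

scores = [4, 8, 6, 5, 3, 7]   # hypothetical scores

s = stdev(scores)        # sample SD: divides squared deviations by n - 1
sigma = pstdev(scores)   # population SD: divides by N
print(round(s, 3), round(sigma, 3))
```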
-a distribution having only one peak
-type of kurtosis
-distribution is "flat" with thin tails
-type of kurtosis
-distribution is "peaked" with fat tails
-type of kurtosis
-distribution is somewhere in between
-normal tails, normal peak
-68.2%
-34.1% within one standard deviation above the mean
-34.1% within one standard deviation below the mean
Z = (x - x̄) / s
-x̄ is the sample mean
-x is the individual score
-s is the sample standard deviation
-for a negative z-score, find its absolute value on the table, and subtract the tabled proportion from 1 to get the percentile
-ex: Z=-1.4
-find 1.4 on the z-table (get .9192)
-calculate 1-.9192
-solution: 8th percentile
-to find number of scores between 96th percentile and 8th percentile:
-.96-.08=.88
-multiply .88 by number of scores (60)
-60 x .88 = 52.8 ≈ 53
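The same worked example can be checked with statistics.NormalDist (Python 3.8+) in place of a printed z-table:

```python
from statistics import NormalDist

std_normal = NormalDist()          # standard normal: mean 0, SD 1

# area below z = -1.4 (same answer as 1 - .9192 from the z-table)
pct = std_normal.cdf(-1.4)
print(round(pct, 4))               # ~0.0808, i.e. the 8th percentile

# number of the 60 scores falling between the 8th and 96th percentiles
print(round((0.96 - 0.08) * 60))   # 53
```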
-say you have a score below the 50th percentile on a test with x̄ = 65 and s = 10
-apply the above values to the Z-score formula and solve
-the Z-score will be negative
-to find raw score of an individual who scored in the 25th percentile:
-1 - .25 = .75
-find percentile closest to .75 on Z-table (.7517)
-determine what the Z-score is by using the decimal places provided on the Z-table (.68)
-plug found z-score into Z-formula and solve
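The reverse lookup in the steps above can be done with NormalDist.inv_cdf; the mean and SD here are hypothetical:

```python
from statistics import NormalDist

mean, sd = 65, 10                  # hypothetical test mean and SD

z = NormalDist().inv_cdf(0.25)     # z-score at the 25th percentile (~ -0.67)
raw = mean + z * sd                # rearranged z-score formula: x = x̄ + z·s
print(round(z, 2), round(raw, 1))
```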
-units are not equal on all parts of the scale
-percentiles are an ordinal scale
-diff's b/w individuals near the middle are magnified and diff's at the extremes are compressed
-mean: 50
-SD: 10
-correlation coefficient
-ranges from -1 to 1
-statement about the direction of the relation
-statement about the strength of the relation
-also called the Pearson correlation coefficient, or Pearson product-moment coefficient of correlation
-used when the relationship b/w two variables is linear
-used when both variables are continuous
-need to use other estimates if non-linear relations or other variables that are not continuous
-pictorial representation of pairs of scores
-useful as a first estimate of the strength of the relation
-gives an idea of the shape of the distribution
-linear, curvilinear, clustering of scores
-useful to check for outliers/typos in the data set before running statistical analyses
-when variables A and B are related, it may be:
-A causes B
-B causes A
-C influenced both A and B (confound)
-need true experiment to establish causation
-technique must consider both the scales of measurement and correlation b/w the two variables
-linear regression does just that
-this is how we deal w/ the fact that the predictors and criterion are often on different scales of measurement
-simultaneous prediction w/ multiple predictors
-takes into account intercorrelation b/w predictor variables
-info on overall effect of all predictors together but also individual contributions
-change in R^{2} when adding predictors; incremental validity = proportion of variance in the criterion explained by a predictor above and beyond what is explained by other predictors
-trait: any distinguishable, relatively enduring way in which one individual varies from another
-exist as a construct
-state: distinguishable, but less enduring
-both traits and states are relative
-e.g. he is shy; unstated comparison b/w him and other people
Test-related behavior predicts non-test-related behavior
(assumption)
-in testing, factors other than what a test attempts to measure will influence performance on the test
-error variance: a component of a test score attributable to sources other than the trait or ability measured
-e.g. a test taker is sick while taking an intelligence test; a test-user failed to follow the instruction when giving a test
-most controversial assumption
-tests are tools, they can be used properly or improperly
-the performance data of a particular group of test-takers that are designed for use as a reference for evaluating or interpreting individual test scores
-part of test standardization
-derived from a "representative" sample of a country
-often developed using stratified sampling methods
-indicate how test scores for a measure compare to the norms for other measures of the same construct
-calculated using percentile scores; Equipercentile method
-created by selecting sub-groups from the normative sample
-limited by the sampling techniques used to create the normative sample
-can also be developed for a measure for use in a specific area
-most useful in cases where the national norms may not represent the local population
-another way to provide context for interpreting a test score
-distribution of scores from one group of test-takers used as basis for the calculation of test scores for future administrations of the test
-e.g. GRE, SAT
-NRT: covers a broad content domain; emphasizes discrimination among individuals (how did a person perform relative to other people who took the test?)
-CRT: focuses on a more specific content domain; emphasizes description of what individuals can and cannot do (has the person met the criteria for the test?)
-NRT: interpretation requires a clearly defined comparison group; scores are usually reported in terms of standard scores or percentile ranks
-CRT: scores are usually reported in absolute terms such as percentage of correct answers or pass/fail; interpretation requires a clearly defined standard of performance
-important to understand that normed scores do not represent standards or goals to be achieved by students
-norms simply describe typical or normal performance
-may have little or no application at the upper end of the knowledge/skill continuum
-more difficult to make proper comparisons between test takers
-subset of a population; needs to be representative of the population of interest
-applicability of tests: for whom was a test developed and with whom can it be used?
-generalizability of findings: to what population can I generalize my findings?
-inferential statistical methods (e.g. correlation, regression) assume random sampling
-observations have 2 parts:
-true score (T): actual standing on characteristic
-error (E): error affecting the observed score
-E is assumed to be random (i.e. not systematic)
X = T + E
where: X is the observed score on some test, T is the true score on the same construct, and E is the error affecting the observed score
-random
-systematic
1. test items are split in half
2. the scores of each half of the test are correlated (using Pearson's r)
3. correlation coefficient is corrected using the Spearman-Brown formula
(Spearman-Brown corrects for the halved test length, i.e. the number of items)
-how much measured scores vary from true score (on average)
-describes the shape of the entire domain of measurements
-in practice we don't have a large number of trials per individual, hard to capture "true score"
-typically we only have one score from one trial
-that one score represents only one point on the theoretical distribution of potential test scores, but where?
-SEM for an individual's test score can be calculated from other statistics that provide information about true scores:
-standard deviation of the distribution of test scores (sigma)
-reliability coefficient of the test (alpha)
-sigma_{meas} = sigma * √(1 - r_{xx})
-the observed score will be the best estimate of the true score
-but because of measurement error, it is not an exact indicator
-can use SEM to construct a range of values around the observed score that contains the true score at a specified level of confidence
-confidence intervals (CI) estimate the true score
-CI=observed score +/- Z_{crit}*sigma_{meas}
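Putting the SEM and CI formulas together (the SD, reliability, and observed score below are hypothetical):

```python
import math

sd = 15          # hypothetical SD of test scores
r_xx = 0.91      # hypothetical reliability coefficient
observed = 100   # an individual's observed score

sem = sd * math.sqrt(1 - r_xx)   # sigma_meas = sigma * sqrt(1 - r_xx)
z_crit = 1.96                    # 95% confidence level

lower = observed - z_crit * sem
upper = observed + z_crit * sem
print(round(sem, 2), round(lower, 1), round(upper, 1))
```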
-how does A's performance on test 1 compare to A's performance on test 2?
-how does A's performance on test 1 compare to B's performance on test 1?
-How does A's performance on test 1 compare to B's performance on test 2?
1. face
2. content
3. criterion
4. construct
5. consequential
-pros: test-users want to believe that it works, it should appear valid and believable to be instituted in non-psychological settings
-cons: face valid means people know what they're taking; likely to be able to fake results
-step 1: precisely define construct
-step 2: domain sampling: used to determine behaviors that might represent that construct
-step 3: determine adequacy of domain sampling
-criterion obtained at the same time as the test data
-test predicts a criterion in the future
-difference between predictive and concurrent validity is timing at which the criterion is obtained
1. precisely define the construct
2. form hypotheses about how that construct should relate to observable behaviors
3. hypothesize how your construct is related to other constructs
4. predict what these inter-relations should be like based on theory
-do subscales correlate with the total score?
-do individual items correlate with subscale or total scale scores? (internal consistency)
-do all of the items load onto a single factor in a factor analysis?
-"data reduction" technique: used to describe inter-relations among items in terms of a smaller number of underlying factors
-these factors are often interpreted as representing underlying constructs
interpretation depends on rates:
-base rate: extent to which a particular trait, behavior, characteristic, or attribute exists in the population
-hit rate: proportion of people accurately identified as possessing or exhibiting some characteristic
-miss rate: proportion of people the test fails to identify as having or not having a particular characteristic
-bias and fairness are both related to validity, but are separate issues
-bias is a factor inherent in a test that systematically prevents accurate, impartial measurement
-biased tests cannot be used fairly
-a valid test may be used in an unfair manner or a fair manner
-controversial book "the bell curve"
-better explanation: even aptitude tests measure the effects of experience
-administration procedures can systematically influence the results of particular groups
-rating errors: leniency error, tendency to be generous
-severity error: movie critics
-central tendency error: avoiding extremes
-halo effect: unrelated characteristics; can be positive or negative
-addition of points or different cutoff scores
-different scoring of items
-elimination of items w/ differential item functioning between groups
-separate lists
-within-group norming
-banding
-preference policies
-fairness is an application issue rather than a statistical issue
-extent to which a test is used in an impartial, just, and equitable way
-approach of determining if using a test is "fair"
-a test is fair if it helps us find the best candidates
-goal is to predict who would perform the best
-indifferent to the demographics of the applicants
-if demographic variables help predict success, that information should also be used in the selection process
-approach to determine if using a test is fair
-explicitly recognize race and gender differences
-selection procedures are biased if the proportion of applicants admitted from a demographic group differs from their proportion in the population
-goal is to have a group that is representative of the population
-less emphasis is placed on how well the people will do once selected
-approach to determine if using a test is fair
-compromise b/w the other two approaches
-embraces the notion that tests should select the most qualified people WHILE ALSO
-taking measures that will serve the interests of minority group members
-when contemplating the utility of a test, we need to consider:
1. psychometric soundness: reliability and validity
2. costs and benefits of use: economic; non-economic
Utility analysis
-use expectancy table
-Taylor-Russell table
-provides estimate of % persons chosen by test that will meet criterion
-set fixed "cut-score" for minimal performance based on expert opinion
-could be stimulated by anything: societal trends; personal experience; a gap in the literature; need for a new tool to assess a construct of interest
-operational definition: psychological constructs are not directly observable; meant to predict observable behaviors; must be defined in terms of how it will be measured
-yields an estimate of test-taker's position within a defined population on the trait being measured
-yields an estimate of test-taker's mastery (or amount) of the criterion
-criterion: the domain of subject matter the test is designed to measure
-easily comprehended: most should aim for 6th grade reading level
-avoid "double-barreled" questions: I vote democrat because I support social programs
-mixes positively and negatively worded items: helps avoid "acquiescence response set"
-makes the item pool deep: a real advantage in later test revision; # items developed should be double intended scale length
-taker presented with two test stimuli and asked to compare
-select the behavior that is more justified
-sorting tasks: takers asked to order stimuli on the basis of some rule
-categorical scaling: placed in categories
-comparative scaling: placed on continuum
-our inferences from testing depend on generalizability
-to this person: normed on a rep. sample; comprehension of test procedures by groups
-to this setting: testing conditions; administrations
-rule of thumb: no fewer than 5 subjects, as many as 10 for each test item
-preliminary assessment of a variety of factors: item content; administration; psychometrics
-usually done in small samples: stability of findings; sampling issues
-may do more test construction before moving on to "official" test tryout
-X=person's true level of trait/ability
-Y=how much trait is exhibited on an item
-IRT: studies factors affecting the relationship b/w X & Y
-looks at distribution of test-scores on a single item
-minimizes errors in measurement
-helps in removing "poor" items
-increases test reliability
-across administrations
-across items
-across raters
-minimizes errors in item difficulty and discrimination
-helps in removing "poor" items
-increases test reliability
-across levels of latent traits
-within items
-degree to which item differentiates among those with high/low levels of trait
-item characteristic curve (ICC): probability relationship b/w test-taker's response and trait level (single item)
-information function (IF): tells us which items work best for test-takers at given levels of the trait
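The ICC idea above can be sketched under a two-parameter logistic model, one common IRT formulation (the item parameters a and b here are made up):

```python
import math

def icc(theta, a, b):
    """2-parameter logistic ICC: probability of a correct response at trait level theta.
    a = discrimination (slope), b = difficulty (location)."""
    return 1 / (1 + math.exp(-a * (theta - b)))

# a hypothetical item with moderate discrimination and average difficulty
for theta in (-2, -1, 0, 1, 2):
    print(theta, round(icc(theta, a=1.5, b=0.0), 3))
```

At theta equal to the item's difficulty b, the probability of a correct response is .50; steeper slopes (higher a) mean the item discriminates more sharply around that point.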