PSYCH 440 Study Guide (2013-14 Armstrong)

Iowa State University, Psychology 440

Testing

the process of measuring variables by means of devices or procedures designed to obtain a sample of behavior

Testing in the US

- Army Alpha & Beta group intelligence tests used to screen recruits in WWI

- WAIS intelligence scale


Assessment

the process of gathering and integrating data for the purpose of making an evaluation

- Most effective when information is obtained using multiple techniques

Scaling

- the process of setting rules for assigning numbers in measurement, to represent varying amounts of some trait, attribute, or characteristic
- examples: weight, height, age, IQ test scores, personality traits, interests
- no best way to assign numbers for all types of traits, attributes, or characteristics, but there may be an optimal method for the construct you want to measure

Scale

set of numbers whose properties model empirical properties of the variables to which the numbers are assigned

Classification

- define when objects fall into the same or different categories with regard to an attribute
- examples: types of objects, college majors, sex, personality types


Advantages of Standardized Measurements

- Objectivity
- Quantification (put in terms of numbers)
- Communication
- Economy (more efficient)
- Scientific generalizability

Characteristics of an Effective Test

- Reliability- does the test produce consistent measurement results?
- Validity- does the test measure effectively what it purports to measure?
- Adequate norms- was the test developed using samples similar to the people taking the test?


Who is involved in assessment?

- Test developers- Psychologists required to adhere to ethical standards (APA, AERA)
- Test users- counselors, other therapists, teachers, human resources, researchers
- Test takers
- Society at large

Techniques of Psychological Assessment

- Tests
- Interviews
- Case history data
- Behavioral observation
- Role-playing
- Computer-based instruments

What are some assumptions we make in psychological testing and assessment?

- Psychological states and traits can be measured
- Various approaches to measuring aspects of the same thing can be useful
- Various sources of error are part of the assessment process
- Test-related behavior can predict behavior in other settings
- Present-day behaviors can predict future behaviors

Wilhelm Wundt

German physician who studied how individuals were similar rather than different

- Described human abilities with respect to reaction time, perception, and attention span

Sir Francis Galton

1. Studied genetic influence using pedigree charts

2. Attempted to quantify individual differences by classifying people

3. Developed first correlation coefficient, later refined by Karl Pearson

4. Created Anthropometric Laboratory in London in 1884

5. Major proponent of the eugenics movement; not very well respected today

Alfred Binet

1. Commissioned by France's education system to help identify "subnormal" children

2. Developed first intelligence test in 1905 with Theodore Simon, a diagnostic test to identify mental retardation

3. Mental age proposed as criterion for evaluation

4. Test revised by Lewis Terman at Stanford; current revisions still widely used

James McKeen Cattell

1. First American to systematically study assessment of individual differences

2. A student of Wundt, but more influenced by Galton's methods

3. Studied differences in reaction time

4. Brought early intelligence tests to the US

5. Coined the term "mental test" and described 50 measures that primarily assessed sensory and motor abilities

6. Named his daughter "Psyche"

David Wechsler

- Clinical psychologist
- Designed test to measure adult intelligence (Wechsler Adult Intelligence Scale)

Sputnik

Sparked public interest in education and educational testing

Culture

"the socially transmitted behavior patterns, beliefs, and products of work of a particular population, community, or group of people" (Cohen, 1994)

History suggests cultural bias in testing can have an adverse effect.

Culture and Testing

- Many early tests had NO minority individuals in standardization samples
- Items culturally grounded in the dominant American culture, e.g., "Who was the first person to discover America?"
- Translation problems: no corresponding object/word, changes in meaning
- Remains an issue

Informed consent

Permission to proceed with a diagnostic, evaluative, or therapeutic service on the basis of knowledge about the service, its risks, and its potential benefits

Variable

Characteristics or attributes of objects (people, places, things, animals, etc.) in a population that are not constant

Variability

measures are used to describe how much fluctuation in scores there is in a sample of observations

Measurement

the process of assigning numbers or symbols to characteristics or attributes of people, objects, or events according to a set of rules

Discrete

categorical labels or integers, no meaningful middle ground between categories

Continuous

numbers do not represent categories, middle ground between units possible

Descriptive statistics

- procedures for organizing, summarizing, and describing quantitative information (e.g., test scores)
- pictorial (e.g., histogram, bar graph)
- measures of central tendency
- measures of variability (or dispersion)
- examples: batting average, census data, horsepower

Inferential statistics

- methods for making inferences about a population of objects based on information from a sample of that population
- contrast with descriptive statistics
- examples: correlation and regression; chi-square test of association; t-test and ANOVA

Four types of scales

- Nominal
- Ordinal
- Interval
- Ratio

Scales become progressively more number-like moving down the list from nominal to ratio.

Measures of central tendency

- Mode (most frequently observed score)
- Median (50th percentile)
- Mean (average score)
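The three measures above can be computed directly with Python's standard library; a minimal sketch with a hypothetical set of test scores:

```python
import statistics

scores = [70, 75, 80, 80, 85, 90, 95]  # hypothetical test scores

print(statistics.mode(scores))    # 80 - most frequently observed score
print(statistics.median(scores))  # 80 - the 50th percentile
print(statistics.mean(scores))    # the average, about 82.14
```

In a perfectly normal distribution all three values would coincide; here the one repeated score pulls the mode and median together below the mean.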

Measures of variability

- other terms for variability: spread and dispersion
- each term refers to differences among scores within a sample or population

Three common types:

- Range
- Deviation scores
- Variance and standard deviation
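The three common measures can be illustrated with a short sketch on a hypothetical sample (values chosen so the variance works out to a whole number):

```python
import statistics

scores = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical sample of scores

score_range = max(scores) - min(scores)   # 7: highest minus lowest
mean = statistics.mean(scores)            # 5
deviations = [x - mean for x in scores]   # deviation scores always sum to 0
variance = statistics.pvariance(scores)   # 4: mean of the squared deviations
sd = statistics.pstdev(scores)            # 2: square root of the variance
```

Because raw deviation scores always sum to zero, they are squared before averaging, which is why variance (and its square root, the SD) is the workhorse measure.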

The Normal Distribution

- A symmetrical, mathematically defined frequency distribution curve
- Highest at the center (most frequent scores are at the mean) and tapering on both sides
- Asymptotic towards the abscissa
- Mean, median, and mode are equal
- Area under the curve is divided in terms of standard deviation units and can aid in the interpretation of test scores

Skewness

Distributions can be characterized by the extent to which they are asymmetrical or "skewed."

- Positive skew: only a few extremely high scores and many low scores; the tail extends toward the high end
- Negative skew: only a few extremely low scores and many high scores; the tail extends toward the low end

Kurtosis

Describes the steepness of a distribution in its center; a measure of how peaked or flat the data are.

- Platykurtic - relatively flat
- Leptokurtic - relatively peaked
- Mesokurtic - somewhere in between

Standard Scores

A raw score that has been converted from one scale to a new (standardized) scale with a prescribed mean and SD.

- Typically expressed in terms of number of standard deviations from the mean
- All standard scores have equal unit sizes across the distribution

Why use standard scores?

- More easily interpretable than raw scores
- We can tell where a score falls in relation to other scores
- Allow for easier comparisons of both similar and dissimilar scores

z score

results from the conversion of a raw score into a number indicating how many SD units the raw score is above or below the mean of the distribution; the resulting scale has a mean of 0 and an SD of 1

z = (x - xbar) / s
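The formula above is easy to apply to a whole distribution at once; a minimal sketch using hypothetical raw scores:

```python
import statistics

raw = [50, 60, 70, 80, 90]          # hypothetical raw scores
mean = statistics.mean(raw)         # 70
sd = statistics.pstdev(raw)         # population SD, about 14.14

z = [(x - mean) / sd for x in raw]  # z = (x - xbar) / s for each score

# By construction the z-scores have mean 0 and SD 1;
# the top raw score of 90 sits about 1.41 SDs above the mean.
```

Converting every score this way is what makes otherwise incomparable scales directly comparable.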

t score

a standard score calculated using a scale with a mean set at 50 and an SD set at 10; one transformation of z that overcomes the disadvantage of working with negative scores

t-score = (z score x 10) + 50

True Score

the true standing on some construct
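The T-score formula above is a one-liner; a small sketch showing how it removes the negative values that make raw z-scores awkward:

```python
def t_score(z: float) -> float:
    """Convert a z-score to a T-score: new mean 50, new SD 10."""
    return z * 10 + 50

print(t_score(0.0))   # 50.0 - an exactly average score
print(t_score(1.5))   # 65.0 - 1.5 SDs above the mean
print(t_score(-2.0))  # 30.0 - no negative number to explain to a test taker
```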

Stanine

standard score derived from a scale with a mean of 5 and an SD of approximately 2

Percentile

- an expression of the percentage of people whose score on a test or measure falls below a particular raw score, or a converted score that refers to a percentage of testtakers; contrasts with percentage correct
- widely used in test manuals as well as other literature on commercially published standardized tests
- a disadvantage: real differences between raw scores may be minimized near the ends of the distribution and exaggerated in the middle of the distribution
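The definition above translates directly into code; a sketch with a hypothetical score group (the function name is my own, not standard terminology):

```python
def percentile_rank(raw_score, group_scores):
    """Percentage of scores in the group that fall below the given raw score."""
    below = sum(1 for s in group_scores if s < raw_score)
    return 100 * below / len(group_scores)

group = [55, 60, 65, 70, 75, 80, 85, 90, 95, 100]  # hypothetical norm group
print(percentile_rank(85, group))  # 60.0 - six of the ten scores fall below 85
```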

Skewness and measures of central tendency

If a distribution is negatively skewed, the order, from left to right, is mean, median, mode.

If a distribution is positively skewed, the order, from left to right, is mode, median, mean.

Z-scores and the normal distribution

If we have a normal distribution, we can make the following assumptions:

- Approximately 68% of the scores are between z-scores of 1 and -1
- Approximately 95% of the scores are between z-scores of 2 and -2
- Approximately 99.7% of the scores are between z-scores of 3 and -3
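These 68/95/99.7 figures can be checked against the exact normal curve using the identity that the area within k SDs of the mean equals erf(k/sqrt(2)); a quick sketch:

```python
import math

def area_within(k):
    """Area under the standard normal curve between -k and +k SDs of the mean."""
    return math.erf(k / math.sqrt(2))  # standard-normal CDF identity

for k in (1, 2, 3):
    print(k, round(area_within(k), 4))
# 1 0.6827
# 2 0.9545
# 3 0.9973
```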

Z-scores: pros and cons

Pros:

- Indicates each person's standing as compared to the group mean
- Can be easily converted to percentiles

Cons:

- Negative z values can be difficult to work with and explain
- Dealing with fractional z values can be a hassle

Correlation

A statistical technique which allows us to make inferences about how two (or more) variables relate (co-relate) to each other (linearly).

- Expressed using a correlation coefficient: a statement about the direction and the strength of the relation
- Can range from -1.0 to +1.0
- When direction is positive, high scores on one variable are associated with high scores on the other
- Reversed interpretation when correlation is negative
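A correlation coefficient can be computed from scratch as the sum of cross-products of deviations over the product of the sums-of-squares; a sketch on hypothetical study-hours and grades data:

```python
import math

def pearson_r(x, y):
    """Pearson r: cross-products of deviations over sqrt of the SS product."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cross = sum((a - mx) * (b - my) for a, b in zip(x, y))
    ssx = sum((a - mx) ** 2 for a in x)
    ssy = sum((b - my) ** 2 for b in y)
    return cross / math.sqrt(ssx * ssy)

hours = [1, 2, 3, 4, 5]        # hypothetical hours of study
grades = [55, 60, 70, 80, 85]  # hypothetical exam grades
print(round(pearson_r(hours, grades), 3))  # 0.992 - a strong positive relation
```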

Testing Assumptions

- Psychological states and traits exist
- Psychological states and traits can be quantified and measured
- Test-related behavior predicts non-test-related behavior
- Measures have both strengths and weaknesses
- Various sources of error are part of the assessment process
- Testing and assessment can be conducted in a fair and unbiased manner
- Testing and assessment benefit society


Characteristics of a good test

- Reliability
- Validity

Norm-referenced (NRT)

- A way of interpreting test scores by comparing an individual's results to the scores of a known group of test takers
- Interpretation is relative: based on an individual's relative standing in some known group (e.g., percentiles)

Criterion-referenced (CRT)

- Interpretation is based on measuring an individual's skill level in relation to a clearly specified standard
- Not measured in relation to others

Correlation coefficient

index of the strength of the linear relationship between two continuous variables, expressed as a number that can range from -1 to +1

Simple regression

the analysis of the relationship between one independent variable and one dependent variable

Multiple regression

- The analysis of the relationship between more than one independent variable and one dependent variable
- Can be used when more than one predictor variable is available
- Takes into account the correlation between each of the predictor scores and what is being predicted
- Also taken into account are the correlations among the predictors

Y = a + b_{1}X_{1} + b_{2}X_{2}
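The two-predictor equation above can be fit by solving the least-squares normal equations; the sketch below (not from the course, and using hypothetical data generated exactly from Y = 1 + 2X1 + 3X2) solves the 3x3 system with Cramer's rule so it needs no external libraries:

```python
def det3(m):
    """Determinant of a 3x3 matrix given as a list of rows."""
    return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
            - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
            + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))

def fit_two_predictors(x1, x2, y):
    """Least-squares a, b1, b2 for Y = a + b1*X1 + b2*X2 via normal equations."""
    n = len(y)
    s12 = sum(u * v for u, v in zip(x1, x2))
    A = [[n,       sum(x1),                  sum(x2)],
         [sum(x1), sum(v * v for v in x1),   s12],
         [sum(x2), s12,                      sum(v * v for v in x2)]]
    c = [sum(y),
         sum(u * v for u, v in zip(x1, y)),
         sum(u * v for u, v in zip(x2, y))]
    d = det3(A)
    coeffs = []
    for i in range(3):  # Cramer's rule: swap column i for the constants vector
        Ai = [row[:] for row in A]
        for r in range(3):
            Ai[r][i] = c[r]
        coeffs.append(det3(Ai) / d)
    return coeffs  # [a, b1, b2]

# Hypothetical data generated exactly from Y = 1 + 2*X1 + 3*X2
x1 = [1, 2, 3, 4, 5]
x2 = [2, 1, 4, 3, 5]
y = [1 + 2 * a + 3 * b for a, b in zip(x1, x2)]
a, b1, b2 = fit_two_predictors(x1, x2, y)
print(a, b1, b2)  # recovers 1.0 2.0 3.0
```

Because the criterion was built exactly from the equation, the fit recovers the coefficients exactly; with real data the estimates would also reflect error and the correlations among the predictors.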

Coefficient of Determination (r^{2})

- Accurate interpretation of correlation coefficients requires another statistic, the coefficient of determination
- A value indicating how much variance is shared by the two variables being correlated
- Calculated by squaring the correlation coefficient (r^{2})
- Tells how much variance in one variable is accounted for by the variance in the other

Coefficient of Alienation

(1 - r^{2}) tells us how much variance is not accounted for
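A quick worked example of both quantities, using a hypothetical correlation of r = 0.70:

```python
r = 0.70                   # hypothetical correlation coefficient
determination = r ** 2     # about 0.49: 49% of the variance is accounted for
alienation = 1 - r ** 2    # about 0.51: 51% of the variance is NOT accounted for
print(round(determination, 2), round(alienation, 2))  # 0.49 0.51
```

Note how a seemingly strong r of .70 still leaves more than half the variance unexplained, which is why r should not be interpreted without r^{2}.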

Prediction

- predicting values of one variable based on knowledge of scores on other variables is a practical use of correlation
- simple linear regression: 1 predictor (x), 1 criterion (y; continuous)
- multiple regression: more than 1 predictor, 1 criterion (continuous)
- logistic regression is used when the variable being predicted is dichotomous (e.g., gender)

Linear Regression Equation

Y = a + bX

Every increase of one unit in X will result in an increase of b units in Y.

- Y = predicted score on Y
- a = y-intercept
- b = slope or regression coefficient
- X = score on the predictor
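The intercept and slope come from the standard least-squares formulas; a sketch on hypothetical, perfectly linear data so the result is easy to verify by eye:

```python
def fit_line(x, y):
    """Least-squares estimates for Y = a + bX (a = intercept, b = slope)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    b = (sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
         / sum((xi - mx) ** 2 for xi in x))
    a = my - b * mx
    return a, b

study_hours = [1, 2, 3, 4]       # hypothetical predictor X
exam_scores = [65, 70, 75, 80]   # hypothetical criterion Y, perfectly linear
a, b = fit_line(study_hours, exam_scores)
print(a, b)  # 60.0 5.0 -> each extra unit of X predicts 5 more units of Y
```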

Measurement Reliability

Refers to the stability or consistency of measurement.

Reliability is a matter of degree, not an all-or-none proposition.

Reliability is NOT concerned with

- Are we measuring what we intended to measure?
- The appropriateness of how we use the info
- Test bias

These are validity issues

Classical Test Theory

Any measurement score yielded from some test will be the product of two components: true score and error.

X = T + E

- X = observed score on some test
- T = true score, the true standing on some construct
- E = error, the part of the score that deviates from the true standing on the construct

Reliability Coefficients (R)

- Numerical values obtained by statistical methods that describe reliability
- Have properties similar to correlation coefficients
- Will generally range from 0 to 1; negative values are possible but not likely
- Affected by the number of items; reliability generally increases with the number of items
- R is an index of the theoretical reliability of a test: the ratio of true-score variance to observed-score variance

R = σ^{2}_{T} / σ^{2}_{X}, where σ^{2}_{X} = true variance plus error variance
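The ratio above can be made concrete with a small hypothetical simulation of X = T + E (the SDs of 15 and 5 are arbitrary choices, not from the course):

```python
import random
import statistics

random.seed(440)  # reproducible run

# Hypothetical simulation: true scores with SD 15, measurement error with SD 5
true_scores = [random.gauss(100, 15) for _ in range(10_000)]
observed = [t + random.gauss(0, 5) for t in true_scores]  # X = T + E

R = statistics.pvariance(true_scores) / statistics.pvariance(observed)
print(round(R, 2))  # close to the theoretical 15**2 / (15**2 + 5**2) = 0.9
```

Adding error inflates the observed variance, so the ratio of true to observed variance falls below 1; a noisier test would drive R lower still.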

Platykurtic vs. Leptokurtic

Platykurtic - flat distribution of scores, not enough scores at the center

Leptokurtic - peaked distribution, too many scores at the center

Validity vs. Reliability

Validity - the degree to which the test measures what it was created to measure

Reliability - the degree to which the test's results are consistent and stable

Assimilation vs. Accommodation

Assimilation - fitting information into existing schemas

Accommodation - modifying existing schemas to take in new information and/or experiences

Norm vs. Criterion Referenced

- Norm-referenced (NRT): interpretation is based on an individual's relative standing in a particular group; a good item is one that people who score high tended to get right, and vice versa
- Criterion-referenced (CRT): interpretation is based on measuring an individual's skill level in relation to a clearly specified standard; items need to assess mastery of the concepts

Fluid vs Crystallized intelligence

Fluid - nonverbal, mental efficiency, adaptive and new learning capabilities, related to mental operations and processes.

Crystallized - acquired skills and knowledge, well established cognitive functions

Parallel vs. Alternate Forms

- Parallel forms: two different versions of a test that measure the same construct AND have the same means and variances
- Alternate forms: two different versions of a test that measure the same construct BUT do not meet the equal means and variances criterion
- The coefficient of equivalence is calculated by correlating the two forms of the test

Homogeneous vs. Heterogeneous Tests

Homogeneous - measures only one construct

Heterogeneous - measures more than one construct

Convergent vs. Discriminant Validity

Convergent - the measure should correlate highly with other tests designed to measure the same construct; it doesn't have to be the exact construct, similar ones are okay

Discriminant - the complement of convergent validity: the measure should NOT correlate with measures of dissimilar constructs

Empirical vs. Theoretical Scales

Empirical - based on scientific data previously collected. Heterogeneous: not all of the items are highly correlated. Low internal consistency; may have no face validity.

Theoretical - based on a general theory. These scales are homogeneous, have high internal consistency, and often have strong face validity.

Content vs. Construct Validity

Construct validity - the extent to which your test/scale adequately assesses the theoretical concept that you say it does

Content validity - whether the items on your test actually sample what you're trying to measure, and whether the test is representative of it.

Concurrent vs Predictive

Concurrent validity - index of degree to which test score is related to a criterion measure obtained at the same time

Predictive validity - index of the degree to which a test score predicts scores on some future criterion

Likert vs Guttman Scale

Likert - gives five or seven alternative responses on some continuum for the test taker to choose from

Guttman - gives choices that range from a weaker to a stronger variation of a variable being measured. Agreement with one indicates agreement with all the ones before.

Cronbach's Alpha vs Pearson's r

Alpha - mean of all split-half correlations. Affected by the number of items

Pearson's r - measures the linear relationship between two variables. Always between -1 and 1
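Coefficient alpha is usually computed with the variance form rather than by averaging split-half correlations; a sketch on a hypothetical 3-item test taken by 4 people:

```python
import statistics

def cronbach_alpha(items):
    """Coefficient alpha via the variance form:
    alpha = k/(k-1) * (1 - sum of item variances / variance of total scores)."""
    k = len(items)
    totals = [sum(person) for person in zip(*items)]  # one total per test taker
    item_var_sum = sum(statistics.pvariance(item) for item in items)
    return k / (k - 1) * (1 - item_var_sum / statistics.pvariance(totals))

# Hypothetical data: each inner list holds one item's scores across 4 takers
item1 = [2, 3, 4, 5]
item2 = [2, 4, 4, 5]
item3 = [1, 3, 4, 5]
print(round(cronbach_alpha([item1, item2, item3]), 2))  # 0.98 - items hang together
```

Because the items rise and fall together across test takers, the total-score variance dwarfs the summed item variances and alpha comes out high; adding more items that behave this way would push alpha higher still, which is why alpha is affected by test length.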

Face validity vs. Content validity

Face validity - whether a test appears, on the surface, to measure what it is intended to measure

Content validity - how well a test measures the behavior domain for which it is intended

Stanford-Binet vs. Wechsler

Stanford-Binet - used from preschool through adulthood; has higher ceilings and lower floors

Wechsler - measures verbal and performance IQ separately; the psychometric properties are a huge advantage. Focuses on strengths and weaknesses.

Base rate vs hit rate

Base rate - extent to which a trait exists within a population you are going to generalize to.

Hit rate - proportion of people that your test/study accurately identifies as showing the trait that you were looking for.

Types of Drug Tests

- Urine screen - either with card onsite or sent to a lab
- Blood - most popular for accidents
- Saliva - gaining popularity
- Sweat - to collect over time
- Hair - residues encased in hair shaft


Cognitive Ability

- Often used when making employment related decisions
- Found to be valid predictors of future performance
- Group differences in performance make this controversial


Five Stages of Test Development

- Test conceptualization
- Test construction
- Test tryout
- Analysis
- Revision

Likert Scales

The test taker is presented with alternative responses (typically 5 to 7) on some continuum.

Guttman Scales

- Items range from weaker to stronger expressions of the variable being measured
- Arranged so that agreement with a stronger statement implies agreement with milder statements as well
- Produces ordinal data

Thurstone Scaling Method

Designed for developing a "true" interval scale for psychological constructs

-Start with a large item pool

-Get ratings of the items from experts; items are selected using a statistical evaluation of the judges' ratings

-Test-takers choose items to match their beliefs

-An individual's score is based on the judges' ratings


Choosing your Item Type

- Selected response items take less time to answer; used when breadth of knowledge is being assessed.
- Constructed response items are more time-consuming to answer; used to assess depth of knowledge.
- Selected response item scoring is more objective.

Item-Reliability Index

The product of the item-score standard deviation and the correlation between the item score and the total test score

Provides an indication of the test's internal consistency; the higher the index, the higher the consistency

S.A.T & Reliability

A lawsuit due to "high moisture content" put the test's reliability into question

Reliability is not:

-Are we measuring what we intended?

-The appropriateness of how we use information

-Test bias

^^These are concerns of validity!

Reliability

Does the test produce consistent measurement results?


Important note on reliability coefficients

Reliability evidence refers to a particular sample from a particular group, not to the test itself

Reliability Coefficient Formula

Indicates the proportion of true variance relative to total variance:

R = σ²(true) / σ²(total)

σ²(total) = σ²(true) + σ²(error)

total variance = true variance + error variance
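The ratio above can be sketched in a few lines (the variance values in the example, and the function name, are illustrative):

```python
# Classical test theory: R = true variance / total variance,
# where total variance = true variance + error variance.
def reliability_coefficient(true_variance: float, error_variance: float) -> float:
    total_variance = true_variance + error_variance
    return true_variance / total_variance

# Hypothetical example: 80 units of true variance, 20 of error -> R = 0.80
r = reliability_coefficient(80.0, 20.0)
```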

Reliability and Error

Sources of error determine which reliability estimate to choose; each coefficient is affected by different sources of error

Goal = use the reliability measure that best addresses the sources of error associated with a test


Test Construction Error

Item or content sampling (differences in wording or selected content)

Error produced by variation in items within or between a test

Has to do with how & what behaviors are sampled


Administration Error

Anything that occurs during the administration of the test that could affect performance

-Environmental factors: temperature, lighting, noise, how comfortable the chair is, etc.

-Test-taker factors: mood, alertness, errors in entering answers, etc.

-Examiner factors: physical appearance, demeanor, nonverbal cues, etc.


Scoring and Interpretation Error

Subjectivity of scoring is a source of error variance

More likely to be a problem with:

-non-objective personality tests

-essay tests

-behavioral observations

-computer scoring


Test-Retest Reliability

The same test is administered twice to the same group with a time interval between administrations

*Coefficient of Stability* is calculated by correlating the two sets of results


Test-Retest Sources of Error

Multiple sources of error impact the coefficient of stability:

-Stability of the construct

-Time/maturation

-Practice effects

-Fatigue effects


Parallel or Alternate Forms Reliability

Two different versions of a test that measure the same construct

Parallel = same means & variances

Alternate = don't meet the equal-means-and-variances requirement

*Coefficient of Equivalence* is calculated by correlating the two forms of the test

Parallel-Alternate Sources of Error

Multiple sources of error can impact the coefficient of equivalence:

-Motivation & fatigue

-Events between the two administrations

-Item selection error


When Parallel-Alternate is used most frequently

When the construct is highly influenced by practice effects

Inter-rater or Inter-scorer Reliability

-Represents the degree of agreement (consistency) between multiple scorers (or judges, raters, etc.)


Method used to calculate Inter-rater reliability

Pearson r or Spearman rho depending on the scale

Requirements for Inter-rater reliability

Proper training procedures and standardized scoring criteria

Internal Consistency

-A measure of consistency within the test >> How well do the items "hang together" or correlate with each other?

-Homogeneity: the degree to which all items measure the same construct


3 Ways to Measure Internal Consistency

-Split-Half (w/ Spearman-Brown)

-Kuder-Richardson (KR-20 & KR-21)

-Cronbach's Alpha


Split-Half Reliability (1 of 2)

Simplest way to calculate internal consistency

Steps:

-Test items are split in half

-Scores on each half are correlated

-The correlation coefficient is corrected using the *Spearman-Brown formula*


Spearman-Brown (2 of 2 for Split-Half)

Used to estimate the reliability of a test that has been shortened or lengthened

r_sb = n·r_xy / [1 + (n − 1)·r_xy]

Calculating "n" for Spearman-Brown

n = (# of items on the revised test) / (# of items on the original test)

-Have 300 items, want a test with only 100: n = 100/300 = .33

-Have 10 items and want a test with 30: n = 30/10 = 3

Spearman-Brown Examples

Look over and practice!
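A minimal sketch of the prophecy formula for practice, with n as defined above (the reliability values in the examples are hypothetical):

```python
# Spearman-Brown: r_sb = n*r_xy / (1 + (n - 1)*r_xy),
# where r_xy is the existing reliability and n is the length ratio.
def spearman_brown(r_xy: float, n: float) -> float:
    return (n * r_xy) / (1 + (n - 1) * r_xy)

# Shortening 300 items to 100 (n = 1/3) lowers reliability;
# tripling a test's length (n = 3) raises it.
shortened = spearman_brown(0.90, 100 / 300)
lengthened = spearman_brown(0.60, 3)
```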

Kuder-Richardson Formulas (Split-Half)

Two types: KR-20 & KR-21

Statistic of choice with dichotomous items (e.g., yes/no)

Kuder-Richardson w/ hetero items

KR-20 yields a lower estimate of r
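The cards name KR-20 without its formula; a minimal sketch using the standard formula KR-20 = [k/(k − 1)]·[1 − Σpq/σ²] (population variance assumed; rows are test-takers, columns are 0/1 items):

```python
# KR-20 for dichotomous (0/1) items.
def kr20(scores):
    k = len(scores[0])                      # number of items
    n = len(scores)                         # number of test-takers
    totals = [sum(row) for row in scores]
    mean_total = sum(totals) / n
    var_total = sum((t - mean_total) ** 2 for t in totals) / n
    # p = proportion passing each item, q = 1 - p
    pq = 0.0
    for j in range(k):
        p = sum(row[j] for row in scores) / n
        pq += p * (1 - p)
    return (k / (k - 1)) * (1 - pq / var_total)
```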

Coefficient Alpha (Split-Half)

-Developed by Lee J. Cronbach

-Mean of all possible split-half correlations, corrected by Spearman-Brown

-Ranges from 0 to 1 (higher values indicate higher reliability)

-Most popular reliability coefficient in psychological research

Calculating Coefficient Alpha

-Extension of KR-20

-Appropriate for non-dichotomous items (considers the variance of individual items)

α = [k/(k − 1)]·[1 − (Σσ²_i)/σ²]

k = # of items

σ²_i = variance of one item

σ² = variance of total test scores
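The formula above can be sketched directly (population variances assumed; rows are test-takers, columns are items):

```python
# Population variance of a list of values.
def variance(values):
    m = sum(values) / len(values)
    return sum((v - m) ** 2 for v in values) / len(values)

# Cronbach's alpha: [k/(k-1)] * [1 - sum(item variances) / total variance].
def cronbach_alpha(scores):
    k = len(scores[0])
    item_vars = [variance([row[j] for row in scores]) for j in range(k)]
    total_var = variance([sum(row) for row in scores])
    return (k / (k - 1)) * (1 - sum(item_vars) / total_var)
```

With 0/1 items, alpha reduces to KR-20, which gives a quick sanity check on the implementation.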

Power Test

A test with items that vary in their level of difficulty

Most test-takers will complete the test but will not get all the answers right


Speed Test

A test where all items are approximately equal in difficulty

Test-takers will get the answers right but won't finish

Speed Test Reliability

Reliability must be based on multiple administrations:

-Test-Retest Reliability

-Alternate-Forms Reliability

-Split-Half (special formula used)

Power vs. Speed

Power test - can use all of the regular reliability coefficients

Speed Test - reliability must be based on multiple administrations

Criterion vs. Norm

Norm-referenced - traditional reliability methods are used

Criterion-referenced - items reflect material that is mastered hierarchically

^^Reduced variability in scores, which reduces reliability estimates

Norm vs. Criterion

Norm-referenced test: good item = those who score high get it right, those who score low get it wrong

Criterion-referenced test: items need to assess mastery of concepts

^^Pilot comparison of groups with & without mastery to assess items

Generalizability Theory

-Developed by Lee J. Cronbach

-An alternative view of Classical Test Theory (based on domain sampling theory); there is no 'true' score

-Suggests a test's reliability is a function of the circumstances under which it is developed, administered, and interpreted

-Scores vary across administrations depending on environmental conditions called 'facets'


Impact of Facets on Test Scores

Facets include: # of items on the test, training of test scorers, purpose of the test administration

-If all facets are the same across administrations, we should expect the same score

-If facets vary, scores should vary


Applications of Generalizability Theory

-All possible scores from all possible facet combinations = *Universe Score*

-Provides practical info for making decisions:

In what situations will this test be reliable?

What facets impact the test the most?

Generalizability vs. True Score

True-score theory doesn't identify the different characteristics that make up an observed score

True-score theory doesn't differentiate a finite sample of behaviors from the universe of behaviors

Generalizability theory describes the conditions (facets) over which one can generalize scores

Standard Error of Measurement

-Important for the interpretation of individual test scores

-Provides an estimate of the amount of error inherent in an observed score or measurement

-Based on True-Score Theory

-Inversely related to reliability

-Used to estimate the extent to which an observed score deviates from the true score


Standard error of measurement

-SEM is an estimate of measurement precision

-high reliability = small standard deviation of scores = small SEM

SEM in Relation to Classical Test Theory

Observed score = true score + error

SEM - a method of estimating the amount of error in a test score

^^A function of the reliability of the test and the variability of test scores
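The relationship above is conventionally written SEM = SD·√(1 − r); a minimal sketch under that assumption (the SD and reliability in the example are hypothetical):

```python
import math

# SEM as a function of score variability (SD) and reliability (r).
def standard_error_of_measurement(sd: float, reliability: float) -> float:
    return sd * math.sqrt(1 - reliability)

# Hypothetical IQ-style scale: SD = 15, reliability = .91 -> SEM = 4.5
sem = standard_error_of_measurement(15, 0.91)
```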

Reliability and SEM

High reliability = highly consistent results

Leads to a small standard deviation of possible scores and a small SEM

Reliability and SEM are *inversely related*


True Score Estimates

-The observed score is the best estimate of the true score (not exact, due to measurement error)

-The standard error of measurement forces us to think of observed test scores as indicating a potential range of scores


95% Confidence Interval

True score falls within the observed score ± 1.96·SEM (e.g., 100 ± 1.96σ_meas)

99% Confidence Interval

True score falls within the observed score ± 2.58·SEM (e.g., 100 ± 2.58σ_meas)
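A minimal sketch of both intervals (the observed score of 100 and SEM of 4.5 are hypothetical):

```python
# z-values for the two confidence levels used above.
Z = {0.95: 1.96, 0.99: 2.58}

def true_score_interval(observed: float, sem: float, level: float = 0.95):
    z = Z[level]
    return (observed - z * sem, observed + z * sem)

# Hypothetical observed score of 100 with SEM = 4.5:
low, high = true_score_interval(100, 4.5, 0.95)   # roughly 91.2 to 108.8
```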

Standard Error of the Difference

-Used when making comparisons between scores

-Determines how large a difference should be before it becomes statistically significant


Standard error of the difference

-in practice, the SEM is most frequently used in the interpretation of an individual's test scores

-another statistic, the standard error of the difference (sigma_{diff}) is better when making comparisons b/w scores

-scores b/w people, tests, or two scores from the same person over time
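The cards don't give the formula; σ_diff is conventionally computed from the two scores' SEMs, so a minimal sketch under that assumption (both scores assumed to be on the same scale):

```python
# sigma_diff = sqrt(SEM1^2 + SEM2^2)
def standard_error_of_difference(sem1: float, sem2: float) -> float:
    return (sem1 ** 2 + sem2 ** 2) ** 0.5

# Two hypothetical scores, each with SEM = 3: sigma_diff is about 4.24,
# so a difference should exceed roughly 1.96 * 4.24 (about 8.3 points)
# to be significant at the .05 level.
sigma_diff = standard_error_of_difference(3, 3)
```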

Validity

A general term referring to a judgment regarding how well a test measures what it claims to measure


Validity and Context

-Validity, like reliability, is not an all-or-none characteristic of a test

-Validity statements refer to the degree of appropriateness of inferences

-"Is the validity of the test sufficient to make the use of the test worthwhile for this person or these people at this time, under these circumstances?"

Face Validity

-Does the test look like it measures what it's supposed to?

-Has more to do with the judgments of the test-taker than the test user

-Not a statistical issue

-Psychometric soundness doesn't require face validity (& vice-versa)


Content Validity

-A judgment of how adequately a test samples behavior representative of the behavior it was designed to sample


Content Validity Steps

1 - Precise definition of the construct being measured

2 - Domain sampling is used to determine behaviors that might represent the construct

3 - Determine the adequacy of the domain sampling

^^Use Lawshe's Content Validity Ratio (CVR)

Content Validity Ratio (Step #3)

Experts rate items:

a. Essential to the construct

b. Useful but not essential

c. Not necessary

Values range from -1 to +1:

Negative: fewer than half rated the item "essential"

Zero: exactly half rated it "essential"

Positive: more than half rated it "essential"

>>Items are kept if agreement exceeds chance
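Lawshe's CVR is conventionally computed as (n_e − N/2)/(N/2), where n_e is the number of experts rating the item "essential" and N is the total number of experts; the formula itself is not on the card and is supplied here as a sketch (its sign behavior matches the negative/zero/positive rules above):

```python
# CVR = (n_e - N/2) / (N/2)
def content_validity_ratio(n_essential: int, n_experts: int) -> float:
    half = n_experts / 2
    return (n_essential - half) / half

# Hypothetical panel: 8 of 10 experts rate an item essential -> CVR = 0.6
cvr = content_validity_ratio(8, 10)
```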

Criterion-Related Validity

-Criterion: the standard against which a test or test score is evaluated

-A good criterion is typically relevant, uncontaminated, & something that can be measured reliably


Concurrent Validity

An index of the degree to which a test score is related to a criterion measure obtained at the same time

-Example: a new depression inventory correlating .89 with Beck Depression Inventory scores and .85 with a psychiatric diagnosis of Major Depressive Disorder


Predictive Validity

An index of the degree to which a test score predicts scores on some future criterion

-Examples: SAT/ACT scores predicting college GPA; personality and ability measures predicting job performance; interests and values predicting job satisfaction


Assessing Criterion Validity

-Validity coefficient - the correlation between scores on the test and some criterion measure (r_xy)

-Pearson's r is the usual measure, but other coefficients may be needed depending on the scale

-Can also use "expectancy tables" for a categorical criterion


Interpreting Validity Coefficients

Can be severely impacted by restriction of range effects

Incremental Validity

-Does this test predict any additional variance beyond what is already predicted by another measure?

-Different formulas for related vs. unrelated variables

Construct Validity

The process of determining the appropriateness of inferences drawn from test scores measuring a construct

*The "umbrella" validity*


Construct validity

-construct: unobservable underlying trait hypothesized to describe or explain behavior

-construct validity is the process of determining the appropriateness of inferences about the construct drawn from test scores

-formulate and test hypotheses derived from theories about the nature of the construct

Construct Validation Definition

Takes place when an investigator believes an instrument reflects a particular construct, to which certain meanings are attached

This interpretation generates specific testable hypotheses, which are a means for confirming or disconfirming the claim

Construct Validation Parts

-Hypothesize about how construct relates to observables and other constructs

-Prediction of what these inter-relations should be like is based on a theory

-Evidence for a construct is obtained by the accumulation of many findings

Evidence for Construct Validity

-The test is homogeneous

-Test scores:

Increase or decrease as predicted

Vary by group as predicted

Correlate with other measures as predicted


Homogeneity Evidence

*Uniformity*

-Do subscales correlate with the total score?

-Do individual items correlate with subscale or total scale scores?

-Do all of the items load onto a single factor in a factor analysis?

Change Evidence

-If the construct is hypothesized to change over time, these changes should be reflected by either stability or lack of stability in scores

-Will the construct change after an intervention?

Group Difference Evidence

Predicted differences should fit with theory

Convergent Validity

-Does our measure correlate highly with other tests designed for or assumed to measure the same construct?

-Doesn't have to measure the exact same construct; similar ones are OK!

Discriminant Validity

-A measure should not correlate with measures of dissimilar constructs

-It's a problem if the measure correlates highly with measures that it shouldn't

Multitrait-Multimethod Matrix

-Both convergent and discriminant validity can be demonstrated using the matrix

-Multitrait - must include two or more traits in the analysis

-Multimethod - must include two or more methods of measuring the construct

Expectancy Data

-Additional info that can be used to help establish the criterion-related validity of a test

-Usually displayed in an expectancy table showing how likely a test-taker is to score within an interval of scores on a criterion measure

Base Rate

The extent to which a particular trait, behavior, characteristic, or attribute exists in the population

-The proportion of the population who will meet the criteria

Hit rate

Proportion of people accurately identified as possessing or exhibiting some characteristic

Miss Rate

Proportion of people the test fails to identify as having or not having a particular characteristic

False Positive

A miss wherein the test predicted the test-taker possessed the characteristic when they did not

False Negative

A miss wherein the test predicted that the test-taker did not possess the characteristic when that person did

Selection Rate

-Proportion of the population selected

-As validity increases, the success rate of those selected improves over the base rate

-With a small selection rate, even a small increase in validity will help

TD Step #1: Test Conceptualization

Could be stimulated by anything:

societal trends, personal experiences, etc.

TD Step #2: Test Construction

-Scaling is the process of selecting rules for assigning numbers to the measurement of varying amounts of some trait

-There is no best way to assign numbers for all types of traits, but there may be an optimal method for the construct you want to measure

Likert Rating Scale

-Test-taker is presented with 5 alternative responses on a continuum

-Extremes are assigned scores of 1 and 5

-Generally reliable

-Results in ordinal-level data

-Summative scale

Method of Paired Comparisons

Test-taker is presented with two stimuli and asked to make some sort of comparison


Sorting Tasks

-Test-takers asked to order stimuli on the basis of some rule

-Categorical - placed in categories

-Comparative - placed in an order


Selected Response

-Takes less time to answer

-Assesses breadth of knowledge

-More objective (& therefore more reliable)

Constructed Response

-More time-consuming to answer

-Assesses depth of knowledge

Item Type Disadvantages

Table 7-1 in book

Test Construction: Writing Items

-Write twice as many items for each construct as intended for the final test

-ITEM POOL = reservoir of potential items that may or may not be used

Test Construction: Scoring Items

-Decisions about scoring items are related to the scaling methods used when designing the test

-Three options: cumulative, class/categorical, & ipsative

TD Stage #3: Test Tryout

-Should use participants and conditions that match the test's intended use

-Initial studies should use 5 or more participants for each item

Guessing & Faking

-Guessing is only an issue for tests with a 'correct answer'

-Faking can be an issue with attitude measures: faking good & faking bad

Guessing and faking

-guessing is only an issue for tests where a "correct answer" exists

-not an issue when measuring attitudes

-faking can be an issue with attitudes

-faking good: positive self-presentation

-faking bad: malingering or trying to create a less favorable impression

-random responding

Guessing Correction Methods

-Verbal or written instructions to discourage guessing

-Penalties for incorrect answers

-Not counting omitted answers as incorrect

Faking Corrections

-Lie scales

-Social desirability scales

-Fake good/bad scales

-Infrequent response items

-Total score corrections based on scores obtained from measure of faking

-Using measures with low face validity

TD Step #4: Item Analysis

-A good test is made of good items: reliable & valid!

-Good items help discriminate between test-takers on the basis of some attribute

-Item analysis differentiates good items from bad

Item Analysis - Basic Procedures

-Vary depending on goals

-May include enhancing forms of reliability, certain forms of validity, & discrimination

Item Difficulty Indices

On a test with right and wrong answers, test-takers who are highest on the attribute should ideally get more items correct than those who are not high on the attribute

Whether you get an item right or wrong should be based on differential standings on the attribute

Item difficulty indices

-known as item-endorsement index in other contexts

-the proportion of the total number of test takers who got the item right


Ideal Average

The ideal average item difficulty (p_i) is halfway between chance guessing and 1.0:

4-option MC: (.25 + 1)/2 = .625

5-option MC: (.20 + 1)/2 = .60

True/False: (.50 + 1)/2 = .75
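The three examples above can be sketched as:

```python
# Ideal average difficulty: halfway between the chance-guessing
# level for the item format and 1.0.
def ideal_average_difficulty(chance_level: float) -> float:
    return (chance_level + 1.0) / 2

four_option = ideal_average_difficulty(0.25)   # 4-option MC -> .625
five_option = ideal_average_difficulty(0.20)   # 5-option MC -> .60
true_false = ideal_average_difficulty(0.50)    # true/false  -> .75
```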

Item-Total Correlation

A simple correlation between the score on an item and the total test score

Advantages: can test the statistical significance of the correlation; can interpret the % of variability the item accounts for (r_it²)
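A minimal sketch in plain Python (rows are test-takers, columns are items; function names are illustrative):

```python
# Pearson correlation, written out from its definition.
def pearson_r(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

# Correlate one item's scores with the total test scores;
# return both r and r^2 (the % of variability accounted for).
def item_total_correlation(scores, item_index):
    item = [row[item_index] for row in scores]
    totals = [sum(row) for row in scores]
    r = pearson_r(item, totals)
    return r, r ** 2
```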

Item-Validity Index

Does the item measure what it's supposed to?

Often evaluated using latent trait theory

Evaluated using confirmatory factor analysis

Item-Discrimination Index Goal

-If discrimination between those high & low on some construct is the goal, we would want items with higher proportions of high scorers getting the item correct and lower proportions of low scorers getting the item correct

Item-Discrimination Index

Used to compare performance on an item in the upper and lower ranges of test scores

Regions with 25-33% of sample yield best results

Contrast number who got item correct in upper and lower ranges

Item-discrimination index

-symbolized by d: compares proportion of high scorers getting item "correct" and proportion of low scorers getting item "correct"

-higher positive values indicate item passed by more examinees in the upper group, while negative values indicate more from lower group passed the item

Item-Discrimination Index Formula

d - compares proportions of high and low scorers getting the item correct

d=[U-L]/n

Higher positive values indicate item passed by more examinees in upper group; negative values indicate more passing in the lower group
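A minimal sketch of the d = [U - L]/n formula above (function name and the example counts are ours):

```python
def discrimination_index(upper_correct, lower_correct, group_size):
    """Item-discrimination index d = (U - L) / n, where U and L are
    the numbers answering correctly in the upper and lower score
    groups and n is the size of each group."""
    return (upper_correct - lower_correct) / group_size

# e.g., 22 of the top 25 scorers pass the item but only 10 of the
# bottom 25 do -- a positively discriminating item:
print(discrimination_index(22, 10, 25))  # -> 0.48
```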

Empirically Keyed Scales (Item-Discrimination Method 2)

-Choose items that produce differences between the groups that are better than chance (on basis of discrimination)

-Scales often have heterogenous content and limited range of interpretation

-Used in clinical settings for diagnosis of mental disorders

TD Step #5: Test Revision

-Modifying the test stimuli, administration, etc. on the basis of either quantitative or qualitative item analysis

-Each item has strengths and weaknesses >> balance!

Cross Validation

-Re-establishing the reliability and validity of the test with other samples

-May be conducted by developer or researcher with interest

-Must use different sample than one used in other stages

Item Fairness

-Unfair if it favors one particular group

-Results in systematic differences between groups not due to construct

-Persons showing same ability as measured should have same probability of passing any given item that measures the ability

What is Personality?

· “The most adequate conceptualization of a person’s behavior in all its detail” (McClelland, 1951)

· “Consistent behavior patterns and intrapersonal processes originating within the individual” (Burger, 1997)

· “an individual’s unique constellation of psychological traits and states” (C&S)

What is personality?

-the relatively distinctive and stable patterns of behavior that characterize an individual and their reactions to the environment

-3 common components: focus on individual diff's; the individual diff's are relatively stable; usually refer to intrapersonal processes of emotions, motivations and cognitive processes

Traits:

· any relatively enduring characteristic of an individual that distinguishes that person from another

o Examples: extraversion, introversion, openness, conscientiousness

States

· a temporary, or transient presentation of some personality trait or disposition

o Examples: anxious, calm, fearful, embarrassed, happy, sad etc.

Types

· defined as unique sets of traits and states that are similar in pattern to an identified category of personality within a taxonomy of personalities

· Not all typologies are based on psychological theories with an empirical basis

the four temperaments

Sanguine:

Choleric

Melancholic

Phlegmatic

The Four Temperaments

Galen - 190 AD

Sanguine - blood, spring, air

Choleric - yellow bile, summer, fire

Melancholic - black bile, autumn, earth

Phlegmatic - phlegm, winter, water

Sanguine

· temperament of the blood, season of spring and the element of air

· associated with functioning of the liver (blood) makes a person optimistic and cheerful

Choleric

Yellow bile, summer, fire

· associated with the spleen, easily angered, bad tempered and controlling

Melancholic

black bile, autumn, earth

· associated with the gall bladder; perfectionistic, depressive

Phlegmatic

·phlegm, winter, water

associated with the lungs, calm and unemotional

Six (modern) approaches to Personality

1. Psychoanalytic

2. Trait

3. Biological

4. Humanistic

5. Behavioral/social learning

6. Cognitive

Psychoanalytic Approach

· Unconscious minds are largely responsible for important differences in behavior styles

Trait Approach

· people can be described along a continuum of various personality characteristics

Biological approach

· Inherited predispositions and physiological processes explain individual differences

Humanistic Approach

· keys to individual differences are degree of personal responsibility and self acceptance

Behavioral/ social learning Approach

· consistent differences are the result of conditioning and expectations

Cognitive approach

· differences in the way people process information explain individual differences in behavior

How are Personality assessments used?

· Employment matching

· Adjustment issues for decisions about military service

· Academic opportunities

· Employment mobility

· Diagnoses, or degree of impact from some trauma

· Inform treatments

· Research and validation of theory

How are Personality Assessments Used?

-Employment matching

-Military service

-Academic opportunities

-Employment mobility

-Diagnoses

-Inform treatments

-Research and Validation of theory

Assessment Methods

· Interviews

· Self report to written questions

· Card sorts (q-sort)

· Responses to ambiguous stimuli

· Interviews or responses of friends, family, spouse, teacher, coworkers, etc.

· Case histories

· Ratings by judges or experts

Assessment Methods

-Interviews

-Self-report

-Card sorts (Q)

-Responses to ambiguous stimuli

-Interviews or responses of friends, family

-Case histories

-Ratings by judges or experts

Objective Tests:

· Paper and Pencil or computer aided

o Choose a response from options that represent various characteristics of personality

· Procedures for scoring require little judgment

· Allows for implementation of a variety of validity indices

Advantages of Objective tests

o can be answered and scored quickly (scored reliably)

o Breadth of content

o Can be administered to groups or individuals

Advantages of objective tests

-can be answered quickly

-can be administered by computer

-can be scored quickly and reliably

-can be administered in groups or individually

-procedures for scoring require little interpretation

-allows for implementation of a variety of validity indices

Disadvantages of objective tests

o Assuming honesty and capacity/insight to answer questions accurately.

Psyche

Greek word for ‘the mind’

Metric

Latin word for 'measurement'

Psychometrics

the science of measuring psychological phenomena

Psychometrics

Measuring the mind

Fundamental goal of psychological measurement

to predict behavior

Two Types of Measurement

scaling

classification

Tests are defined by what (and how)

they measure

Content

Format

Administration procedures

Scoring and interpretation procedures

Psychometric quality – what makes an effective test?

Adequate norms

Was the test developed using samples similar to the people taking the test?

Origins of Testing: Early Psychology: Darwin

On the Origin of Species raised the issue of individual differences

• Provides theoretical basis for animal models in medical and psychological testing

Origins of Testing: Early Psychology: Wilhelm Wundt

German medical doctor who studied how individuals were similar instead of different (Leipzig School)

• Described human abilities with respect to reaction time, perception and attention span

Testing in the U.S.

U.S. military developed Army Alpha & Beta during WWI (Yerkes & Brigham)

• Used to identify intellectual abilities of recruits and personality risk factors for “shell shock”

In 1939 the Wechsler-Bellevue Intelligence Scale (now WAIS) was developed for adults.

• Later other versions were developed for use with preschool (WPPSI) and school-age children (WISC).

History suggests cultural bias in testing can have an adverse impact:

• Immigration restrictions

• Forced sterilization

Federal Testing Legislation

Public interest in educational testing sparked by Sputnik (1957)

National Defense Education Act (1958) provided money for aptitude testing in attempt to identify gifted children

Increased use of tests led to concerns about value and effect of psychological testing on students

Discrete (scale)

– categorical labels or integers, no meaningful middle grounds between categories

Discrete scale

-categorical labels or integers, no meaningful middle grounds between categories

-e.g. 1-single, 2-married

Continuous (scale)

numbers do not represent categories, middle ground between units possible

Scales of measurement

Scales, or levels, of measurement help determine what statistical analyses are appropriate

Enable test users to make accurate score interpretations

Scales of Measurement

Scales, or levels, of measurement help determine what statistical analyses are appropriate

Scales of measurement

-scales, or levels, of measurement help determine what statistical analyses are appropriate

-enable test users to make accurate score interpretations

-four levels:

-nominal

-ordinal

-interval

-ratio

**NOIR**

Nominal Scale

Nominal (or Naming) Level

Lowest level of measurement

Ordering is not important, only the label attached to designate a mutually exclusive and exhaustive category

Examples:

• Medical Diagnoses

• Gender

• Political party affiliation

Nominal Scale

-Lowest level

-Order not important, only label attached to designate a mutually exclusive and exhaustive category

Nominal scale

-nominal (AKA naming) level

-lowest level of measurement

-ordering is not important, only the label attached to designate a mutually exclusive and exhaustive category

-examples:

-medical diagnoses

-gender

-political party affiliation

Nominal Scale Statistics

If numbers are assigned, they cannot be meaningfully manipulated mathematically

Appropriate arithmetic operations:

• counting

• proportions

• percentages

• chi-square tests

Nominal Scale Statistics

If numbers are assigned, they cannot be meaningfully manipulated mathematically

Ordinal Scale

Individuals or things are ranked or ordered on the basis of some criteria

Intervals between ranks are not consistent

Examples:

• Grade level

• Ranking from shortest to tallest

• Gold, Silver, Bronze

• Movie sequels

Ordinal Scale

-Individuals or things are ranked or ordered on the basis of some criteria

-Intervals between ranks are not consistent

Ordinal Scale Statistics

Values imply nothing about magnitude of differences from one level to the next

Numbers are not units of measurement

Statistical operations are limited to nonparametric tests

Ordinal scale statistics

-values imply nothing about magnitude of differences from one level to the next

-the numbers do not indicate units of measurement

-no absolute zero point; the ways in which data form ordinal scales are limited

Interval Scale

Numbering includes order, but intervals between each successive level represent equal differences

No absolute zero point in the scale

Examples:

• Fahrenheit Scale

• Intelligence Test Scores

Interval Scale

-Numbering includes order, but intervals between each successive level represent equal differences

-No absolute zero point in the scale

Interval Scale Statistics

Because of equal intervals between values, some mathematical operations are meaningfully appropriate:

• Addition and Subtraction

• Multiplication and Division are not appropriate because there is no true zero

• Statistical tests based on mean scores and/or variance

Interval Scale Statistics

Because intervals are equal, some mathematic operations are meaningfully appropriate

Interval scale statistics

-because of equal intervals between values some mathematic operations are meaningfully appropriate:

-addition and subtraction, statistical tests based on mean scores and/or variance

Ratio Scale

Includes ordering, equal intervals AND an absolute zero

Examples:

• length and weight

• Kelvin scale

All mathematical operations can be meaningfully performed

Ratio Scale

-Includes ordering, equal intervals, and an absolute zero

-All mathematical operations are meaningfully performed

Describing Data

Three methods:

• Pictorially

• Measures of Central Tendency

• Measures of Variability (or Dispersion)

Error

Deviation of some measurement from the true standing of an individual on some characteristic

Error

the part of the score that deviates from the true standing on the construct

Error

-Deviation of some measurement from the true standing

-Influences central tendency and variability

Many sources of error:

• effects of the environment

• precision of the measurement device

• confounding variables

Error influences estimates of both central tendency and variability

Sources of Error

Errors in Test Construction

Errors in Test Administration

Errors in Test Scoring and Interpretation

Central Tendency

measures are used to describe the typical response seen in a sample of observations

Range

Range is the difference between the highest and lowest scores

Is sensitive to outliers

A: 2, 5, 7, 7, 8, 8, 10, 12, 15, 17, 20

Range = 18

B: 2, 2, 2, 3, 4, 4, 5, 5, 5, 6, 6, 20

Range = 18

Range

-the difference between the highest and lowest scores

-sensitive to outliers
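The two samples from the card above can be checked in a couple of lines (the helper name is ours):

```python
def score_range(scores):
    """Range = highest score minus lowest score."""
    return max(scores) - min(scores)

# Both samples from the notes have the same range despite very
# different distributions -- the range is sensitive to the outlier 20:
a = [2, 5, 7, 7, 8, 8, 10, 12, 15, 17, 20]
b = [2, 2, 2, 3, 4, 4, 5, 5, 5, 6, 6, 20]
print(score_range(a))  # -> 18
print(score_range(b))  # -> 18
```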

Deviation Scores

Measure of how far the raw score is from the mean of its distribution (X − μ)

Variance

is the average of the sum of the squared deviations of each score from the mean

Variance

-the average of the sum of the squared deviations of each score from the mean

s² = [1/(n−1)] Σ(Xᵢ − X̄)²

standard deviation

is the square root of the variance

• is expressed in the same units of measurement as the original scores

Standard deviation

Square root of the variance

Standard deviation

-roughly, the typical deviation of each score from the mean

-the standard deviation is the square root of the variance

-expressed in the same units of measurement as the original scores
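The sample variance formula s² = [1/(n−1)] Σ(Xᵢ − X̄)² can be checked against Python's statistics module, which uses the same n−1 denominator (the example scores are ours):

```python
import statistics

scores = [2, 4, 4, 4, 5, 5, 7, 9]

# Sample variance computed directly from the formula:
mean = sum(scores) / len(scores)
s2 = sum((x - mean) ** 2 for x in scores) / (len(scores) - 1)

print(s2)                        # matches statistics.variance(scores)
print(statistics.stdev(scores))  # square root of the variance,
                                 # same units as the original scores
```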

Positive skew:

only a few extremely high scores and many low scores

Positive skew

-only a few extremely high scores and many low scores

-tail goes to the right

Negative skew:

only a few extremely low scores and many high scores

Negative skew

-only a few extremely low scores and many high ones

-tail goes to the left

Quartiles

Dividing points between the four quarters of a distribution of test scores

Quartiles

-Dividing points between the four quarters of a distribution

-Interquartile = difference between Q3 and Q1

-Semi-interquartile = divide by 2

Interquartile range

is equal to the difference between Q3 and Q1

Semi-interquartile range

equals the interquartile range divided by 2
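A quick sketch with the standard library (the sample data are ours; note that different quantile methods give slightly different cut points):

```python
import statistics

scores = [1, 2, 3, 4, 5, 6, 7, 8]

# statistics.quantiles with n=4 returns the three cut points Q1, Q2, Q3
q1, q2, q3 = statistics.quantiles(scores, n=4)

iqr = q3 - q1        # interquartile range
semi_iqr = iqr / 2   # semi-interquartile range
print(q1, q2, q3)
print(iqr, semi_iqr)
```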

Z-Scores and Percentile Ranks

Z-scores can be used to calculate percentiles when raw scores have a normal distribution

When used in conjunction with a Z-Table, the z-score reveals the area of the normal distribution below the score in question

Z-scores and percentile ranks

-z-scores can be used to calculate percentiles when raw scores have a normal distribution

-when used in conjunction w/ a Z-table, the z-score reveals the area of the normal distribution below the score in question

-the Z-table indicates the proportion of the total number of scores that fall into a certain range of z-scores
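As a sketch, the standard normal CDF plays the role of the Z-table (the function name and the IQ-style mean/SD are our illustrative choices):

```python
from statistics import NormalDist

def percentile_rank(raw, mean, sd):
    """Convert a raw score to a z-score, then use the standard normal
    CDF (a programmatic Z-table) to get the area below that score."""
    z = (raw - mean) / sd
    return NormalDist().cdf(z) * 100

# A score one SD above the mean on an IQ-style scale (mean 100, SD 15):
print(percentile_rank(115, 100, 15))  # ≈ 84.13th percentile
print(percentile_rank(100, 100, 15))  # exactly the 50th percentile
```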

Percentile Pros and Cons

Advantages: can be used to interpret performance in terms of various groups and are easily understood

Disadvantages: units are not equal on all parts of the scale

• Percentiles are an ordinal scale

• Differences between individuals near the middle are magnified and differences at the extremes are compressed

Percentile Pros and Cons

Advantage:

Interpret performance in various groups and easily understood

Disadvantages:

Units are not equal on all parts of the scale

-Differences between individuals near the middle are magnified while extremes are compressed


Interpreting Percentiles

A percentile difference of 10 near the middle of the group often represents a smaller difference in performance than a difference of 10 near the tails

In terms of skills, a difference of a few percentile points near the tails means more change has taken place than the same size difference near the middle of the group

Positive Relations (correlation)

Strong Relation (r = .7 or higher)

• Height and Weight

• Age and Job Experience

Moderate/Weak Relation (r = .4 or lower)

• Chemotherapy and Cancer remission

• GRE Scores and Grad student success

R² (r squared)

proportion of variance shared by variables

Negative Relations

Strong Negative (r = -.7)

• Political Affiliation and Willingness to vote for another party’s candidate

Moderate/Weak Negative Relation (r = -.4)

• Brushing teeth and cavities


Negative relations

-strong negative: (r=-.7)

-moderate/weak negative: (r = -.4)

Spearman’s Rho (ρ)

Used if sample sizes are small OR if ordinal scale data is used

Simple Linear Regression

is used when one variable is used to predict values of another

Describes the relationship between one Independent Variable (X) and one Dependent Variable (Y)

Least-Squares approach is used to minimize the differences between observed and predicted scores

Regression line is the straight line which comes closest to the greatest number of points on the scatterplot of X and Y

Simple Linear Regression

-Describes relationship between one independent and one dependent variable

-Least-Squares approach is used to minimize the difference between observed and predicted scores

-Regression line is the straight line which comes closest to the greatest number of points on the scatterplot of X and Y
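A minimal sketch of the least-squares fit described above (the function name and example data are ours):

```python
def least_squares(xs, ys):
    """Fit y = a + b*x by least squares: slope
    b = sum((x-mx)(y-my)) / sum((x-mx)^2), intercept a = my - b*mx,
    which minimizes the sum of squared differences between observed
    and predicted scores."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
        sum((x - mx) ** 2 for x in xs)
    a = my - b * mx
    return a, b

# Perfectly linear data recovers the line exactly:
a, b = least_squares([1, 2, 3, 4], [3, 5, 7, 9])  # y = 1 + 2x
print(a, b)  # -> 1.0 2.0
```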

Logistic Regression

is used when the variable being predicted is dichotomous (e.g., gender)

Issues with Prediction

How do we deal with the fact that the predictors (X) and the variable to be predicted (Y) are often on different scales of measurement?

Prediction technique must take into account both the scales of measurement and the correlation between the two variables

Linear regression does just that!

Issues with Prediction

-Must take into account both of the scales of measurement and the correlation between the two variables

Standard Error of the Estimate (SE)

Indicates magnitude of errors in estimation

Higher correlations produce smaller SE

Lower correlations produce larger SE
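One common textbook form of this relationship, not spelled out in the card itself, is SE = SD_y · sqrt(1 − r²); the sketch below (names and numbers ours) shows why higher correlations shrink the SE:

```python
import math

def standard_error_of_estimate(sd_y, r):
    """Common textbook form: SE = SD_y * sqrt(1 - r^2).
    As |r| grows, the term under the root shrinks, so higher
    correlations produce smaller estimation errors."""
    return sd_y * math.sqrt(1 - r ** 2)

sd_y = 10.0
print(standard_error_of_estimate(sd_y, 0.9))  # small SE, strong correlation
print(standard_error_of_estimate(sd_y, 0.2))  # larger SE, weak correlation
```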

Norm-Referenced Evaluation

A way of interpreting test scores by comparing an individual’s results to the scores of a group of test takers

Interpretation is relative

Alternative is Criterion-Referenced Evaluation

Norm-referenced Evaluation

-A way of interpreting test scores by comparing an individual's results to the scores of a group of test takers

-Interpretation is relative

Norm-referenced evaluation

-a way of interpreting test scores by comparing an individual's results to the scores of a group of test takers

-interpretation is relative

-percentiles

-"top 5%"

Normative Sample

A group of people whose performance on a particular test is analyzed for reference in evaluating the performance of individual test-takers.

Sample must be representative or typical of the intended population of interest

Inadequate norms make it difficult to make proper interpretations

Sampling and Norms

Test administered to members of the sample under the same conditions

• Environment

• Instructions

• Time restrictions

• Et cetera

Developers calculate descriptive statistics

Provide precise description of sample

Normative Sample

-A group of people whose performance on a particular test is analyzed for reference in evaluating the performance of individual test-takers

-Must be representative of intended population

Normative sample

-the group of people whose performance on a particular test is analyzed for reference in evaluating the performance of individual test-takers

-a normative sample must be representative or typical of the intended population of interest

-diff's need to be proportionately represented in the sample

-e.g. gender, race/ethnicity

Types of Samples

Stratified

Random

Purposive

Convenience

Types of Samples

Random - each individual has equal chance of being included

Purposive - Arbitrarily selecting a sample because it represents some population

Convenience - a sample that is convenient or available for use

Stratified Samples

Sampling individuals from subgroups in the population in the same proportion as the population they are part of

Best when population includes subgroups that differ on some potentially meaningful characteristic

Helps prevent sampling bias

Stratified Samples

-Sampling individuals from subgroups in the population in the same proportion as the population they are part of

-Best when subgroups differ on meaningful characteristic

-Helps prevent sampling bias

Stratified samples

-sampling individuals from subgroups in the population in the same proportion as the population they are part of

-provides greater precision than a simple random sample of the same size

-can guard against the "unrepresentative" sample

Random samples

each individual from the population has an equal chance of being included in the sample

Random sampling

-each individual from the population has an equal chance of being included in the sample

-true random sampling is very rare in practice (time & $, ethics, self selection)

-contrast w/ random assignment (random assignment of participants in the selected sample to different experimental conditions)

Purposive Samples

arbitrarily selecting a sample because it is believed to represent some population

Convenience Samples

a sample that is convenient or available for use

Convenience sampling

-a sample that is convenient or available for use

-ISU psychology participant pool

Types of Norms

Percentiles

Age

Grade

National

Anchor

Subgroup

Local

Age and Grade Norms

Average performance of test-takers at various ages/grades

Scores do not represent equal units of measurement

Scores often (incorrectly) used as evaluative standards

Not effective with very young or adult test-takers

Age and grade norms

-average performance of different samples of test-takers at various ages/grades

-scores often used as evaluative standards for one's performance on a test (e.g. below average, average)

-concept of "mental age"

National and Anchor Norms

National norms are derived from ‘representative’ samples of a country

• Often developed using stratified sampling methods

Anchor norms indicate how test scores for a measure compare to the norms for other measures of the same construct

• Calculated using percentile scores

National & Anchor Norms

National norms are derived from 'representative' samples (often stratified)

Anchor norms indicate how test scores for a measure compare to the norms for other measures of the same construct

Sub-Group and Local Norms

Sub-Group norms are created by selecting sub-groups from the normative sample

• Limited by the sampling techniques used to create the normative sample

Local Norms can also be developed for a measure for use in a specific area

• Most useful in cases where the national norms may not represent the local population

Sub-group & Local norms

-Sub-group norms are created by selecting sub-groups from the normative sample

-Local norms can be developed for a measure for use in a specific area

Norm-Referenced (NRT)

Interpretation is based on an individual’s relative standing in some known group

Criterion-Referenced (CRT):

Interpretation is based on measuring an individual’s skill level in relation to a clearly specified standard (i.e., criterion)

Limitations

It is important to understand that normed scores do not represent standards or goals to be achieved by students

• Norms simply describe typical or normal performance

Criterion-referenced scores may have little or no application at the upper end of the knowledge/skill continuum

• More difficult to make proper comparisons between test takers

Types of Reliability

Test-Retest Reliability

Parallel or Alternate Forms Reliability

Inter-Rater or Inter-Scorer Reliability

Split-Half and other Internal Consistency measures

Choice of reliability measure depends on the test’s design and other logistic considerations

Parallel Forms – Two different versions of a test that measure the same construct

• Each form has the same mean and variance

Alternate Forms

Two different versions of a test that measure the same construct

• Tests do not meet the equal means and variances criterion

Inter-Rater or Inter-Scorer

Represents the degree of agreement (consistency) between multiple scorers

• or judges, raters, observers, etc.

Calculated with Pearson r

• or Spearman rho depending on the scale

Proper training procedures and standardized scoring criteria are needed to produce consistent results

Spearman-Brown

This can be used to estimate the reliability of a test that has been shortened or lengthened

Generally, reliability increases as the length of the test increases, assuming the additional items are of good quality

Note: n is equal to the number of items in the version you want to know the reliability for, divided by the number of items in the version you have the reliability for
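The formula itself is not written out in the card above; the standard Spearman-Brown prophecy form is r_new = n·r / (1 + (n − 1)·r), sketched here (function name ours):

```python
def spearman_brown(r, n):
    """Spearman-Brown prophecy formula: predicted reliability after
    changing test length by a factor of n, where n = (items in the new
    version) / (items in the version whose reliability r is known)."""
    return (n * r) / (1 + (n - 1) * r)

# Doubling a test with reliability .60 (n = 2):
print(spearman_brown(0.60, 2))    # ≈ 0.75
# Halving it instead (n = 0.5) lowers the estimate:
print(spearman_brown(0.60, 0.5))
```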

Kuder-Richardson Formulas

Two types: KR-20 & KR-21

Statistic of choice with dichotomous items (e.g., “yes” or “no”)

When items are more heterogeneous, KR-20 (r_KR20) yields a lower estimate of r

Coefficient Alpha

First developed by Lee J. Cronbach

Can be interpreted as the mean of all possible split-half correlations, corrected by the Spearman-Brown formula

Ranges from 0 to 1 with values closer to 1 indicating greater reliability

Most popular reliability coefficient in psychological research

Coefficient Alpha

-can be interpreted as the mean of all possible split-half correlations, corrected by the Spearman-Brown formula

-Ranges from 0 to 1 with values closer to 1 indicating greater reliability

-"generally acceptable" values are .70-.90

-coefficient alpha above .90 may be "too high;" indicating redundancy
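The usual computational form, α = k/(k−1) · (1 − Σσ²_item / σ²_total), is standard but not spelled out in these notes; a minimal sketch (function name and example data ours):

```python
import statistics

def cronbach_alpha(item_scores):
    """item_scores: one list of scores per item (all the same length).
    alpha = k/(k-1) * (1 - sum(item variances) / variance(total scores))."""
    k = len(item_scores)
    item_vars = sum(statistics.variance(item) for item in item_scores)
    totals = [sum(person) for person in zip(*item_scores)]
    return (k / (k - 1)) * (1 - item_vars / statistics.variance(totals))

# Three items answered by five people; highly consistent items
# push alpha toward 1:
items = [
    [2, 3, 4, 4, 5],
    [2, 3, 4, 5, 5],
    [1, 3, 4, 4, 5],
]
print(cronbach_alpha(items))
```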

Homogeneous vs. Heterogeneous Test Construction

If a test measures only one construct, then the content of the test is homogeneous

If multiple constructs are measured, the test content is heterogeneous

• With a heterogeneous test, a global measure of internal consistency will under-estimate reliability (compared to test-retest)

• But internal consistency measures can be calculated separately for each construct

Dynamic vs. Static

Static traits do not change much

Dynamic traits or states are those constructs that can change over time

• May be fairly stable (not as easily changeable or more durable over time)

• But may also have quick changes from one moment to another

Restriction of Range

Sampling procedures may result in a restricted range of scores

Test difficulty may also result in a restricted range of scores

If the range is restricted, the reliability coefficients may not reflect the true population coefficient

• Just like what happens with a correlation coefficient

Applications of Generalizability Theory

All possible scores from all possible combinations of environment facets is called the universe score

This provides more practical information to be used in making decisions:

• In what situations will the test be reliable?

• What are the facets that most impact test reliability?

Evidential Validity

• Face

• Content

• Criterion (Concurrent, Predictive)

• Construct (Convergent, Discriminant)

• Relevance and Utility

Consequential Validity

Appropriateness of use determined by consequences of use

Criterion

the standard against which a test or a test score is evaluated

No strict rules exist about what can be used, so it could be just about anything

A good criterion is generally:

• Relevant

• Uncontaminated

• Something that can be measured reliably

Criterion

-a standard on which a judgment or decision is made

Validity Coefficient

typically a correlation coefficient between scores on the test and some criterion measure (r_xy)

Step 1: Test Conceptualization

Could be stimulated by anything:

• societal trends

• personal experience

• a gap in the literature

• need for a new tool to assess a construct of interest

Example: The International Personality Item Pool

Step 2: Test Construction

Scaling is the process of selecting rules for assigning numbers to measurement of varying amounts of some trait, attribute, or characteristic

No best way to assign numbers for all types of traits, attributes, or characteristics

• But there may be an optimal method for the construct you want to measure!

Likert (& Likert-type)

Test-taker presented with 5 alternative responses on some continuum (may also use 7-point scale)

Extremes assigned scores of 1 and 5

• Are generally reliable

• Result in ordinal-level data

• Summative scale

Test Construction: Choosing Your Item Type

Selected-response items generally take less time to answer and are often used when breadth of knowledge is being assessed.

Constructed-response items are more time consuming to answer and are often used to assess depth of knowledge.

Selected-response item scoring is more objective (therefore, more reliable) than the scoring of constructed-response items.

ITEM POOL

is a reservoir of potential items that may or may not be used on a test

• For good content validity, the test must be representative of a pool with good content coverage

Stage 3: Test Tryout

After we have designed a test and have developed a pool of items, we need to determine which are the best

Should use participants and conditions that match the test’s intended use

Rule of thumb is that initial studies should use five or more participants for each item in the test

Guessing

is only an issue for tests where a “correct answer” exists

Faking

can be an issue with attitudes

• Faking Good: Positive self-presentation

• Faking Bad: Malingering or trying to create a less favorable impression

Step 4: Item Analysis

A good test is made up of good items

• Good items are reliable (consistent)

• Good items are valid (measure what they are supposed to measure)

• Just like a good test!

Good items also help discriminate between test-takers on the basis of some attribute.

Item Analysis is used to differentiate good items from bad items

Step 5: Test Revision

Modifying the test stimuli, administration, etc., on the basis of either quantitative or qualitative item analysis.

Each item will have strengths and weaknesses

• Goal is to balance these strengths and weaknesses for the intended use and population of the test.

What is Intelligence?

Intelligence is a very general mental capability that involves the ability to reason, plan, solve problems, think abstractly, comprehend complex ideas, learn quickly and learn from experience.

Reflects a broader and deeper capability for comprehending our surroundings than simple academic learning

Intelligence

-Very general mental capacity that involves the ability to reason, plan, solve problems, think abstractly, comprehend complex ideas, learn quickly and learn from experience

Key Historical Figures (1)

Jean Esquirol was the first scientist to make a distinction between mental incapacity and mental illness, in 1838

Francis Galton wrote Hereditary Genius in 1869 and set forth the idea of inherited mental characteristics

• Set up anthropometric labs in 1884 to test people (head size, grip strength, perceptual discrimination, etc.)

Key Historical Figures (2)

James McKeen Cattell coined the term

“mental test” in 1890 and described 50

measures that primarily assessed sensory

and motor abilities

Alfred Binet developed a diagnostic test in

1905 to identify mental retardation in

school aged children

• Differential abilities at different ages, the

idea of “mental age” vs. “physical age”

Piaget’s Developmental Theory

Intelligence is a product of complex

interactions of biology and environment

• result in a reorganization of biological

and psychological structures allowing for

more flexible interactions in our

environment

Development in stages associated with changes in schema complexity

Schemas are cognitive frameworks for how

we interpret the environment

Piaget's Developmental Theory

-Intelligence is a product of complex interactions of biology and environment

-Development in stages associated with changes in schema complexity

Assimilation

fitting information into existing schemas

Accommodation:

existing schemas are modified to accommodate new information and experiences

Spearman’s Two-Factor Theory

g factor (general intelligence): involves mental activities such as deductive reasoning, comprehension, and hypothesis testing

specific factors:

• Behaviors influenced by these complex mental activities (e.g., recognition, recall, visual-motor abilities, etc.)

Spearman's Two-Factor Theory

-g-factor (general intelligence): involves mental activities such as deductive reasoning, etc.

-specific factors: behaviors influenced by these complex mental activities

Thurstone’s Multidimensional Theory of Intelligence

Did not believe in g – used factor analysis to state that there are several primary mental abilities, including:

• Verbal ability, perceptual speed, inductive reasoning, rote memory, deductive reasoning, spatial, etc.

Later research showed modest correlations

between these abilities

Thurstone's Multidimensional Theory of Intelligence

-Several primary mental abilities: Verbal ability, perceptual speed, etc.

-(Note: "3 components: operations, products, contents" and "120 factors" describe Guilford's Structure of Intellect model, not Thurstone's theory)

Fluid Intelligence

Nonverbal, mental efficiency, adaptive

and new learning capabilities, related to

mental operations and processes

• Figural analysis, number and letter

series, concept formation, general

reasoning

More dependent on cortical and lower

cortical regions

Increases until adolescence and then

gradually decreases throughout life

Crystallized Intelligence

acquired skills and knowledge, well established cognitive functions

influenced by formal and informal learning

Campione’s Model of

Information Processing Theory

Focus on ways individuals mentally

represent and process information.

• Includes Architectural Factors

• Executive Processes

Campione's Model of Information Processing Theory

-Focus on ways individuals mentally represent and process information

Architectural Factors

Bio-genetically based properties

necessary for information processes

linked to perceptual skills.

• Capacity

• Durability

• Efficiency

Relatively impervious to improvement by

environment.

Architectural Factors

-Bio-genetically based properties necessary for information processes linked to perceptual skills

-Relatively impervious to improvement by environment

Executive Processes

Environmentally learned components that

guide problem solving

• Knowledge Base

• Schemas

• Control Processes

• Metacognitions

Componential/Analytic

Information-processing components

• Metacomponents: higher-order processes

used in planning, monitoring, and

evaluating the performance on a task.

• Performance: strategies employed to

execute a task

• Knowledge Acquisition: processes used

in learning new things (e.g., selective

encoding)

Experiential/Creative

Involves the individual’s knowledge of both

internal and external environments

• Involves how one copes with

tasks/situations in these environments,

which are dependent upon experience

• New situations are novel and require

novel strategies

• With more experience those strategies

become automatic

Contextual

How an individual adapts to, selects, and

shapes his/her environment

When a strategy fails:

• adapt to the new environment

• select a new environment where the

strategy still works

• shape the environment to fit with the

strategy

Strategies may be culture bound

Contextual

-How an individual adapts to, selects, and shapes her environment

Nature vs. Nurture

Intelligence is most likely the result of an

interaction between genetic endowment and

environmental influences

Historically there was a strong adherence to

genetic pre-determinism

• Galton’s Hereditary Genius, Goddard’s

work, Eugenics movement, Immigration

restrictions

Nature vs. Nurture

-Intelligence is most likely the result of an interaction between genetic endowment and environmental influences

-Historically, strong adherence to genetic pre-determinism

Support for Nature

Heritability Index is approximately .50

(e.g., Plomin & DeFries, 1980)

Twins studies support genetic component in

intelligence based on degree of similarity

Heritability indices actually increase with

age, but does this mean that genes have

more influence later in life?

Support for Nature

-Heritability index is approximately .50

-Twin studies support genetic component in intelligence

Support for Nurture

-Perinatal factors associated with lower IQ

-Malnutrition associated with lower functioning

-Correlation between IQ and school

-Flynn Effect - Increase of about 3 IQ points per decade

Support for Nature

-twins raised apart support genetic component in intelligence based on degree of similarity

-though not as similar as if raised together

-children adopted from mothers with higher IQs tend to have higher IQs, irrespective of adoption family's SES

-though those in higher SES have higher IQs

Support for Nurture

Perinatal factors resulting in low birth weights tend to be associated with lower IQ

Malnutrition during first 40 weeks associated

with lower intellectual functioning

SES correlates about r = .33, strongest effects

seen with persistent poverty

Significant correlations between IQ and years of

school completed, truancy.

Flynn Effect – increase of about 3 IQ points per

decade from 1940 to 1990

Stability of Intelligence

Difficult to predict intelligence later in life

from infant measures

Some support for intelligence scores

becoming more stable when children reach

school age

May experience some decline late in life

due to neurological deterioration.

Stability of Intelligence

-Difficult to predict later in life

-Stable during school years

-May decline late in life

Educational Assessment Factors

Age of test-taker

Purpose of assessment

Choice of response format

High vs. Low stakes

Impact on ‘teaching’

Tools of Preschool Assessment

Screening instruments used to identify

children who are “at risk”

• Documented difficulties in one or more

psychological, social, or academic areas

• Format is primarily observer-based

(checklists or rating scales)

Other measures and psychological tests

Checklist

questionnaire provides a list of

behaviors, thoughts, events, etc., each is

marked if it is present

• Can be filled out by individual or an

evaluator

• Achenbach Child Behavior Checklist

Rating Scale

Evaluator fills out a form

with a list of characteristics, provides scores

indicating relative standing

• Connors Rating Scale

Other Preschool Measures

Temperament

Family Environment

Parenting/Caregiving

Childhood Sexual abuse

Personality measures

Psychological Tests (for young children)

Before age 2 assessment is based on the

presence of developmental milestones

Tests must be specifically designed for the

age-range being assessed

• Colorful, engaging materials

• Easy to administer

• Clear start/discontinue rules

• Allow for behavioral observation

Performance Assessment

is a general term for tasks that are more complex than traditional survey items

Portfolio assessment

involves the review

of a collection of work samples

Portfolio Assessment

Involves the review of a collection of samples

Authentic Assessment

emphasis on

learning that can be transferred to real

world setting

Diagnostic Tests

Evaluative tests are used to make judgments

(ex. S.A.T.)

Diagnostic tests are used to assess level of

functioning for remedial purposes

• Reading Tests

• Math Tests

• Other Tests

Diagnostic Tests

-Evaluative tests are used to make judgments

-Assess level of functioning for remedial purposes

Psychoeducational Test Batteries

Kaufman Assessment Battery for Children

(K-ABC)

The Differential Ability Scales (DAS)

The Woodcock-Johnson III (WJ-III)

Metropolitan Readiness Test (MRT)

Achievement tests

are designed to measure

what students have learned

• Curriculum-Based Assessment

• Curriculum-Based Measurement

Aptitude tests

focus more on the capacity to

learn, often claim to measure things that

cannot be taught or coached

• sometimes called prognostic tests

K-ABC

Intelligence test that assesses two basic

types of information-processing skills

K-ABC

-Intelligence test that assesses two basic types of info-processing skills

-Sequential Skills: Following a set of rules, basic math operations, grammar

-Simultaneous Skills: Recognizing letters and numbers, interpreting overall meanings

Sequential Skills

Following a set of rules,

basic mathematical operations, memorizing

lists of spelling words, rules of grammar

Simultaneous Skills

Recognizing letters

and numbers, interpreting the overall

meaning of pictures, comprehension of

scientific and mathematical principles

Differential Ability Scales

Based on the British Ability Scales, but

normed on U.S. students ages 2-18

Assumes a developmental, hierarchical

model of intelligence

• Designers do not use term intelligence

• Assumes that certain abilities are only

present at certain ages

Factor structure of the measure changes

with the age of test takers

Woodcock-Johnson-III

Two batteries (Achievement and Aptitude)

based on the Cattell-Horn-Carroll theory of

cognitive abilities

Regular and extended batteries are available

when more diagnostic information is needed

Age-based norms available for ages 24

months to 19 years

Woodcock-Johnson-III

-Two batteries (Achievement and Aptitude) based on the Cattell-Horn-Carroll theory of cognitive abilities

High-Stakes Testing

A test is ‘High-Stakes’ when important

decisions are made based on test results

• Students in poor school districts tend to

struggle more on these tests

• Leads to increased grade retention and

dropping out

• Teaching to the test effects

• Drives out good teachers

High-Stakes Testing

-When important decisions are made based on results

-Students in poor districts tend to struggle

-Leads to increased grade retention and dropping out

-Teaching to the test effects

-Drives out good teachers

Physical Tests

Certain types of jobs require physical skills

and capacities that can only be evaluated

using physical tests

• Typically criterion referenced

Drug testing is becoming more common

across a wide range of occupations

• Performance enhancing vs. ‘recreational’

Physical Tests

-Certain types of jobs require physical skills and capacities that can only be evaluated this way

-Typically criterion referenced

-Drug testing is becoming more common

Ability and Aptitude

GATB – General Aptitude Test Battery

• Cognitive (General Learning Ability,

Verbal and Numerical Aptitude)

• Perceptual (Spatial Aptitude, Form and

Clerical Perception)

• Psychomotor (Motor coordination, Finger and Manual Dexterity)

ASVAB – Armed Services Vocational

Aptitude Battery

Ability and Aptitude

GATB - General Aptitude Test Battery

>Cognitive, Perceptual, Psychomotor

-ASVAB - Armed Services Vocational Aptitude Battery

Job Satisfaction

• Satisfied workers are more productive,

more consistent, and complain less

Organizational commitment

Committed workers make stronger

contributions to the company

Organizational culture

Measures of climate can help improve

working conditions

Consumer Psychology

Branch of social psychology that deals with

why people buy products:

• Does a market exist for this new product?

• How can we make people more aware of

this product or its uses?

• How can we persuade people to buy this

product?

Consumer Psychology

-Branch of social psychology that deals with why people buy products

Occupational Scales

Originally developed in 1927 for the Strong Vocational Interest Blank; now includes 211 scales representing 109 occupations

Based on data collected from incumbents

• Provide information about how an

individual’s responses compare with

those of people actually in and satisfied

with a particular occupation

Occupational Scale

Based on data collected from incumbents

-Provides info about how an individual's responses compare with those of people actually in and satisfied with a particular occupation

Testing vs. Assessment

Measuring variables to obtain a sample of behavior vs. integrating data for making an evaluation

Chinese civil service exams

Origin of psychological testing

On the Origin of Species

Raised issue of individual differences

Scales and Descriptive Statistics

Data must be measured on an interval or ratio scale for computations to be valid

Ordinal scale = use median but not mean
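A quick sketch of the ordinal-scale point, using invented Likert-style ratings (1 = strongly disagree … 5 = strongly agree): the categories are ordered, but the distances between them are not guaranteed equal, so the median is the defensible summary.

```python
from statistics import median, mean

# Invented ordinal (Likert) ratings: ordered categories, unequal spacing
ratings = [1, 2, 2, 3, 5, 5, 5]

print(median(ratings))          # valid for ordinal data
print(round(mean(ratings), 2))  # treats the scale as interval/ratio
```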

Central Tendency vs. Variability

Central tendency measures describe the typical response, whereas variability measures describe fluctuation in scores within a sample

Central tendency vs. variability

-central tendency measures are used to describe the typical response seen in a sample of observations

-variability measures are used to describe how much fluctuation in scores there are in a sample of observations

-we need both to interpret a person's score

Variance and Standard Deviation

-Reflects the variability of scores about the mean of the group

Variance and standard deviation

-both variance and standard deviation reflect the variability of scores about the mean of the group

-typical distance of a score from the mean
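The cards above can be shown numerically; the scores below are invented example data.

```python
from statistics import mean, pvariance, pstdev

scores = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical test scores
mu = mean(scores)
var = pvariance(scores)  # mean squared deviation from the mean
sd = pstdev(scores)      # square root of the variance
print(mu, var, sd)
```

Here the standard deviation of 2.0 is the "typical distance" of a score from the mean of 5.0.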

Population Z-score Formula

-Calculated by subtracting the population mean from the individual raw score and then dividing by the population standard deviation

Sample Z-Score Formula

Calculated by subtracting the sample mean from the individual raw score and then dividing by the sample standard deviation
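Both formulas follow the same pattern: subtract the mean, divide by the standard deviation. A minimal sketch, with invented numbers (the IQ example assumes the conventional mu = 100, sigma = 15):

```python
from statistics import mean, stdev

def z_population(x, mu, sigma):
    """(raw score - population mean) / population SD"""
    return (x - mu) / sigma

def z_sample(x, sample):
    """(raw score - sample mean) / sample SD (n - 1 denominator)"""
    return (x - mean(sample)) / stdev(sample)

print(z_population(130, 100, 15))     # an IQ of 130 is 2 SDs above the mean
sample = [2, 4, 4, 4, 5, 5, 7, 9]     # invented sample
print(round(z_sample(9, sample), 2))
```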

Positive Relations

-Strong relation: r = .7 or higher

-Moderate relation: around r = .4

-r² = proportion of variance shared by the variables
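A sketch of r and r² with invented data, writing Pearson's r out from its definition (covariance over the product of the standard deviations):

```python
from statistics import mean, pstdev

def pearson_r(xs, ys):
    """Covariance of x and y divided by the product of their SDs."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / len(xs)
    return cov / (pstdev(xs) * pstdev(ys))

study_hours = [1, 2, 3, 4, 5]       # invented predictor
exam_score = [55, 60, 70, 72, 88]   # invented criterion
r = pearson_r(study_hours, exam_score)
print(round(r, 2))       # strong positive relation (r > .7)
print(round(r ** 2, 2))  # proportion of variance shared
```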

Spearman's Rho

-Used if sample sizes are small or if ordinal scale data is used
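A sketch of Spearman's rho: rank both variables, then apply the shortcut formula for untied ranks. This version assumes no tied values; the small-sample data are invented.

```python
def ranks(values):
    """Rank 1..n by value (assumes no ties)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    out = [0] * len(values)
    for rank, i in enumerate(order, start=1):
        out[i] = rank
    return out

def spearman_rho(xs, ys):
    """Shortcut for untied ranks: 1 - 6*sum(d^2) / (n*(n^2 - 1))."""
    rx, ry = ranks(xs), ranks(ys)
    n = len(xs)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n ** 2 - 1))

# Invented small-sample data (e.g., test score vs. hours of TV)
xs = [86, 97, 99, 100, 101, 103, 106, 110, 112, 113]
ys = [0, 20, 28, 27, 50, 29, 7, 17, 6, 12]
print(round(spearman_rho(xs, ys), 3))
```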

Regressions

Simple Linear Regression - one variable used to predict values

Multiple Regression - multiple predictors are used

Logistic Regression - used when the outcome is a dichotomous variable
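The simple linear case can be sketched directly: fit the least-squares line y-hat = a + b·x from one predictor, then use it to predict new values. The data are invented for illustration.

```python
from statistics import mean

def fit_line(xs, ys):
    """Least-squares intercept a and slope b for y_hat = a + b*x."""
    mx, my = mean(xs), mean(ys)
    b = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
    a = my - b * mx
    return a, b

xs = [1, 2, 3, 4, 5]       # invented predictor
ys = [55, 60, 70, 72, 88]  # invented outcome
a, b = fit_line(xs, ys)
print(round(a, 1), round(b, 1))  # intercept and slope
print(round(a + b * 6, 1))       # prediction for a new x of 6
```

Multiple regression extends this to several predictors; logistic regression replaces the straight line with a model suited to a two-category outcome.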

Error and Prediction

SE = Standard Error of the Estimate: indicates magnitude of errors in estimation

-Higher correlations produce smaller SE; lower correlations produce larger SE

Error and prediction

-standard error of the estimate (SE)

-indicates magnitude of errors in estimation

-higher correlations produce smaller SE

-lower correlations produce larger SE
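The relation in this card follows from the standard formula for the standard error of the estimate, SE = SD_y · sqrt(1 - r²): as r grows, SE shrinks. The SD_y of 15 below is an invented example value.

```python
import math

sd_y = 15.0  # hypothetical SD of the criterion variable
ses = {r: sd_y * math.sqrt(1 - r ** 2) for r in (0.2, 0.5, 0.9)}
for r, se in ses.items():
    print(f"r = {r}: SE = {se:.2f}")  # SE shrinks as r grows
```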

Comparisons of NRT and CRT

NRT - covers broad content and emphasizes discrimination

CRT - focuses on more specific content and emphasizes description

Organ Functioning and Personality (4 Temp)

Sanguine - liver, optimistic and cheerful

Choleric - spleen, easily angered, controlling

Melancholic - gall bladder, perfectionistic, depressive

Phlegmatic - lungs, calm and unemotional

Six Approaches to Personality

Psychoanalytic - unconscious mind

Trait - continuum

Biological - inherited and physiological

Humanistic - responsibility and self-acceptance

Behavioral - conditioning and expectations

Cognitive - process information

MBTI - theory based approach

-Four bi-polar dimensions

-Introvert-extrovert - Flow of energy is inward or outward

-Sensing-intuition - Perception is from senses or insights

-Thinking-feeling - Decisions are objective (rational) or subjective (emotional)

-Judging-perceiving - Works step-by-step or impulsively

Lexical Hypothesis and Personality

LH = Key personality concepts are encoded in language

-Data reduction (Factor Analysis) used to determine structure

-Supports five-factor model

How do we measure honesty?

Validity Scales - Gauge degree of honesty in response pattern by comparing scores to a criterion group

-Infrequent responses/lie scales & Impression management

Measuring Response Style

-Consistency of response/fatigue

-Response set

Criterion Groups

Criterion group - reference sample with some characteristic to be compared to a general sample

Also known as contrasted groups design, or empirically-keyed scales

Criterion group

-a group of test-takers who share specific characteristics and whose responses serve as "standard"

Empirical vs. Theory

-Empirically-keyed scales are heterogenous

-Theory based scales are homogenous

MMPI

-Most widely used psychological test

-Developed by Hathaway and McKinley

-Originally designed to assist with diagnosis of psychiatric disorders

MMPI

-most widely used psychological test in the world

-developed by Hathaway and McKinley in the late 1930s and early 1940s

-university of Minn. hospital and persons w/in community

-originally designed to assist w/ the diagnosis of different psychiatric disorders

-at one time was popular for use in employment screening

MMPI-2

-items revised, removed, replaced

-norm: 1,138 males and 1,462 females b/w 18 & 80 from several regions and diverse communities w/in the US

-increased attention to "non-pathological" interpretation

Normal Personality

NEO & MBTI are used

Used to understand clients' strengths, weaknesses, and interpersonal style

Some common misconceptions about intelligence tests

-Measure innate ability

-Fixed and cannot change over time

-Tell all we need to know

-Measure same underlying capacity

-Scores from different measures are interchangeable

Jean Esquirol

First scientist to make distinction between mental incapacity and mental illness

Francis Galton

wrote Hereditary Genius

Psychometric Intelligence

Intelligence is first and foremost your ability to do well on an intelligence test

Assimilation and Accomodation

Ass - fitting info into existing schemas

Acc - existing schemas are modified to accommodate new info

Fluid Intelligence (Cattell & Horn)

-Nonverbal, mental efficiency, adaptive and new learning capabilities

-More dependent on cortical and lower cortical regions

-Increases until adolescence and then gradually decreases throughout life

Crystallized Intelligence (Cattell & Horn)

-Acquired skills and knowledge, well-established cognitive function

-Influenced by formal and informal learning

-Increases through lifespan

-Contingent upon fluid intelligence

Experiential/Creative (Sternberg)

Involves the individual's knowledge of both internal and external environments

-Involves coping with tasks/situations

-New situations are novel and require novel strategies

-With more experience they become automatic

Comparing Theories of Intelligence

Points of agreement:

-Knowledge-based thinking

-Apprehension

-Adaptive Purposeful Striving

-Fluid-Analytic Reasoning

-Mental Playfulness

-Idiosyncratic Learning

Comparing Intelligence Theories