- StudyBlue
- Statistics

Jennifer S.

Data

Collections of observations (such as measurements, genders, survey responses).

Statistics

The science of planning studies and experiments, obtaining data, and then organizing, summarizing, presenting, analyzing, interpreting, and drawing conclusions based on the data.

Population

The complete collection of all individuals (scores, people, measurements, and so on) to be studied. The collection is complete in the sense that it includes all of the individuals to be studied.

Census

The collection of data from every member of the population.

Sample

A subcollection of members selected from a population.

Statistical thinking

Must consider:

Context of the data;

Source of the data;

Sampling method;

Conclusions; and

Practical implications

Key Questions

What is the context of the data?

What is the source of the data?

How were the data obtained?

What can we conclude from the data?

Based on statistical conclusions, what practical implications result from our analysis?

Guidelines for critically evaluating a statistical study

1) Identify the study's goal, population considered, & study type. 2) Consider the source, esp. w/regard to possibility of bias. 3) Analyze sampling method. 4) Look for problems defining/measuring variables of interest. 5) Watch for confounding variables that could invalidate conclusions. 6) Consider setting & wording of any survey. 7) Check that graphs represent data fairly, & conclusions are justified. 8) Consider if conclusions achieve study's goals, make sense, & have practical significance.

Some basic principles of ethics

All subjects in a study must give their informed consent.

All results from individuals must remain confidential.

The well-being of study subjects must always take precedence over the benefits to society.

Parameter

A numerical measurement describing some characteristic of a population.

Statistic

A numerical measurement describing some characteristic of a sample.

Quantitative (or numerical) data

Consist of numbers representing counts or measurements.

Categorical (or qualitative or attribute) data

Consist of names or labels that are not numbers representing counts or measurements.

Discrete data

Result when the number of possible values is either a finite number or a "countable" number (that is, the number of possible values is 0 or 1 or 2, and so on).

Continuous (numerical) data

Result from infinitely many possible values that correspond to some continuous scale that covers a range of values without gaps, interruptions, or jumps.

Nominal level of measurement

Characterized by data that consist of names, labels, or categories only. The data cannot be arranged in an ordering scheme (such as low to high).

Ordinal level of measurement

Data are at the ordinal level of measurement if they can be arranged in some order, but differences (obtained by subtraction) between data values either cannot be determined or are meaningless.

Interval level of measurement

The interval level of measurement is like the ordinal level, with the additional property that the difference between any two data values is meaningful. However, data at this level do not have a natural zero starting point (where none of the quantity is present).

Ratio level of measurement

The interval level with the additional property that there is also a natural zero starting point (where zero indicates that none of the quantity is present). For values at this level, differences and ratios are both meaningful.

- nominal
- ordinal
- interval
- ratio

What is a voluntary response sample?

A voluntary response sample is one in which the subjects themselves decide whether to be included in the study.

Why is a voluntary response sample generally not suitable for a statistical study?

Because it is often biased: those with a special interest in the subject are more likely to participate in the study.

What is the difference between statistical significance and practical significance?

Statistical significance is indicated when methods of statistics are used to reach a conclusion that some treatment or finding is effective. Practical significance is a separate question: common sense might suggest that the treatment or finding does not make enough of a difference to justify its use or to be practical.

You have collected a large sample of values. Why is it important to understand the context of the data?

Without understanding the context of the data you have no understanding of what the data represents.

[3 lb. loss after 12 mos.] Does the Weight Watchers weight loss program have a statistical significance? A practical significance?

Although the program appears to have statistical significance, it does not have practical significance because the mean loss of 3.0 lb. after one year does not seem to justify the program.

"We recruited study candidates from the Greater Boston area using newspaper advertisements and television publicity." Is the sample a voluntary response sample?

Yes because the respondents themselves decided to be included.

Super Bowl: The New York Giants beat the Denver Broncos in the Super Bowl by a score of 120 to 98.

Possible, but very unlikely.

Speeding Ticket: While driving to his home in Connecticut, David Letterman was ticketed for driving 205 mph on a highway with a speed limit of 55 mph.

Possible, but very unlikely.

Traffic lights: While driving through a city, Mario Andretti arrived at three consecutive traffic lights and they were all green.

Possible and likely

Thanksgiving: Thanksgiving day will fall on a Monday next year.

Impossible.

Supreme Court: All of the justices on the US Supreme Court have the same birthday.

Possible, but very unlikely.

Calculators: When each of 25 statistics students turns on his or her TI-84 Plus Calculator, all 25 calculators operate successfully.

Possible and likely.

Lucky Dice: Steve Wynn rolled a pair of dice and got a total of 14.

Impossible.

Slot Machine: Wayne Newton hit the jackpot on a slot machine each time in ten consecutive attempts.

Possible, but very unlikely.

Ratio

There is a natural zero starting point and ratios are meaningful. Example: Distances

Interval

Differences are meaningful, but there is no natural zero starting point and ratios are meaningless. Example: Body temperatures in degrees Fahrenheit or Celsius.

Ordinal

Categories are ordered, but differences can't be found or are meaningless. Example: Ranks of colleges in U.S. News and World Report.

Nominal

Categories only. Data cannot be arranged in an ordering scheme. Example: Eye colors

How do a parameter and a statistic differ?

A parameter is a numerical measurement describing some characteristic of a population, whereas a statistic is a numerical measurement describing some characteristic of a sample.

How do discrete and continuous data differ?

Discrete data result when the number of possible values is either a finite number or a "countable" number (where the number of possible values is 0 or 1 or 2 and so on), but continuous data result from infinitely many possible values that correspond to some continuous scale that covers a range of values without gaps, interruptions or jumps.

Parameter vs. Statistic

In a large sample of households, the median annual income per household for high school graduates is $19,856 (based on data from the U.S. Census Bureau).

Statistic

Parameter vs. Statistic

A study of all 2223 passengers aboard the Titanic found that 706 survived when it sank.

Parameter

Parameter vs. Statistic

If the areas of the 50 states are added and the sum is divided by 50, the result is 196,533 square kilometers.

Parameter

Parameter vs. Statistic

The author measured the voltage supplied to his home on 40 different days, and the average (mean) value is 123.7 volts.

Statistic

Discrete or Continuous Data Set

In New York City, there are 3250 walk buttons that pedestrians can press at traffic intersections, and 2500 of them do not work. (based on..._

Discrete

Discrete or Continuous Data Set

The amount of nicotine in a Marlboro cigarette is 1.2 milligrams.

Continuous

Discrete or Continuous Data Set

In a test of a method of gender selection developed by the Genetics & IVF Institute, 726 couples used the XSORT method and 668 of them had baby girls.

Discrete

Discrete or Continuous Data Set

When a Cadillac STS is randomly selected and weighed, it is found to weigh 1827.9 kg.

Continuous

Nominal, Ordinal, Interval, Ratio

Voltage measurements from the author's home

Ratio

Nominal, Ordinal, Interval, Ratio

Critic ratings of movies on a scale from 0 star to 4 stars

Ordinal

Nominal, Ordinal, Interval, Ratio

Companies (Disney, MGM, Warner Brothers, Universal, 20th Century Fox) that produced the movies listed in Data Set 7 in Appendix B.

Nominal

Nominal, Ordinal, Interval, Ratio

Years in which movies were released, as listed in Data Set 9 in Appendix B.

Interval

Voluntary response sample (or self-selected sample)

One in which the respondents themselves decide whether to be included. (Do not use voluntary response sample data for making conclusions about a population.)

Examples of voluntary response samples which are seriously flawed

Polls conducted through the Internet, in which subjects can decide whether to respond; mail-in polls, in which subjects can decide whether to reply; and telephone call-in polls, in which newspaper, radio, or television announcements ask that you voluntarily call a special number to register your opinion.

Observational Study

We observe and measure specific characteristics, but we don't attempt to modify the subjects being studied.

Experiment

We apply some treatment and then proceed to observe its effects on the subjects (Subjects in experiments are called experimental units).

Simple Random Sample

A simple random sample of n subjects is selected in such a way that every possible sample of the same size n has the same chance of being chosen.

Random Sample

In a random sample, members from the population are selected in such a way that each individual member in the population has an equal chance of being selected.

Probability Sample

Involves selecting members from a population in such a way that each member of the population has a known (but not necessarily the same) chance of being selected.

Systematic sampling

We select some starting point and then select every *k*th (such as every 50th) element in the population.

Convenience sampling

We simply use results that are very easy to get.

Stratified sampling

We subdivide the population into at least two different subgroups (or strata) so that subjects within the same subgroup share the same characteristics (such as gender or age bracket), then we draw a sample from each subgroup (or stratum).

Cluster sampling

We first divide the population area into sections (or clusters), then randomly select some of those clusters, and then choose all the members from those selected clusters.
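A minimal Python sketch of four of the sampling methods above. The population of 100 numbered members, the two strata, the five clusters, and all function names are made up for illustration; they are not from the course material.

```python
import random

def simple_random_sample(population, n, rng):
    # Every possible sample of size n has the same chance of being chosen.
    return rng.sample(population, n)

def systematic_sample(population, k, start=0):
    # Pick a starting point, then take every k-th element.
    return population[start::k]

def stratified_sample(strata, n_per_stratum, rng):
    # Draw a sample from each subgroup (stratum).
    return {name: rng.sample(members, n_per_stratum)
            for name, members in strata.items()}

def cluster_sample(clusters, n_clusters, rng):
    # Randomly select whole clusters, then keep every member of those clusters.
    chosen = rng.sample(sorted(clusters), n_clusters)
    return [member for name in chosen for member in clusters[name]]

population = list(range(1, 101))   # hypothetical members numbered 1..100
rng = random.Random(0)             # fixed seed so repeated runs match

srs = simple_random_sample(population, 10, rng)
every_10th = systematic_sample(population, 10, start=4)
strata = {"low": population[:50], "high": population[50:]}
strat = stratified_sample(strata, 5, rng)
clusters = {f"block{i}": population[i * 20:(i + 1) * 20] for i in range(5)}
clus = cluster_sample(clusters, 2, rng)
```

Note the contrast: stratified sampling takes some members from every subgroup, while cluster sampling takes all members from some subgroups.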

Cross-sectional study

Data are observed, measured, and collected at one point in time.

Retrospective (or case-control) study

Data are collected from the past by going back in time (through examination of records, interviews, and so on).

Prospective (or longitudinal or cohort) study

Data are collected in the future from groups sharing common factors (called cohorts).

Confounding

Occurs in an experiment when you are not able to distinguish among the effects of different factors. Try to plan the experiment so that confounding does not occur.

Important considerations in the design of experiments:

Use randomization to assign subjects to different groups; use replication by repeating the experiment on enough subjects so that effects of treatments or other factors can be clearly seen; control the effects of variables by using such techniques as blinding and a completely randomized experimental design.

Sampling error

The difference between a sample result and the true population result; such an error results from chance sample fluctuations.

Nonsampling error

Occurs when the sample data are incorrectly collected, recorded, or analyzed (such as by selecting a biased sample, using a defective measurement instrument, or copying the data incorrectly).

Spreadsheet

Collection of data organized in an array of cells arranged in rows and columns, and it is used to summarize, analyze, and perform calculations with the data.

Excel Worksheet

One page or sheet (or spreadsheet), consisting of cells arranged in an array of rows and columns. The cells can contain text, numbers, or formulas.

Center

A representative or average value that indicates where the middle of the data set is located.

Variation

A measure of the amount that the data values vary.

Distribution

The nature or shape of the spread of the data over the range of values (such as bell-shaped, uniform, or skewed).

Outliers

Sample values that lie very far away from the vast majority of the other sample values.

Time

Changing characteristics of the data over time.

Frequency distribution (or frequency table)

Shows how a data set is partitioned among all of several categories (or classes) by listing all of the categories along with the number of data values in each of the categories.

Lower class limits

The smallest numbers that can belong to the different classes.

Upper class limits

The largest numbers that can belong to the different classes.

Class boundaries

The numbers used to separate the classes, but without the gaps created by class limits.

Class midpoints

The values in the middle of the classes. Each class midpoint is found by adding the lower class limit to the upper class limit and dividing the sum by 2.

Class width

The difference between two consecutive lower class limits or two consecutive lower class boundaries in a frequency distribution.

Frequency Distribution Procedure

Determine the # of classes. Calculate the class width. Choose either the minimum data value or a convenient value below the minimum data value as the 1st lower class limit (lcl). Add the class width to the 1st lcl to get the 2nd lcl - repeat. List the lcl(s) in a vertical column on the left & upper class limits on the right. Take each individual data value & put a tally mark in the appropriate class. Add tally marks to find the total frequency for each class.
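The procedure above can be sketched in Python as follows. This assumes integer data and takes the class width as the next whole number above (max - min)/(# of classes), so the maximum always lands in the last class. The scores and the function name are invented for illustration.

```python
def frequency_distribution(values, num_classes):
    """Tally integer data into classes of equal width."""
    lowest, highest = min(values), max(values)
    # Next whole number above (max - min) / classes.
    width = (highest - lowest) // num_classes + 1
    table = []
    for i in range(num_classes):
        lower = lowest + i * width      # lower class limit
        upper = lower + width - 1       # upper class limit (integer data)
        count = sum(lower <= v <= upper for v in values)
        table.append((lower, upper, count))
    return width, table

scores = [52, 55, 61, 64, 67, 70, 71, 73, 78, 80, 84, 85, 90, 93, 99]  # made-up data
width, table = frequency_distribution(scores, 5)
for lower, upper, count in table:
    print(f"{lower}-{upper}: {count}")
```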

Relative frequency distribution or percentage frequency distribution

The frequency of a class is replaced w/ a relative frequency (a proportion) or a percentage frequency (a %). Relative frequency = class frequency/sum of all frequencies. Percentage frequency = (class frequency/sum of all frequencies) multiplied by 100%. The sum of the relative frequencies in a relative frequency distribution must be close to 1 (or 100%).

Cumulative frequency

The sum of the frequencies for that class and all previous classes.
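A short sketch of relative, percentage, and cumulative frequencies, using hypothetical class frequencies:

```python
from itertools import accumulate

frequencies = [3, 4, 3, 3, 2]          # hypothetical class frequencies
total = sum(frequencies)

relative = [f / total for f in frequencies]      # proportions
percentage = [100 * r for r in relative]         # percentages
cumulative = list(accumulate(frequencies))       # running totals
```

The last cumulative frequency always equals the total number of data values.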

Normal distribution

The frequencies start low, then increase to one or two high frequencies, then decrease to a low frequency. The distribution is approximately symmetric, with frequencies preceding the maximum being roughly a mirror image of those that follow the maximum.

Histogram

Used to analyze the shape of the distribution of the data. A graph consisting of bars of equal width drawn adjacent to each other (without gaps). The horizontal scale represents classes of quantitative data values and the vertical scale represents frequencies. The heights of the bars correspond to the frequency values.

Stemplot

Represents quantitative data by separating each value into two parts: the stem (such as the leftmost digit) and the leaf (such as the rightmost digit).

Bar graph

Uses bars of = width to show frequencies of categories of qualitative data. The vertical scale represents frequencies or relative frequencies. The horizontal scale identifies the different categories of qualitative data.

Multiple bar graph

Has two or more sets of bars, & is used to compare 2 or more data sets.

Pareto Chart

Used when you want to draw attention to the more important categories. A bar graph for qualitative data, with the added stipulation that the bars are arranged in descending order according to frequencies. The vertical scale represents frequencies or relative frequencies. The horizontal scale identifies the different categories of qualitative data. The bars decrease in height from left to right.

Pie chart

A graph that depicts qualitative data as slices of a circle, in which the size of each slice is proportional to the frequency count for the category.

Scatterplot (or scatter diagram)

A plot of paired (x, y) quantitative data with a horizontal x-axis & a vertical y-axis. The horizontal axis is used for the first (x) variable, & the vertical axis is used for the 2nd variable. The pattern of the plotted points is often helpful in determining whether there is a relationship between the two variables.

Important principles about graphs

Use a table for data sets of 20 values or less. A graph of data should make the viewer focus on the true nature of the data, not on other elements. Don't distort the data. Most of the ink in a graph should be used for data, not other design elements. Don't use slanted lines, dots, or crosshatching. Don't use areas or volumes for data that are actually 1-dimensional in nature. Never publish pie charts. They waste ink on nondata components, & lack an appropriate scale.

Nonzero axis

Some graphs are misleading because one or both of the axes begin at some value other than zero, so that differences are exaggerated.

Measure of center

A value at the center or middle of a data set.

Mean (or arithmetic mean)

The measure of center found by adding the data values & dividing the total by the number of data values.

Median

The measure of center that is the middle value when the original data values are arranged in order of increasing (or decreasing) magnitude. The median is often denoted by "x-tilde" (page 98).

Mode

The value that occurs most often or with the greatest frequency.

Midrange

The measure of center that is the value midway between the maximum & minimum values in the original data set. It is too sensitive & is rarely used.
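The four measures of center can be compared on one small data set. The values below are made up, and `multimode` (Python 3.8+) is used because a data set can have more than one mode.

```python
from statistics import mean, median, multimode

data = [2, 3, 3, 5, 7, 10]                 # made-up values

center = {
    "mean": mean(data),                    # sum of values / number of values
    "median": median(data),                # middle of the sorted values
    "mode": multimode(data),               # most frequent value(s)
    "midrange": (max(data) + min(data)) / 2,
}
```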

Round-off rule for the Mean, Median, & Midrange

Carry one more decimal place than is present in the original set of values.

Skewed

Data extends more to one side than to the other.

Symmetric

The left half of its histogram is roughly a mirror image of its right half.

Skewed to the left (negatively skewed)

Has a longer left tail & the mean & median are to the left of the mode.

Skewed to the right (positively skewed)

Has a longer right tail, & the mean & median are to the right of the mode.

Range

Measure of variation. Range = maximum data value - minimum data value.

Round-off rule for measures of variation

Carry 1 more decimal place than is present in the original set of data.

Sample Standard Deviation

The measure of variation most commonly used in statistics. The standard deviation of a set of sample values, denoted by s, is a measure of variation of values about the mean, calculated using:

s = √( ∑(x - xbar)^{2} / (n - 1) )

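Population Standard Deviation

A measure of variation of all values in a population about the population mean, denoted by sigma and calculated using:

sigma = √( ∑(x - mu)^{2} / N )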

Variance

A measure of variation = to the square of the standard deviation.

Sample variance

Symbol = *s*^{2}

It is an unbiased estimate of the population variance. The values of s squared tend to target the value of sigma squared instead of systematically tending to overestimate or underestimate sigma squared.

Population variance

Square of the population standard deviation. Sigma squared.
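A sketch comparing the sample and population versions of variance and standard deviation. The two differ only in the divisor (n - 1 for a sample, N for a population). The data values and function names are illustrative.

```python
import math

def sample_variance(xs):
    xbar = sum(xs) / len(xs)
    return sum((x - xbar) ** 2 for x in xs) / (len(xs) - 1)   # divide by n - 1

def population_variance(xs):
    mu = sum(xs) / len(xs)
    return sum((x - mu) ** 2 for x in xs) / len(xs)           # divide by N

data = [2, 4, 4, 4, 5, 5, 7, 9]           # made-up values with mean 5

s2 = sample_variance(data)
sigma2 = population_variance(data)
s = math.sqrt(s2)                         # sample standard deviation
sigma = math.sqrt(sigma2)                 # population standard deviation
```

For the same values, the sample variance is always at least as large as the population variance.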

Range rule of thumb

The principle that for many data sets, the vast majority (such as 95%) of sample values lie within two standard deviations of the mean.

Minimum "usual" value = (mean) - 2 x (standard deviation).

Maximum "usual" value = (mean) + 2 x (standard deviation).

Properties of the standard deviation

It measures the variation among data values. Values close together have a small standard deviation & vice versa. The standard deviation has the same units of measurement as the original data values. For many data sets, a value is unusual if it differs from the mean by more than two standard deviations. When comparing variation in 2 different data sets, compare the standard deviations only if the data sets use the same scale & units & they have means that are approx. the same.

Empirical Rule for data with a bell-shaped distribution

About 68% of all values fall within 1 standard deviation of the mean. About 95% of all values fall within 2 standard deviations of the mean. About 99.7% of all values fall within 3 standard deviations of the mean.

Coefficient of variation (CV)

The CV for a set of nonnegative sample or population data, expressed as a %, describes the standard deviation relative to the mean, and is given by the following:

Sample:

CV = (s/xbar) multiplied by 100%

Population:

CV = (sigma/mu) multiplied by 100%
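Because the CV has no units, it can compare variation across data sets measured on different scales. A minimal sketch with invented height and weight samples:

```python
from statistics import mean, stdev

def cv_percent(sample):
    # Sample CV: (s / xbar) multiplied by 100%.
    return stdev(sample) / mean(sample) * 100

heights_cm = [160, 165, 170, 175, 180]    # made-up sample
weights_kg = [55, 62, 70, 78, 85]         # made-up sample

cv_heights = cv_percent(heights_cm)
cv_weights = cv_percent(weights_kg)
```

In this made-up example the weights vary more relative to their mean than the heights do, even though the two standard deviations are in different units and cannot be compared directly.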

Z-score (or standardized value)

A measure of relative standing. The number of standard deviations that a given value x is above or below the mean. Formula:

Sample:

z = (x - xbar)/s

Population:

z = (x - mu)/sigma

Ordinary values: -2 ≤ z score ≤ 2

Unusual values: z score < -2 or z score > 2

Round-off rule for z Scores

Round z scores to 2 decimal places (such as 2.46).
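A small sketch of the z score formula, the 2-decimal round-off rule, and the "unusual" cutoff. The mean of 100 and standard deviation of 15 are assumed values for illustration.

```python
def z_score(x, mean, sd):
    # Number of standard deviations that x lies above or below the mean.
    return round((x - mean) / sd, 2)

def is_unusual(z):
    return z < -2 or z > 2

z_a = z_score(130, 100, 15)     # exactly 2 standard deviations above the mean
z_b = z_score(137, 100, 15)
```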

Percentiles

Measures of location, denoted by Psub# which divide a set of data into 100 groups with about 1% of the values in each group.

Percentile of value x = (number of values less than x / total number of values) multiplied by 100. Round to the nearest whole number.
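The percentile formula above as a short function. Halves are rounded up here; that is an assumption, since the card only says to round to the nearest whole number.

```python
import math

def percentile_of(x, values):
    below = sum(v < x for v in values)           # number of values less than x
    return math.floor(100 * below / len(values) + 0.5)

values = list(range(1, 21))                      # made-up data: 1, 2, ..., 20
p = percentile_of(15, values)                    # 14 of the 20 values are below 15
```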

Quartiles

Measures of location, denoted Qsub#, which divide a set of data into four groups with about 25% of the values in each group.

Qsub1 (First quartile)

Separates the bottom 25% of the sorted values from the top 75%.

Qsub1 = Psub25

Qsub2 (Second quartile)

Same as the median; separates the bottom 50% of the sorted values from the top 50%.

Qsub2 = Psub50

Qsub3 (Third quartile)

At least 75% of the sorted values are ≤ Qsub3 & at least 25% of the values are ≥ Qsub3.

Qsub3 = Psub75

Interquartile range (IQR)

Qsub3 - Qsub1

5-number summary

Consists of the minimum value, the first quartile, the median (or second quartile), the third quartile, and the maximum value for a set of data.
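A sketch of the 5-number summary and IQR. Textbooks and software use slightly different quartile conventions. This version takes Qsub1 and Qsub3 as the medians of the lower and upper halves of the sorted data, leaving out the middle value when the count is odd. That convention is an assumption here and may not match the course text exactly.

```python
from statistics import median

def five_number_summary(values):
    xs = sorted(values)                  # needs at least 2 values
    half = len(xs) // 2
    lower, upper = xs[:half], xs[-half:]
    return xs[0], median(lower), median(xs), median(upper), xs[-1]

data = [1, 3, 4, 6, 7, 8, 10, 12, 15]   # made-up values
minimum, q1, q2, q3, maximum = five_number_summary(data)
iqr = q3 - q1
```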

Boxplot (or box-and-whisker diagram)

A graph of a data set that consists of a line extending from the minimum value to the maximum value, & a box with lines drawn at the first quartile, the median, & the third quartile.

False positive

Test incorrectly indicates the presence of a condition (such as lying, being pregnant, or having some disease) when the subject does not actually have that condition.

False Negative

Test incorrectly indicates that the subject does not have the condition when the subject actually does have that condition.

True positive

Test correctly indicates that the condition is present when it really is present.

True negative

Test correctly indicates that the condition is not present when it really is not present.

Measures of test reliability

Test sensitivity: the probability of a true positive.

Test specificity: the probability of a true negative.
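Sensitivity and specificity can be estimated from counts of the four outcomes. The counts below are hypothetical.

```python
# Hypothetical results for 1000 subjects.
true_pos, false_neg = 90, 10      # subjects who have the condition
false_pos, true_neg = 45, 855     # subjects who do not have the condition

# Sensitivity: proportion of subjects with the condition who test positive.
sensitivity = true_pos / (true_pos + false_neg)
# Specificity: proportion of subjects without the condition who test negative.
specificity = true_neg / (true_neg + false_pos)
```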

Rare event rule for inferential statistics

If, under a given assumption, the probability of a particular observed event is extremely small, we conclude that the assumption is probably not correct.

Event

Any collection of results or outcomes of a procedure.

Simple event

An outcome or an event that cannot be further broken down into simpler components.

Sample space

The sample space for a procedure consists of all possible simple events. That is, the sample space consists of all outcomes that cannot be broken down any further.

Relative frequency approximation of probability

Conduct or observe a procedure, & count the # of times that event A actually occurs. Based on the actual results, P(A) is approximated as follows:

P(A) = # of times A occurred/# of times the procedure was repeated.

Classical approach to probability (requires equally likely outcomes)

Assume that a given procedure has n different simple events and that each of those simple events has an = chance of occurring. If event A can occur in s of these n ways, then:

P(A) = # of ways A can occur/# of different simple events = s/n

Subjective probabilities

P(A), the probability of event A, is estimated by using knowledge of the relevant circumstances. Educated guess or estimate.

Three approaches to finding a probability

Relative frequency approach

Classical approach

Subjective probability

Relative frequency approach

When trying to determine the probability that an individual car crashes in a year, we must examine past results to determine the # of cars in use in a year & the # of them that crashed, then we find the ratio of the # of cars that crashed to the total number of cars.

Classical approach

When trying to determine the probability of winning the grand prize in a lottery by selecting 6 numbers between 1 & 60, each combination has an equal chance of occurring. The probability of winning is .0000000200.
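The lottery figure can be checked with the classical approach, assuming the 6 numbers are chosen without replacement from 1 through 60 and order does not matter:

```python
from math import comb

ways_to_win = 1                      # one winning combination
simple_events = comb(60, 6)          # equally likely combinations of 6 from 60
p_win = ways_to_win / simple_events

print(simple_events)                 # number of possible combinations
print(f"{p_win:.10f}")               # probability to 10 decimal places
```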

Subjective probability

When trying to estimate the probability of an astronaut surviving a mission in a space shuttle, experts consider past events along with changes in technologies & conditions to develop an estimate of the probability.

Law of large numbers

As a procedure is repeated again & again, the relative frequency probability of an event tends to approach the actual probability.
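A quick simulation illustrates the idea, assuming a fair coin modeled with Python's pseudorandom generator. The seed and the toss counts are arbitrary choices.

```python
import random

def relative_frequency_of_heads(tosses, rng):
    heads = sum(rng.random() < 0.5 for _ in range(tosses))
    return heads / tosses

rng = random.Random(1)               # fixed seed so repeated runs match
for n in (10, 1_000, 100_000):
    print(n, relative_frequency_of_heads(n, rng))
```

With few tosses the relative frequency can be far from 0.5; with many tosses it tends to settle close to it.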

The probability of an impossible event

0

The probability of an event that is certain to occur

1

For any event A, the probability of A is between 0 and 1 inclusive.

0 ≤ P(A) ≤ 1

Complement

The probability that an event does not occur. The complement of event A, denoted by Abar, consists of all outcomes in which event A does not occur.

P(A or Abar) = P(A) + P(Abar) = 1

Rounding off probabilities

When expressing the value of a probability, either give the exact fraction or decimal or round off final decimal results to three significant digits. All digits in a # are significant except for the zeros that are included for proper placement of the decimal point.

Actual odds against event A occurring

The ratio P(Abar)/P(A), usually expressed in the form of a:b, (or "a to b"), where a and b are integers having no common factors.

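Actual odds in favor of event A occurring

The ratio P(A)/P(Abar), which is the reciprocal of the actual odds against A. If the odds against A are a:b, then the odds in favor of A are b:a.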

The payoff odds against event A occurring

The ratio of net profit (if you win) to the amount bet.
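A sketch of the three kinds of odds, using a single-number bet in American roulette as an illustrative case (38 equally likely slots, with a payoff of 35 for each 1 bet). `Fraction` reduces each ratio to integers with no common factors.

```python
from fractions import Fraction

def odds_against(p):
    ratio = (1 - p) / p                      # P(Abar) / P(A)
    return ratio.numerator, ratio.denominator

def odds_in_favor(p):
    ratio = p / (1 - p)                      # P(A) / P(Abar)
    return ratio.numerator, ratio.denominator

p_win = Fraction(1, 38)
actual_against = odds_against(p_win)         # read as "a to b"
actual_in_favor = odds_in_favor(p_win)

net_profit, amount_bet = 35, 1
payoff_against = (net_profit, amount_bet)    # payoff odds, set by the casino
```

The payoff odds are lower than the actual odds against winning, which is where the casino's edge comes from.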

Compound event

Any event combining two or more simple events.

P(A or B) = P(in a single trial, event A occurs or event B occurs or they both occur)

Rule for finding the probability that event A occurs or event B occurs.

Find the total of the # of ways A can occur & the # of ways B can occur, but find that total in such a way that no outcome is counted more than once.

Formal addition rule

P(A or B) = P(A) + P(B) - P(A and B)

where P(A and B) denotes the probability that A and B both occur at the same time as an outcome in a trial of a procedure.
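The formal addition rule can be checked against direct counting. This sketch uses a standard 52-card deck with A = "card is an ace" and B = "card is a heart"; the example is illustrative.

```python
from fractions import Fraction
from itertools import product

ranks = ["A", "2", "3", "4", "5", "6", "7", "8", "9", "10", "J", "Q", "K"]
suits = ["hearts", "diamonds", "clubs", "spades"]
deck = list(product(ranks, suits))           # 52 equally likely outcomes

def prob(event):
    return Fraction(sum(1 for card in deck if event(card)), len(deck))

def is_ace(card):
    return card[0] == "A"

def is_heart(card):
    return card[1] == "hearts"

# Formal rule: subtract the overlap so the ace of hearts is not counted twice.
p_formula = prob(is_ace) + prob(is_heart) - prob(lambda c: is_ace(c) and is_heart(c))
# Intuitive rule: count each qualifying outcome once.
p_counted = prob(lambda c: is_ace(c) or is_heart(c))
```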

Intuitive addition rule

To find P(A or B), find the sum of the # of ways event A can occur & the # of ways event B can occur, adding in such a way that every outcome is counted only once. P(A or B) is = to that sum, divided by the total # of outcomes in the sample space.

Disjoint (mutually exclusive)

Events A & B are disjoint (or mutually exclusive) if they cannot occur at the same time. Disjoint events do not overlap.

Rule of complementary events

P(A) + P(Abar) = 1

P(Abar) = 1 - P(A)

P(A) = 1 - P(Abar)

Multiplication Rule

P(A and B) = P(event A occurs in a 1st trial & event B occurs in a 2nd trial)

Conditional probability

P(B|A) represents the probability of event B occurring after it is assumed that event A has already occurred. (We can read B|A as "B given A" or as "event B occurring after event A has already occurred.")

Independent/ Dependent events

Two events A & B are independent if the occurrence of one does not affect the probability of the occurrence of the other. (Several events are similarly independent if the occurrence of any does not affect the probabilities of the occurrence of the others.) If A & B are not independent, they are said to be dependent.

Formal Multiplication Rule

P(A and B) = P(A) x P(B|A)

If A and B are independent events, P(B|A) is the same as P(B).

Intuitive Multiplication Rule

When finding the probability that event A occurs in 1 trial & event B occurs in the next trial, multiply the probability of event A by the probability of event B, but be sure that the probability of event B takes into account the previous occurrence of event A.

Treating dependent events as independent: The 5% guideline for cumbersome calculations

If calculations are very cumbersome and if a sample size is no more than 5% of the size of the population, treat the selections as being independent (even if the selections are made without replacement, so they are technically dependent).
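A sketch of the multiplication rule for dependent events, followed by the 5% guideline. The lot of 10,000 items with 1,000 defects is hypothetical.

```python
from fractions import Fraction

# Dependent events: two aces in a row from a 52-card deck, without replacement.
p_two_aces = Fraction(4, 52) * Fraction(3, 51)       # P(A) x P(B|A)

# 5% guideline: 3 items drawn without replacement from a hypothetical lot of
# 10,000 items with 1,000 defects (the sample is far below 5% of the lot).
exact = Fraction(1000, 10000) * Fraction(999, 9999) * Fraction(998, 9998)
approx = Fraction(1, 10) ** 3                        # treat draws as independent
difference = float(abs(exact - approx))
```

The independent approximation differs from the exact answer by only a few millionths here, which is why the shortcut is acceptable for small samples from large populations.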

Procedure for finding the probability of at least one of some event

Use the symbol A to denote the event of getting at least one.

Let Abar represent the event of getting none of the items being considered.

Calculate the probability that none of the outcomes results in the event being considered.

Subtract the result from 1.

P(at least one) = 1 - P(none)
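The steps above as a small function, assuming independent trials that each have the same probability. The example (at least one girl among 3 births, with boys and girls equally likely) is illustrative.

```python
from fractions import Fraction

def p_at_least_one(p_single, trials):
    p_none = (1 - p_single) ** trials    # P(Abar): the event never occurs
    return 1 - p_none                    # P(A) = 1 - P(Abar)

p_girl = p_at_least_one(Fraction(1, 2), 3)
```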

Conditional probability

A probability obtained with the additional information that some other event has already occurred. P(B|A) denotes the conditional probability of event B occurring, given that event A has already occurred. P(B|A) can be found by dividing the probability of events A & B both occurring by the probability of event A:

P(B|A)= P(A and B)/P(A)
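A sketch of the formula using a hypothetical two-way table of test results, with A = "test is positive" and B = "subject has the condition":

```python
from fractions import Fraction

# Hypothetical counts: (test result, true status) -> number of subjects.
counts = {
    ("positive", "has condition"): 90,
    ("positive", "no condition"): 45,
    ("negative", "has condition"): 10,
    ("negative", "no condition"): 855,
}
total = sum(counts.values())

def prob(predicate):
    return Fraction(sum(n for key, n in counts.items() if predicate(key)), total)

p_a = prob(lambda k: k[0] == "positive")                          # P(A)
p_a_and_b = prob(lambda k: k == ("positive", "has condition"))    # P(A and B)
p_b_given_a = p_a_and_b / p_a                                     # P(B|A)
```

The intuitive approach gives the same answer: among the 135 positive tests, 90 subjects have the condition.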

Intuitive approach to conditional probability

The conditional probability of B given A can be found by assuming that event A has occurred, & then calculating the probability that event B will occur.

Simulation

A process that behaves the same way as the procedure, so that similar results are produced.

Fundamental counting rule

For a sequence of two events in which the first event can occur m ways and the second event can occur n ways, the events together can occur a total of m multiplied by n ways.

Factorial symbol (!)

Denotes the product of decreasing positive whole numbers.

4! = 4x3x2x1 = 24

By special definition, 0! = 1

Factorial Rule

A collection of n different items can be arranged in order n! different ways. (This factorial rule reflects the fact that the first item may be selected n different ways, the second item may be selected n-1 ways, and so on.)

Permutations Rule (When items are all different)

There are n different items available.

We select r of the n items (without replacement).

We consider rearrangements of the same items to be different sequences. (The permutation of ABC is different from CBA & is counted separately).

If the preceding requirements are satisfied, the # of permutations (or sequences) of r items selected from n different available items (w/o replacement) is

_{n}P_{r}= n!/(n-r)!
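The permutations formula can be checked three ways: by the formula, by `math.perm` (Python 3.8+), and by listing every sequence. The values n = 5 and r = 3 are arbitrary.

```python
from itertools import permutations
from math import factorial, perm

n, r = 5, 3
by_formula = factorial(n) // factorial(n - r)        # n!/(n-r)!
by_library = perm(n, r)
by_listing = len(list(permutations("ABCDE", r)))     # ABC and CBA both counted
```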

Permutations Rule (When some items are identical to others)

There are n items available, & some items are identical to others.

We select all of the n items (w/o replacement).

We consider rearrangements of distinct items to be different sequences.

If the preceding requirements are satisfied, & if there are n_{1} alike, n_{2} alike, ..., n_{k} alike, the # of permutations (or sequences) of all items selected w/o replacement is

n!/(n_{1}! n_{2}! ... n_{k}!)
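A sketch of the rule for identical items. The helper name is illustrative; it divides n! by the factorial of each group of alike items.

```python
from collections import Counter
from itertools import permutations
from math import factorial

def distinct_arrangements(items):
    result = factorial(len(items))
    for alike in Counter(items).values():    # n_1, n_2, ..., n_k
        result //= factorial(alike)
    return result

mississippi = distinct_arrangements("MISSISSIPPI")   # 1 M, 4 I, 4 S, 2 P
small_check = len(set(permutations("AABB")))          # brute-force listing
```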

Combinations rule

Requirements:

There are n different items available.

We select r of the n items (w/o replacement).

We consider rearrangements of the same items to be the same. (The combination ABC is the same as CBA.)

If the preceding requirements are satisfied, the # of combinations of r items selected from n different items is

_{n}C_{r}= n!/[(n-r)! r!]
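The combinations formula checked by formula, by `math.comb` (Python 3.8+), and by listing. The values n = 5 and r = 3 are arbitrary.

```python
from itertools import combinations
from math import comb, factorial, perm

n, r = 5, 3
by_formula = factorial(n) // (factorial(n - r) * factorial(r))
by_library = comb(n, r)
by_listing = len(list(combinations("ABCDE", r)))     # ABC and CBA counted once

# Each combination of r items corresponds to r! permutations.
from_permutations = perm(n, r) // factorial(r)
```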

Random variable

A variable (typically represented by x) that has a single numerical value, determined by chance, for each outcome of a procedure.

Probability distribution

A description that gives the probability for each value of the random variable. It is often expressed in the format of a graph, table, or formula.

Discrete random variable

Has either a finite number of values or a countable number of values, where "countable" refers to the fact that there might be infinitely many values, but they can be associated with a counting process, so that the number of values is 0 or 1 or 2 or 3, etc.

Continuous random variable

Has infinitely many values, & those values can be associated with measurements on a continuous scale without gaps or interruptions.

Requirements for a probability distribution

Summation of P(x) = 1 where x assumes all possible values (The sum of all probabilities must be 1, but values such as .999 or 1.001 are acceptable because they result from rounding errors.)

0 < or = P(x) < or = 1 for every individual value of x. (That is, each probability value must be between 0 & 1 inclusive).

Formulas for the mean, variance, & standard deviation of a probability distribution

Mean: mu = Sum[x * P(x)]

Variance (easier to understand): sigma^{2} = Sum[(x-mu)^{2} * P(x)]

Variance (easier computations): sigma^{2} = Sum[x^{2} * P(x)] - mu^{2}

Standard deviation: sigma = square root of (Sum[x^{2} * P(x)] - mu^{2})
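A minimal sketch of these formulas in Python, using the number of heads in two tosses of a fair coin as a hypothetical distribution. The function name and the 0.001 rounding tolerance are assumptions.

```python
import math

def distribution_summary(values, probs):
    """Return (mu, variance, sigma) for a discrete probability distribution."""
    # Requirements: each P(x) is between 0 and 1, and the probabilities sum to 1
    if any(p < 0 or p > 1 for p in probs):
        raise ValueError("each P(x) must be between 0 and 1 inclusive")
    if abs(sum(probs) - 1) > 0.001:   # small tolerance for rounding (assumed)
        raise ValueError("probabilities must sum to 1")
    mu = sum(x * p for x, p in zip(values, probs))
    # "Easier to understand" form of the variance
    variance = sum((x - mu) ** 2 * p for x, p in zip(values, probs))
    # "Easier computations" form; the two should agree
    shortcut = sum(x ** 2 * p for x, p in zip(values, probs)) - mu ** 2
    assert math.isclose(variance, shortcut, abs_tol=1e-9)
    return mu, variance, math.sqrt(variance)

# x = number of heads in two tosses of a fair coin (hypothetical example)
mu, var, sigma = distribution_summary([0, 1, 2], [0.25, 0.5, 0.25])
print(mu, var, round(sigma, 1))   # 1.0 0.5 0.7
```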

Rounding-off rule for mu, sigma, & sigma^{2}

Round results by carrying one more decimal place than the number of decimal places used for the random variable x. If the values of x are integers, round mu, sigma, & sigma^{2} to one decimal place.

Range rule of thumb:

maximum usual value = mu + 2sigma

minimum usual value = mu - 2sigma

Rare event rule for inferential statistics

If, under a given assumption (such as the assumption that a coin is fair), the probability of a particular observed event (such as 992 heads in 1000 tosses of a coin) is extremely small, we conclude that the assumption is probably not correct.

Unusually high number of successes

x successes among n trials is an unusually high # of successes if the probability of x or more successes is unlikely with a probability of .05 or less. This criterion can be expressed as follows:

P(x or more) < or = .05

Unusually low number of successes

x successes among n trials is an unusually low # of successes if the probability of x or fewer successes is unlikely with a probability of .05 or less. This criterion can be expressed as follows:

P(x or fewer) < or = .05.
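The two criteria can be sketched in Python for a binomial setting. The binomial probability formula used here is the standard one and is not stated on these cards, so treat it, the function names, and the coin-toss numbers as assumptions.

```python
import math

def binomial_pmf(x, n, p):
    """P(exactly x successes in n independent trials), standard binomial formula."""
    return math.comb(n, x) * p ** x * (1 - p) ** (n - x)

def prob_x_or_more(x, n, p):
    return sum(binomial_pmf(k, n, p) for k in range(x, n + 1))

def prob_x_or_fewer(x, n, p):
    return sum(binomial_pmf(k, n, p) for k in range(0, x + 1))

def is_unusually_high(x, n, p, cutoff=0.05):
    return prob_x_or_more(x, n, p) <= cutoff

def is_unusually_low(x, n, p, cutoff=0.05):
    return prob_x_or_fewer(x, n, p) <= cutoff

# Hypothetical example: heads in 20 tosses of a fair coin
print(round(prob_x_or_more(14, 20, 0.5), 4), is_unusually_high(14, 20, 0.5))  # 0.0577 False
print(round(prob_x_or_more(15, 20, 0.5), 4), is_unusually_high(15, 20, 0.5))  # 0.0207 True
```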

Expected value

The expected value of a discrete random variable is denoted by E, & it represents the mean value of the outcomes. It is obtained by finding the value of Sum[x * P(x)]

Binomial probability distribution requirements

The procedure has a fixed number of trials.

The trials must be independent (The outcome of any individual trial doesn't affect the probabilities in the other trials).

Each trial must have all outcomes classified into two categories (commonly referred to as success & failure).

The probability of a success remains the same in all trials.

Poisson distribution

A discrete probability distribution that applies to occurrences of some event over a specified interval. The random variable x is the # of occurrences of the event in an interval. The interval can be time, distance, area, volume, or some similar unit.

P(x) = (mu^{x} * e^{-mu})/x! where e = approx. 2.71828
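The Poisson formula translates directly into Python. The mean mu = 2.0 below is a hypothetical value chosen only for illustration.

```python
import math

def poisson_pmf(x, mu):
    """P(x) = mu^x * e^(-mu) / x! for x = 0, 1, 2, ..."""
    return mu ** x * math.exp(-mu) / math.factorial(x)

mu = 2.0   # hypothetical mean number of occurrences per interval
for x in range(4):
    print(x, round(poisson_pmf(x, mu), 4))
```

Summing `poisson_pmf(x, mu)` over enough values of x gets arbitrarily close to 1, which matches the requirement that the probabilities in a distribution sum to 1.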

Requirements for the Poisson Distribution

The random variable x is the # of occurrences of an event over some interval.

The occurrences must be random.

The occurrences must be independent of each other.

The occurrences must be uniformly distributed over the interval being used.

Parameters of the Poisson Distribution

The mean is mu.

The standard deviation is

sigma = the square root of mu.

Differences between Binomial & Poisson Distributions

The binomial distribution is affected by the sample size n & the probability p, whereas the Poisson distribution is affected only by the mean mu.

In a binomial distribution, the possible values of the random variable x are 0, 1...n, but a Poisson distribution has possible x values of 0, 1, 2,..., with no upper limit.

Requirements for using the Poisson Distribution as an Approximation to the Binomial

n > or = 100

np < or = 10

Formula: mu = np
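A rough sketch of the approximation, compared with the exact binomial probability. The values n = 1000 and p = 0.002 are hypothetical, and the exact binomial formula is the standard one rather than something stated on this card.

```python
import math

def poisson_approx_to_binomial(x, n, p):
    """Approximate a binomial P(x) with a Poisson distribution having mu = np."""
    if n < 100 or n * p > 10:
        raise ValueError("requires n >= 100 and np <= 10")
    mu = n * p
    return mu ** x * math.exp(-mu) / math.factorial(x)

def binomial_exact(x, n, p):
    """Standard binomial probability, used here only as a comparison."""
    return math.comb(n, x) * p ** x * (1 - p) ** (n - x)

n, p = 1000, 0.002   # hypothetical values, so mu = np = 2
approx = poisson_approx_to_binomial(0, n, p)
exact = binomial_exact(0, n, p)
print(round(approx, 4), round(exact, 4))
```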

Normal distribution

If a continuous random variable has a distribution with a graph that is symmetric & bell-shaped, & it can be described by the equation below, we say it has a normal distribution.

y = e^{-(1/2)((x - mu)/sigma)^{2}} / (sigma * square root of (2 * pi))

Standard Normal Distribution

Properties:

Its graph is bell-shaped.

Its mean is = to 0 (that is, mu = 0).

Its standard deviation is = to 1 (that is, sigma = 1).

Uniform Distribution

A continuous random variable has a uniform distribution if its values are spread evenly over the range of possibilities. The graph of a uniform distribution results in a rectangular shape.

Properties of Uniform Distribution

The area under the graph of a probability distribution is = to 1.

There is a correspondence between area & probability (or relative frequency), so some probabilities can be found by identifying the corresponding areas.

Requirements for a density curve

The total area under the curve must = 1. Therefore there is a correspondence between area & probability.

Every point on the curve must have a vertical height that is 0 or greater. (That is, the curve cannot fall below the x-axis.)

Standard normal distribution

A normal probability distribution with mu = 0 and sigma = 1. The total area under its density curve is = to 1.

Table A-2

Designed only for the standard normal distribution, which has a mu of 0 & a sigma of 1.

Left page is negative; right is positive z scores.

Each value in the body of the table is a cumulative area from the left up to a vertical boundary above a specific z score.

Z scores: distance along the horizontal scale of the standard normal distribution; refer to the leftmost column & top row of Table A-2.

Area: region under the curve; values in body of A-2.

Procedure for finding a z score from a known area

Draw a bell-shaped curve & identify the region under the curve that corresponds to the given probability. If that region is not a cumulative region from the left, work instead with a known region that is a cumulative region from the left.

Table A-2: Using the cumulative area from the left, locate the closest probability in the body of Table A-2 & identify the corresponding z score.

Critical values

For a normal distribution, a critical value is a z score on the borderline separating the z scores that are likely to occur from those that are unlikely.

To work with a nonstandard normal distribution, we simply standardize values to use the procedures below.

If we convert values to standard z-scores using the formula below, then procedures for working with all normal distributions are the same as those for the standard normal distribution.

z = (x - mu)/sigma (round z scores to 2 decimal places).

Procedure for Converting from a nonstandard to a standard normal distribution

To find areas with a nonstandard normal distribution:

Sketch a normal curve, label the mean and the specific x values, then shade the region representing the desired probability.

For each relevant value x that is a boundary for the shaded region, use the formula to convert that value to the equivalent z score: z = (x - mu)/sigma

Refer to Table A-2 to find the area of the shaded region. This area is the desired probability.
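The procedure above can be sketched in Python, with `statistics.NormalDist` from the standard library standing in for the Table A-2 lookup. The mean of 100 and standard deviation of 15 are hypothetical values.

```python
from statistics import NormalDist

def z_score(x, mu, sigma):
    """Standardize x and round to 2 decimal places, as Table A-2 expects."""
    return round((x - mu) / sigma, 2)

def area_to_left(x, mu, sigma):
    """Cumulative area from the left up to x, for a normal distribution."""
    z = z_score(x, mu, sigma)
    return NormalDist().cdf(z)   # standard normal (mu = 0, sigma = 1)

mu, sigma = 100, 15   # hypothetical population values
p_below = area_to_left(115, mu, sigma)
p_between = area_to_left(115, mu, sigma) - area_to_left(85, mu, sigma)
print(round(p_below, 4))     # 0.8413
print(round(p_between, 4))   # 0.6827
```

Areas read from a printed table are already rounded to four decimal places, so a table-based answer can differ from this one in the last digit.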

Procedure for finding values using Table A-2 and the formula

1. Sketch a normal distribution curve, enter the given probability or percentage in the appropriate region of the graph, & identify the x value(s) sought.

2. Use Table A-2 to find the z score corresponding to the cumulative left area bounded by x. Refer to the body of Table A-2 to find the closest area, then identify the corresponding z score.

3. Enter the values for mu, sigma, & the z score from Step 2, then solve for x as follows: x = mu + (z * sigma)

4. Refer to the curve sketch to verify the solution makes sense in the context of the graph & problem.
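Going from a known area back to an x value can be sketched the same way. Here `NormalDist().inv_cdf` replaces the reverse lookup in Table A-2, so the z score is not rounded to the table's precision. The population values are hypothetical.

```python
from statistics import NormalDist

def x_from_left_area(area, mu, sigma):
    """Find x such that the cumulative area to its left equals `area`."""
    z = NormalDist().inv_cdf(area)   # z score for the cumulative left area
    return mu + z * sigma            # x = mu + (z * sigma)

mu, sigma = 100, 15   # hypothetical population values
x_95 = x_from_left_area(0.95, mu, sigma)
print(round(x_95, 2))   # 124.67
```

If the given region is an area to the right, convert it first (left area = 1 - right area) before calling the function.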

Sampling Distribution of a statistic

(Such as a sample mean or sample proportion) is the distribution of all values of the statistic when all possible samples of the same size n are taken from the same population. (The sampling distribution of a statistic is typically represented as a probability distribution in the format of a table, probability histogram, or formula.)

The sampling distribution of the mean

The distribution of sample means, with all samples having the same sample size n taken from the same population. (The sampling distribution of the mean is typically represented as a probability distribution in the format of a table, probability histogram, or formula.)

Properties of the sampling distribution of the mean

The sample means target the value of the population mean. (that is, the mean of the sample means is the population mean. The expected value of the sample mean is = to the population mean.)

The distribution of sample means tends to be a normal distribution.

Sampling distribution of the variance

The distribution of sample variances, with all samples having the same sample size n taken from the same population. (The sampling distribution of the variance is typically represented as a probability distribution in the format of a table, probability histogram, or formula.)

Properties of the sampling distribution of the variance

The sample variances target the value of the population variance. (That is, the mean of the sample variances is the population variance. The expected value of the sample variance is = to the population variance.) The distribution of sample variances tends to be a distribution skewed to the right.

Sampling distribution of the proportion

The distribution of sample proportions, with all samples having the same sample size n taken from the same population.

Notation for proportions

p = population proportion

p"hat" = sample proportion

Properties of the sampling distribution of the proportion

The sample proportions target the value of the population proportion. (That is, the mean of the sample proportions is the population proportion. The expected value of the sample proportion is = to the population proportion.) The distribution of sample proportions tends to be a normal distribution.

Unbiased estimators

Sample means, variances, & proportions tend to target the corresponding population parameters. Sample means, variances, & proportions are unbiased estimators. Their sampling distributions have a mean that is = to the corresponding population parameter.

Unbiased estimators

These statistics are unbiased estimators. They target the value of the population parameter:

Mean: xbar

Variance: s^{2}

Proportion: p"hat"

Biased estimators

These statistics are biased estimators. They do not target the population parameter:

Median

Range

Standard deviation: s (Note: the sample standard deviations do not target the population standard deviation sigma, but the bias is relatively small in large samples, so s is often used to estimate sigma even though s is a biased estimator of sigma.)

Central Limit Theorem & the Sampling Distribution of xbar (Givens)

The random variable x has a distribution (which may or may not be normal) with mean mu and standard deviation sigma. Simple random samples all of the same size n are selected from the population. (The samples are selected so that all possible samples of size n have the same chance of being selected.)

Central Limit Theorem & the Sampling Distribution of xbar (Conclusions)

The distribution of sample means xbar will, as the sample size increases, approach a normal distribution. The mean of all sample means is the population mean mu. The standard deviation of all sample means is sigma/square root of n.

Central Limit Theorem & the Sampling Distribution of xbar (Practical rules commonly used)

If the original pop. is not normally distributed: For n > 30, the dist. of the sample means can be approx. reasonably well by a normal dist. (Exceptions: populations w/ very nonnormal dist. requiring sample sizes larger than 30, but this is relatively rare.) The distribution of sample means gets closer to a normal distribution as the sample size n becomes larger. If the original population is normally distributed, then for any sample size n, the sample means will be normally distributed.

Principles when selecting a simple random sample of n subjects from a population with mean mu & standard deviation sigma

For a population w/ any distribution, if n > 30, then the sample means have a distribution that can be approximated by a normal distribution with mean mu & standard deviation sigma/square root of n. If n < or = to 30 & the original population has a normal distribution, then the sample means have a normal distribution with mean mu & standard deviation sigma/square root of n. If n < or = to 30 & the original population does not have a normal distribution, then these methods do not apply.

Notation for the Sampling Distribution of xbar

If all possible random samples of size n are selected from a population with mean mu and standard deviation sigma, the mean of the sample means is denoted by mu_{xbar}, so

mu_{xbar} = mu

Also, the standard deviation of the sample means is denoted by sigma_{xbar}, so

sigma_{xbar} = sigma/square root of n

sigma_{xbar} is called the standard error of the mean.

Applying the central limit theorem for an individual value

When working with an individual value from a normally distributed population, use:

z = (x - mu)/sigma

Applying the central limit theorem for a sample of values

When working with a mean for some sample (or group), be sure to use the value of sigma/square root of n for the standard deviation of the sample means. Use:

z = (xbar - mu)/(sigma/square root of n)
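A short sketch contrasting the z score for an individual value with the z score for a sample mean. The population values and the sample size of 36 are hypothetical.

```python
import math

def z_individual(x, mu, sigma):
    """z score for a single value from the population."""
    return (x - mu) / sigma

def z_sample_mean(xbar, mu, sigma, n):
    """z score for the mean of a sample of size n."""
    standard_error = sigma / math.sqrt(n)   # sigma_xbar = sigma / sqrt(n)
    return (xbar - mu) / standard_error

mu, sigma = 100, 15   # hypothetical population values
print(round(z_individual(105, mu, sigma), 2))        # 0.33
print(round(z_sample_mean(105, mu, sigma, 36), 2))   # 2.0
```

The same value of 105 is unremarkable for one individual but much further out for the mean of 36 values, because sample means vary less than individual values.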

Finite population correction factor

When sampling without replacement & the sample size n is greater than 5% of the finite population size N (that is, n > .05N), adjust the standard deviation of sample means sigma_{xbar} by multiplying it by the finite population correction factor:

square root of [(N - n)/(N - 1)]
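A minimal sketch of applying the correction factor only when n exceeds 5% of N. The function name and the example values are assumptions.

```python
import math

def standard_error(sigma, n, N=None):
    """sigma_xbar, with the finite population correction when n > 0.05 * N."""
    se = sigma / math.sqrt(n)
    if N is not None and n > 0.05 * N:
        se *= math.sqrt((N - n) / (N - 1))   # finite population correction factor
    return se

# Hypothetical values: sigma = 15, sample of 100
print(round(standard_error(15, 100), 4))             # 1.5 (population size not given)
print(round(standard_error(15, 100, N=1000), 4))     # correction applied: 100 > 50
print(round(standard_error(15, 100, N=10000), 4))    # 1.5 (100 is not more than 500)
```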

Normal Distribution as an Approximation to the Binomial Distribution (Requirements)

The sample is a simple random sample of size n from a population in which the proportion of successes is p, or the sample is the result of conducting n independent trials of a binomial experiment in which the probability of success is p.

np > or = 5 and nq > or = to 5.

Normal Distribution as an Approximation to the Binomial Distribution (Normal approximation)

If the above requirements are satisfied, then the probability distribution of the random variable x can be approximated by a normal distribution with these parameters:

mu = np

sigma = square root of npq

Normal Distribution as an Approximation to the Binomial Distribution (Continuity Correction)

When using the normal approximation, adjust the discrete whole number x by using a continuity correction, so that x is represented by the interval from x - 0.5 to x + 0.5

Steps for using the normal distribution to approximate the binomial distribution

1. Check to see whether the normal approximation can be used

2. Find the mean and standard deviation

3. Write the problem in probability notation

4. Rewrite using the continuity correction factor

5. Show the corresponding area under the normal distribution curve

6. Find the corresponding z-values (using the continuity correction factor)

7. Find the solution

Continuity correction

When we use the normal distribution (which is a continuous probability distribution) as an approximation to the binomial distribution (which is discrete), a continuity correction is made to a discrete whole number x in the binomial distribution by representing the discrete whole number x by the interval from x - 0.5 to x + 0.5 (that is, adding & subtracting 0.5).

Continuity corrections statements

at least 8 (includes 8 & above) - area to the right of 7.5

More than 8 (doesn't include 8) - area to the right of 8.5

At most 8 (includes 8 & below) - area to the left of 8.5

Fewer than 8 (doesn't include 8) - area to the left of 7.5

Exactly 8 - area between 7.5 and 8.5
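The continuity correction statements above can be sketched as a single Python function. The `kind` labels, the function name, and the example (at least 60 successes in 100 trials with p = 0.5) are assumptions chosen for illustration.

```python
import math
from statistics import NormalDist

def normal_approx(kind, x, n, p):
    """Approximate a binomial probability with a normal distribution."""
    q = 1 - p
    if n * p < 5 or n * q < 5:
        raise ValueError("requires np >= 5 and nq >= 5")
    dist = NormalDist(n * p, math.sqrt(n * p * q))   # mu = np, sigma = sqrt(npq)
    if kind == "at least":      # includes x: area to the right of x - 0.5
        return 1 - dist.cdf(x - 0.5)
    if kind == "more than":     # excludes x: area to the right of x + 0.5
        return 1 - dist.cdf(x + 0.5)
    if kind == "at most":       # includes x: area to the left of x + 0.5
        return dist.cdf(x + 0.5)
    if kind == "fewer than":    # excludes x: area to the left of x - 0.5
        return dist.cdf(x - 0.5)
    if kind == "exactly":       # area between x - 0.5 and x + 0.5
        return dist.cdf(x + 0.5) - dist.cdf(x - 0.5)
    raise ValueError("unknown kind")

# Hypothetical example: at least 60 successes in 100 trials with p = 0.5
print(round(normal_approx("at least", 60, 100, 0.5), 4))   # 0.0287
```

Here mu = 50 and sigma = 5, so the boundary 59.5 corresponds to z = 1.90.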

Using probabilities to determine when results are unusual

x successes among n trials is an unusually high number of successes if P(x or more) is very small (such as 0.05 or less).

x successes among n trials is an unusually low number of successes if P(x or fewer) is very small (such as 0.05 or less).

Assessing normality

Visual inspection of a histogram to see if it is roughly bell-shaped;

identifying any outliers; constructing a graph called a normal quantile plot.

Normal quantile plot (or normal probability plot)

A graph of points (x,y) where each x value is from the original set of sample data, and each y value is the corresponding z score that is a quantile value expected from the standard normal distribution.

Normal distribution

The population distribution is normal if the pattern of the points is reasonably close to a straight line and the points do not show some systematic pattern that is not a straight-line pattern

Not a normal distribution

The population distribution is not normal if either or both of these two conditions applies: The points do not lie reasonably close to a straight line; the points show some systematic pattern that is not a straight-line pattern.

Point estimate

A single value (or point) used to approximate a population parameter.

The sample proportion p"hat" is the best point estimate of the population proportion (p).

Confidence interval (or interval estimate)

A range (or an interval) of values used to estimate the true value of a population parameter. A confidence interval is sometimes abbreviated CI.

Where alpha is the complement of the confidence level: for a 0.95 (or 95%) confidence level, alpha = 0.05. For a 0.99 (or 99%) confidence level, alpha = 0.01.

Confidence level

The probability 1 - alpha (often expressed as the equivalent percentage value) that the confidence interval actually does contain the population parameter, assuming that the estimation process is repeated a large number of times. (The confidence level is also called the degree of confidence, or the confidence coefficient.)

Critical value

The number on the borderline separating sample statistics that are likely to occur from those that are unlikely to occur. The number z_{alpha/2} is a critical value that is a z score with the property that it separates an area of alpha/2 in the right tail of the standard normal distribution.

Critical values are based on the following observations

Under certain conditions, the sampling distribution of sample proportions can be approximated by a normal distribution.

A z score associated with a sample proportion has a probability of alpha/2 of falling in the right tail.

The z score separating the right-tail region is commonly denoted by z_{alpha/2}, and is referred to as a critical value because it is on the borderline separating z scores from sample proportions that are likely to occur from those that are unlikely to occur.

Margin of error

Notation for confidence interval for estimating a population proportion p

p = population proportion

p"hat" = sample proportion

n = number of sample values

E = margin of error

z_{alpha/2} = z score separating an area of alpha/2 in the right tail of the standard normal distribution.

q"hat" = 1 - p"hat"

Requirements for Confidence intervals

The sample is a simple random sample. The conditions for the binomial distribution are satisfied: there are a fixed number of trials, the trials are independent, there are 2 categories of outcomes, & the probabilities remain constant for each trial. There are at least 5 successes & at least 5 failures (with the population proportions p & q unknown, we estimate their values using the sample proportion).

Confidence interval

p"hat" - E < p < p"hat" + E or p"hat" + or - E

E = z_{alpha/2} * square root of (p"hat" * q"hat"/n)

Procedure for constructing a confidence interval for p

Verify that the requirements are satisfied.

Refer to Table A-2 to find the critical value z_{alpha/2} that corresponds to the desired confidence level.

Evaluate the margin of error E = z_{alpha/2} * square root of (p"hat" * q"hat"/n).

Using that value for E & the value of p"hat", find the values of the confidence interval limits p"hat" - E & p"hat" + E. Substitute: p"hat" - E < p< p"hat" + E

Round the resulting confidence interval limits to three significant digits.
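The procedure above can be sketched in Python. `NormalDist().inv_cdf` is used in place of Table A-2 to get the critical value, and the survey numbers (520 successes among 1000) are hypothetical.

```python
import math
from statistics import NormalDist

def proportion_confidence_interval(successes, n, confidence=0.95):
    """Return (lower limit, upper limit, E) for a population proportion p."""
    failures = n - successes
    if successes < 5 or failures < 5:
        raise ValueError("need at least 5 successes and at least 5 failures")
    p_hat = successes / n
    q_hat = 1 - p_hat
    alpha = 1 - confidence
    z = NormalDist().inv_cdf(1 - alpha / 2)   # critical value z_{alpha/2}
    E = z * math.sqrt(p_hat * q_hat / n)      # margin of error
    return p_hat - E, p_hat + E, E

# Hypothetical sample: 520 successes among 1000 subjects
low, high, E = proportion_confidence_interval(520, 1000)
print(f"{low:.3g} < p < {high:.3g}")   # 0.489 < p < 0.551
```

The `.3g` format applies the rule of rounding the limits to three significant digits.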

Finding the sample size required to estimate a population proportion Notation

p = population proportion

p"hat" = sample proportion

n = number of sample values

E = desired margin of error

z_{alpha/2} = z score separating an area of alpha/2 in the right tail of the standard normal distribution

q"hat" = 1 - p"hat"

Finding the sample size required to estimate a population proportion Requirements

The sample must be a simple random sample of independent subjects:

When an estimate p"hat" is known:

n = [z_{alpha/2}]^{2}*p"hat"*q"hat"/E^{2}

When no estimate p"hat" is known:

n = [z_{alpha/2}]^{2}* 0.25/E^{2}
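Both sample size formulas can be sketched in one function. Rounding the result up to the next whole number is the usual convention but is not stated on this card, so it is an assumption here, as are the function name and the example margins.

```python
import math
from statistics import NormalDist

def sample_size_for_proportion(E, confidence=0.95, p_hat=None):
    """Sample size needed to estimate p within margin of error E."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)   # z_{alpha/2}
    # With no estimate of p"hat", use p"hat" * q"hat" = 0.25
    pq = 0.25 if p_hat is None else p_hat * (1 - p_hat)
    return math.ceil(z ** 2 * pq / E ** 2)   # round up (assumed convention)

print(sample_size_for_proportion(0.03))              # 1068
print(sample_size_for_proportion(0.03, p_hat=0.6))   # 1025
```

Using 0.25 gives the largest possible sample size, because p"hat" * q"hat" is at its maximum when p"hat" = 0.5.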

Finding the Point estimate of p

p"hat" = [(upper confidence interval limit) + (lower confidence interval limit)]/2

Finding Margin of Error from a confidence interval

E = [(upper confidence interval limit) - (lower confidence interval limit)]/2

Point estimate

The sample mean xbar is the best point estimate of the population mean.

Confidence Interval for Estimating a population Mean (with sigma known) Requirements:

Sample is a simple random sample. Value of pop. standard deviation sigma is known. Either or both of these is satisfied: the pop.is normally distributed or n>30. Confidence interval: xbar - E < mu < xbar + E where E = z_{alpha/2}*(sigma/square root of n)

mu = population mean; sigma = population standard deviation; xbar = sample mean; n= # of sample values; E = margin of error; z_{alpha/2} = z score separating an area of alpha/2 in the right tail of the standard normal distribution.
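A minimal sketch of this confidence interval, with the requirement check included. The sample values (n = 40, xbar = 172.55, sigma = 26) are hypothetical, and `NormalDist().inv_cdf` stands in for a table lookup of the critical value.

```python
import math
from statistics import NormalDist

def mean_ci_sigma_known(xbar, sigma, n, confidence=0.95, population_normal=False):
    """Return (lower limit, upper limit, E) for mu when sigma is known."""
    if not (population_normal or n > 30):
        raise ValueError("requires a normally distributed population or n > 30")
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)   # z_{alpha/2}
    E = z * sigma / math.sqrt(n)                         # margin of error
    return xbar - E, xbar + E, E

# Hypothetical summary statistics
low, high, E = mean_ci_sigma_known(xbar=172.55, sigma=26, n=40)
print(f"{low:.2f} < mu < {high:.2f}")
```

The limits are printed to two decimal places to match the number of decimal places in the sample mean, following the round-off rule for summary statistics.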

Round-off rule for confidence intervals used to estimate mu

When using the original set of data to construct a confidence interval, round the confidence interval limits to one more decimal place than is used for the original set of data. When the original set of data is unknown & only the summary statistics (n, Xbar, s) are used, round the confidence interval limits to the same number of decimal places used for the sample mean.

Determining sample size required to estimate the population mean mu

mu = population mean; sigma = population standard deviation; xbar = sample mean; E = desired margin of error; z_{alpha/2} = z score separating an area of alpha/2 in the right tail of the standard normal distribution. Requirement: The sample must be a simple random sample.

n = [(z_{alpha/2}*sigma)/E]^{2}
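The sample size formula can be sketched as follows. Rounding up to the next whole number is assumed as the convention, and sigma = 15 with E = 3 are hypothetical inputs.

```python
import math
from statistics import NormalDist

def sample_size_for_mean(sigma, E, confidence=0.95):
    """Sample size needed to estimate mu within margin of error E."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)   # z_{alpha/2}
    return math.ceil((z * sigma / E) ** 2)               # round up (assumed convention)

print(sample_size_for_mean(sigma=15, E=3))                    # 97
print(sample_size_for_mean(sigma=15, E=3, confidence=0.99))   # 166
```

Raising the confidence level or shrinking E increases the required sample size.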

Dealing with unknown sigma when finding sample size

Use the range rule of thumb to estimate the standard deviation (sigma = range/4). Start the sample collection process without knowing sigma, and using the 1st several values, calculate the sample standard deviation s and use it in place of sigma. The estimated value of sigma can then be improved as more sample data are obtained, & the sample size can be refined accordingly. Estimate the value of sigma by using the results of some other study that was done earlier.

Estimating a population mean: sigma not known

The sample mean xbar is the best point estimate of the population mean mu.

If sigma is not known, but the relevant requirements are satisfied, we use the Student t distribution (instead of a normal distribution).

Student t distribution

If a population has a normal distribution, then the distribution of

t = (xbar - mu)/(s/square root of n)

is a student t distribution for all samples of size n.

Degrees of freedom

The number of degrees of freedom (df) for a collection of sample data is the number of sample values that can vary after certain restrictions have been imposed on all data values.

df = n-1

Finding the confidence interval for estimating a population mean (with sigma not known)

mu = population mean; xbar = sample mean; s = sample standard deviation; n = # of sample values; E = margin of error; t_{alpha/2} = critical t value separating an area of alpha/2 in the right tail of the distribution. Requirements: the sample is a simple random sample; either the sample is from a normally distributed population or n > 30.

xbar - E < mu < xbar + E where E = t_{alpha/2}(s/square root of n) (df = n-1)

Important properties of the student t distribution

The student t distribution has the same general symmetric bell shape as the standard normal distribution, but it reflects the greater variability (with wider distributions) that is expected with small samples. The student t distribution has a mean of t = 0 (just as the standard normal distribution has a mean of z = 0).

Choosing between z and t

Use normal (z) distribution when sigma is known & normally distributed population or sigma is known and n > 30.

Use t distribution when sigma is not known & normally distributed population or sigma is not known & n > 30.

Use a nonparametric method or bootstrapping when the population is not normally distributed & n < or = 30.

Confidence interval for estimating a population standard deviation or variance

Requirements: the sample is a simple random sample; the population must have normally distributed values (even if the sample is large).

sigma = population standard deviation; s = sample standard deviation; n = # of sample values; X^{2}_{L} = left-tailed critical value of X^{2}; sigma^{2} = population variance; s^{2} = sample variance; E = margin of error; X^{2}_{R} = right-tailed critical value of X^{2}

The two main activities of inferential statistics are using sample data to:

estimate a population parameter (such as with a confidence interval) and test a hypothesis or claim about a population parameter.

Hypothesis

A claim or statement about a property of a population.

Hypothesis test (or test of significance)

A procedure for testing a claim about a property of a population.

Power

The power of a hypothesis test is the probability (1 - beta) of rejecting a false null hypothesis. The value of the power is computed by using a particular significance level alpha and a particular value of the population parameter that is an alternative to the value assumed true in the null hypothesis.

Correlation

A correlation exists between two variables when the values of one variable are somehow associated with the values of the other variable.

Linear correlation coefficient (Pearson product moment correlation coefficient)

"r" measures the strength of the linear correlation between the paired quantitative x and y values in a sample.

Regression equation

Regression line

Given a collection of paired sample data, the regression equation:

y(hat) = b_{0} + b_{1}*x

algebraically describes the relationship between the two variables x & y.

The graph of the regression equation is called the regression line (or line of best fit, or least-squares line).

Marginal change

In working with 2 variables related by a regression equation, the marginal change in a variable is the amount that it changes when the other variable changes by exactly one unit. The slope b_{1} in the regression equation represents the marginal change in y that occurs when x changes by one unit.

Influential points

Paired sample data may include one or more influential points, which are points that strongly affect the graph of the regression line.

Residual

For a pair of sample x & y values, the residual is the difference between the observed sample value of y and the y value that is predicted by using the regression equation. That is,

residual = observed y - predicted y = y - y"hat"

Least-squares property

A straight line satisfies the least-squares property if the sum of the squares of the residuals is the smallest sum possible.

Residual plot

A scatterplot of the (x, y) values after each of the y coordinate values has been replaced by the residual value y - y"hat" (where y"hat" denotes the predicted value of y). That is, a residual plot is a graph of the points (x, y - y"hat")

Total deviation

The total deviation of (x, y) is the vertical distance y - ybar, which is the distance between the point (x, y) and the horizontal line passing through the sample mean ybar.

Explained deviation

The vertical distance y"hat" - ybar, which is the distance between the predicted y value & the horizontal line passing through the sample mean ybar.

Unexplained deviation

The vertical distance y - y"hat", which is the vertical distance between the point (x, y) and the regression line. (The distance y - y"hat" is also called a residual.)

Coefficient of determination

The amount of the variation in y that is explained by the regression line. It is computed as

r^{2} = explained variation/total variation

Prediction interval

An interval estimate of a predicted value of y.

Standard error of estimate s_{e}

A measure of the differences (or distances) between the observed sample y values & the predicted values y"hat" that are obtained using the regression equation. It is given as

s_{e} = square root of [Sum (y - y"hat")^{2}/(n - 2)]

or

s_{e} = square root of [(Sum y^{2} - b_{0} * Sum y - b_{1} * Sum xy)/(n - 2)]
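The regression quantities defined on these cards (residuals, standard error of estimate, and coefficient of determination) can be sketched together in Python. The least-squares formulas for b_{0} and b_{1} are the standard ones and do not appear on these cards, and the five data pairs are made up for illustration.

```python
import math

def fit_line(xs, ys):
    """Least-squares intercept b0 and slope b1 (standard formulas, assumed)."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    b1 = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
          / sum((x - x_bar) ** 2 for x in xs))
    b0 = y_bar - b1 * x_bar
    return b0, b1

xs = [1, 2, 3, 4, 5]   # hypothetical paired sample data
ys = [2, 4, 5, 4, 5]
n = len(xs)
b0, b1 = fit_line(xs, ys)

predicted = [b0 + b1 * x for x in xs]                        # y-hat values
residuals = [y - y_hat for y, y_hat in zip(ys, predicted)]   # y - y-hat

# Standard error of estimate, both forms
s_e = math.sqrt(sum(r ** 2 for r in residuals) / (n - 2))
sum_xy = sum(x * y for x, y in zip(xs, ys))
s_e_short = math.sqrt((sum(y * y for y in ys) - b0 * sum(ys) - b1 * sum_xy) / (n - 2))

# Coefficient of determination: explained variation / total variation
y_bar = sum(ys) / n
explained = sum((y_hat - y_bar) ** 2 for y_hat in predicted)
total = sum((y - y_bar) ** 2 for y in ys)
r_squared = explained / total

print(round(b0, 2), round(b1, 2))   # 2.2 0.6
print(round(s_e, 4))                # 0.8944
print(round(r_squared, 2))          # 0.6
```

The two forms of s_{e} agree up to floating-point rounding, which is a useful check when computing by hand.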

Multiple regression equation

Expresses a linear relationship between a response variable y & two or more predictor variables (x_{1}, x_{2}, ..., x_{k})

Adjusted coefficient of determination

The multiple coefficient of determination R^{2} modified to account for the number of variables and the sample size.

Null hypothesis H_{0}

A statement that the value of a population parameter (such as proportion, mean, or standard deviation) is = some claimed value. (The term null is used to indicate no change or no effect or no difference).

Example: H_{0}: p = 0.5

We test the null hypothesis directly in the sense that we assume (or pretend) it is true & reach a conclusion to either reject it or fail to reject it.

Alternative Hypothesis H_{1} or H_{a}

A statement that the parameter has a value that somehow differs from the null hypothesis.

Type I error in Hypothesis Tests

The mistake of rejecting the null hypothesis when it is actually true. The symbol alpha is used to represent the probability of a type I error.

Type II error in hypothesis tests

The mistake of failing to reject the null hypothesis when it is actually false. The symbol beta is used to represent the probability of a type II error.