# Statistics

**Created:** 2014-08-22

**Last Modified:** 2014-08-22

Must consider:

Context of the data;

Source of the data;

Sampling method;

Conclusions; and

Practical implications

What is the context of the data?

What is the source of the data?

How were the data obtained?

What can we conclude from the data?

Based on statistical conclusions, what practical implications result from our analysis?

1. Identify the study's goal, the population considered, & the study type.

2. Consider the source, esp. w/regard to the possibility of bias.

3. Analyze the sampling method.

4. Look for problems defining/measuring the variables of interest.

5. Watch for confounding variables that could invalidate conclusions.

6. Consider the setting & wording of any survey.

7. Check that graphs represent data fairly, & that conclusions are justified.

8. Consider if conclusions achieve the study's goals, make sense, & have practical significance.

All subjects in a study must give their informed consent.

All results from individuals must remain confidential.

The well-being of study subjects must always take precedence over the benefits to society.

Quantitative (or numerical) data

- nominal
- ordinal
- interval
- ratio

Parameter vs. Statistic

In a large sample of households, the median annual income per household for high school graduates is $19,856 (based on data from the U.S. Census Bureau).

Parameter vs. Statistic

A study of all 2223 passengers aboard the Titanic found that 706 survived when it sank.

Parameter vs. Statistic

If the areas of the 50 states are added and the sum is divided by 50, the result is 196,533 square kilometers.

Parameter vs. Statistic

The author measured the voltage supplied to his home on 40 different days, and the average (mean) value is 123.7 volts.

Discrete or Continuous Data Set

In New York City, there are 3250 walk buttons that pedestrians can press at traffic intersections, and 2500 of them do not work. (based on..._

Discrete or Continuous Data Set

The amount of nicotine in a Marlboro cigarette is 1.2 milligrams.

Discrete or Continuous Data Set

In a test of a method of gender selection developed by the Genetics & IVF Institute, 726 couples used the XSORT method and 668 of them had baby girls.

Discrete or Continuous Data Set

When a Cadillac STS is randomly selected and weighed, it is found to weigh 1827.9 kg.

Nominal, Ordinal, Interval, Ratio

Voltage measurements from the author's home

Nominal, Ordinal, Interval, Ratio

Critic ratings of movies on a scale from 0 star to 4 stars

Nominal, Ordinal, Interval, Ratio

Companies (Disney, MGM, Warner Brothers, Universal, 20th Century Fox) that produced the movies listed in Data Set 7 in Appendix B.

Nominal, Ordinal, Interval, Ratio

Years in which movies were released, as listed in Data Set 9 in Appendix B.

Select every *k*th (such as every 50th) element in the population.

select some of those clusters, and then choose all the members from those selected clusters.

The measure of variation most commonly used in statistics. The standard deviation of a set of sample values, denoted by s, is a measure of variation of values about the mean, calculated using:

$s = \sqrt{\dfrac{\sum (x - \bar{x})^{2}}{n - 1}}$

Symbol = *s*^{2}

It is an unbiased estimate of the population variance. The values of s squared tend to target the value of sigma squared instead of systematically tending to overestimate or underestimate sigma squared.

The principle that for many data sets, the vast majority (such as 95%) of sample values lie within two standard deviations of the mean.

Minimum "usual" value = (mean) - 2 x (standard deviation).

Maximum "usual" value = (mean) + 2 x (standard deviation).
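A minimal Python sketch of the sample standard deviation and the range rule of thumb described above. The data values and the helper names `sample_std` and `usual_range` are hypothetical, chosen only for illustration.

```python
import math

def sample_std(values):
    """Sample standard deviation s (divides by n - 1)."""
    n = len(values)
    mean = sum(values) / n
    return math.sqrt(sum((x - mean) ** 2 for x in values) / (n - 1))

def usual_range(mean, std):
    """Range rule of thumb: minimum and maximum 'usual' values."""
    return mean - 2 * std, mean + 2 * std

data = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical sample values
xbar = sum(data) / len(data)
s = sample_std(data)
low, high = usual_range(xbar, s)
print(f"mean = {xbar}, s = {s:.3f}, usual values from {low:.3f} to {high:.3f}")
```

Values outside the printed interval would be flagged as unusual under this rule.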

The CV for a set of nonnegative sample or population data, expressed as a %, describes the standard deviation relative to the mean, and is given by the following:

Sample:

CV = (s/xbar) multiplied by 100%

Population:

CV = (sigma/mu) multiplied by 100%

A measure of relative standing. The number of standard deviations that a given value x is above or below the mean. Formula:

Sample:

z = (x - xbar)/s

Population:

z = (x - mu)/sigma

Ordinary values: -2 < or = z score < or = 2

Unusual values: z score < -2 or z score > 2
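As a rough illustration of the z score formula and the ordinary/unusual cutoffs, here is a short Python sketch. The numbers (a value of 76.2 from a population with mean 69.0 and standard deviation 2.8) are made up, and the function names are hypothetical.

```python
def z_score(x, mean, std):
    """Number of standard deviations that x lies above or below the mean."""
    return (x - mean) / std

def is_unusual(z):
    """Unusual values have z < -2 or z > 2; everything else is ordinary."""
    return z < -2 or z > 2

z = z_score(76.2, 69.0, 2.8)  # hypothetical value, mean, and standard deviation
print(f"z = {z:.2f}, unusual: {is_unusual(z)}")
```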

Measures of location, denoted by P_{1}, P_{2}, ..., P_{99}, which divide a set of data into 100 groups with about 1% of the values in each group.

Percentile of value x = (number of values less than x / total number of values) multiplied by 100. Round to the nearest whole number.
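The percentile formula above can be sketched in Python as follows. The scores are hypothetical, and rounding a tie (such as 12.5) upward is an assumption, since the notes only say to round to the nearest whole number.

```python
import math

def percentile_of(x, values):
    """Percentile of x: (count of values less than x / total count) * 100."""
    below = sum(1 for v in values if v < x)
    raw = below / len(values) * 100
    # Round to the nearest whole number; ties go up (an assumption).
    return math.floor(raw + 0.5)

scores = [47, 52, 55, 58, 61, 63, 66, 70, 74, 81]  # hypothetical data
print(percentile_of(63, scores))  # 5 of the 10 values are below 63
```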

Separates the bottom 25% of the sorted values from the top 75%.

Q_{1} = P_{25}

Same as the median; separates the bottom 50% of the sorted values from the top 50%.

Q_{2} = P_{50}

At least 75% of the sorted values are < or = to Q_{3} & at least 25% of the values are > or = to Q_{3}.

Q_{3} = P_{75}

Test sensitivity: the probability of a true positive.

Test specificity: the probability of a true negative.

Conduct or observe a procedure, & count the # of times that event A actually occurs. Based on the actual results, P(A) is approximated as follows:

P(A) = # of times A occurred/# of times the procedure was repeated.

Assume that a given procedure has n different simple events and that each of those simple events has an equal chance of occurring. If event A can occur in s of these n ways, then:

P(A) = # of ways A can occur/# of different simple events = s/n
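To contrast the two formulas above, this Python sketch computes the probability of rolling an even number on a fair die both ways. The die example, the seeded simulation, and all function names are assumptions made for illustration.

```python
import random
from fractions import Fraction

def classical_probability(ways_a, n_simple_events):
    """Classical approach: s / n for equally likely simple events."""
    return Fraction(ways_a, n_simple_events)

def relative_frequency(event, procedure, repetitions, rng):
    """Relative frequency approach: times A occurred / times repeated."""
    hits = sum(1 for _ in range(repetitions) if event(procedure(rng)))
    return hits / repetitions

def roll_die(rng):
    return rng.randint(1, 6)

def is_even(outcome):
    return outcome % 2 == 0

rng = random.Random(0)  # fixed seed so the run is repeatable
p_classical = classical_probability(3, 6)  # 3 even faces out of 6
p_estimate = relative_frequency(is_even, roll_die, 10_000, rng)
print(p_classical, p_estimate)
```

With many repetitions the relative frequency estimate should land close to the classical value of 1/2.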

Relative frequency approach

Classical approach

Subjective probability

1

0 < or = P(A) < or = 1

The probability that an event does not occur. The complement of event A, denoted by Abar, consists of all outcomes in which event A does not occur.

P(A or Abar) = P(A) + P(Abar) = 1

The ratio P(A)/P(Abar), which is the reciprocal of the actual odds against that event. If the odds against A are a:b, then the odds in favor of A are b:a.

Any event combining two or more simple events.

P(A or B) = P(in a single trial, event A occurs or event B occurs or they both occur)

P(A or B) = P(A) + P(B) - P(A and B)

where P(A and B) denotes the probability that A and B both occur at the same time as an outcome in a trial of a procedure.

P(A) + P(Abar) = 1

P(Abar) = 1 - P(A)

P(A) = 1 - P(Abar)

P(A and B) = P(A) x P(B|A)

If A and B are independent events, P(B|A) is the same as P(B).

Use the symbol A to denote the event of getting at least one.

Let Abar represent the event of getting none of the items being considered.

Calculate the probability that none of the outcomes results in the event being considered.

Subtract the result from 1.

P(at least one) = 1 - P(none)
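A small Python sketch of the "at least one" steps above. It assumes independent trials with the same probability on each trial, which the notes do not state explicitly for this rule, and the example (at least one girl in 3 births with P(girl) = 0.5) is hypothetical.

```python
def p_at_least_one(p_single, n_trials):
    """P(at least one) = 1 - P(none), assuming independent identical trials."""
    p_none = (1 - p_single) ** n_trials  # probability the event never occurs
    return 1 - p_none

print(p_at_least_one(0.5, 3))  # 1 - 0.5**3
```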

A probability obtained with the additional information that some other event has already occurred. P(B|A) denotes the conditional probability of event B occurring, given that event A has already occurred. P(B|A) can be found by dividing the probability of events A & B both occurring by the probability of event A:

P(B|A)= P(A and B)/P(A)
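One way to check the conditional probability formula is with exact fractions. In this Python sketch the card example is hypothetical: A is "the card is a face card" and B is "the card is a king" when one card is drawn from a standard 52-card deck.

```python
from fractions import Fraction

def conditional_probability(p_a_and_b, p_a):
    """P(B|A) = P(A and B) / P(A); undefined when P(A) is 0."""
    if p_a == 0:
        raise ValueError("P(A) must be greater than 0")
    return p_a_and_b / p_a

p_a = Fraction(12, 52)       # 12 face cards
p_a_and_b = Fraction(4, 52)  # 4 kings, all of which are face cards
p_b_given_a = conditional_probability(p_a_and_b, p_a)
print(p_b_given_a)
```

Multiplying back, P(A) x P(B|A) recovers P(A and B), which matches the multiplication rule.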

Denotes the product of decreasing positive whole numbers.

4! = 4x3x2x1 = 24

By special definition, 0! = 1

There are n different items available.

We select r of the n items (without replacement).

We consider rearrangements of the same items to be different sequences. (The permutation of ABC is different from CBA & is counted separately).

If the preceding requirements are satisfied, the # of permutations (or sequences) of r items selected from n different available items (w/o replacement) is

_{n}P_{r}= n!/(n-r)!

There are n items available, & some items are identical to others.

We select all of the n items (w/o replacement).

We consider rearrangements of distinct items to be different sequences.

If the preceding requirements are satisfied, & if there are n_{1} alike, n_{2} alike, ..., n_{k} alike, the # of permutations (or sequences) of all items selected w/o replacement is

n!/(n_{1}! n_{2}! ... n_{k}!)

Combinations rule

Requirements:

There are n different items available.

We select r of the n items (w/o replacement).

We consider rearrangements of the same items to be the same. (The combination ABC is the same as CBA.)

If the preceding requirements are satisfied, the # of combinations of r items selected from n different items is

_{n}C_{r} = n!/[(n-r)! r!]
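The three counting rules can be compared side by side in Python. `math.perm` and `math.comb` are standard-library functions (Python 3.8+); the helper `permutations_with_identical` and the example inputs are hypothetical.

```python
import math

def permutations_with_identical(counts):
    """n! / (n1! n2! ... nk!) where counts lists how many items are alike."""
    n = sum(counts)
    denominator = 1
    for c in counts:
        denominator *= math.factorial(c)
    return math.factorial(n) // denominator

print(math.perm(5, 3))  # nPr: ordered selections of 3 from 5 different items
print(math.comb(5, 3))  # nCr: unordered selections of 3 from 5 different items
# Letters of MISSISSIPPI: 1 M, 4 I, 4 S, 2 P
print(permutations_with_identical([1, 4, 4, 2]))
```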

Summation of P(x) = 1 where x assumes all possible values (The sum of all probabilities must be 1, but values such as .999 or 1.001 are acceptable because they result from rounding errors.)

0 < or = P(x) < or = 1 for every individual value of x. (That is, each probability value must be between 0 & 1 inclusive).

Formulas for the

mean,

variance - easier to understand

variance - easier computations

standard deviation

for a probability distribution

mu = Sum[x * P(x)]

sigma^{2} = Sum[(x - mu)^{2} * P(x)]

sigma^{2} = Sum[x^{2} * P(x)] - mu^{2}

sigma = square root of (Sum[x^{2} * P(x)] - mu^{2})

sigma^{2} to one decimal place.

maximum usual value = mu + 2sigma

minimum usual value = mu - 2sigma
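The formulas for a probability distribution can be sketched in Python as below. The distribution (number of heads in two fair coin tosses) is a hypothetical example, and the function name is made up for illustration.

```python
import math

def distribution_summary(dist):
    """dist maps each value x to P(x). Returns mu, sigma^2 (both forms), sigma."""
    total = sum(dist.values())
    if abs(total - 1) > 0.001:  # allow small rounding error in the table
        raise ValueError("probabilities must sum to 1")
    mu = sum(x * p for x, p in dist.items())
    var_definition = sum((x - mu) ** 2 * p for x, p in dist.items())
    var_shortcut = sum(x ** 2 * p for x, p in dist.items()) - mu ** 2
    return mu, var_definition, var_shortcut, math.sqrt(var_shortcut)

heads = {0: 0.25, 1: 0.50, 2: 0.25}  # hypothetical: heads in 2 coin tosses
mu, var_def, var_short, sigma = distribution_summary(heads)
print(mu, var_def, var_short, sigma)
print("usual values:", mu - 2 * sigma, "to", mu + 2 * sigma)
```

Both variance formulas give the same result; the second is simply easier to compute by hand.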

Rare event rule for inferential statistics

x successes among n trials is an unusually high # of successes if the probability of x or more successes is unlikely, with a probability of .05 or less. This criterion can be expressed as follows:

P(x or more) < or = .05

x successes among n trials is an unusually low # of successes if the probability of x or fewer successes is unlikely, with a probability of .05 or less. This criterion can be expressed as follows:

P(x or fewer) < or = .05.

The procedure has a fixed number of trials.

The trials must be independent (The outcome of any individual trial doesn't affect the probabilities in the other trials).

Each trial must have all outcomes classified into two categories (commonly referred to as success & failure).

The probability of a success remains the same in all trials.
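When the four requirements above hold, probabilities of x successes can be computed with the standard binomial formula P(x) = nCx * p^x * q^(n-x). That formula is not written out in these notes, so treat this Python sketch as an assumption-labeled illustration; the example (13 successes in 14 trials with p = 0.5) is hypothetical.

```python
import math

def binomial_pmf(x, n, p):
    """P(exactly x successes in n independent trials), success probability p."""
    return math.comb(n, x) * p ** x * (1 - p) ** (n - x)

def prob_x_or_more(x, n, p):
    return sum(binomial_pmf(k, n, p) for k in range(x, n + 1))

def prob_x_or_fewer(x, n, p):
    return sum(binomial_pmf(k, n, p) for k in range(0, x + 1))

p_high = prob_x_or_more(13, 14, 0.5)
# Unusually high number of successes if P(x or more) <= 0.05
print(p_high, p_high <= 0.05)
```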

A discrete probability distribution that applies to occurrences of some event over a specified interval. The random variable x is the # of occurrences of the event in an interval. The interval can be time, distance, area, volume, or some similar unit.

P(x) = (mu^{x} * e^{-mu})/x! where e = approx. 2.71828

The random variable x is the # of occurrences of an event over some interval.

The occurrences must be random.

The occurrences must be independent of each other.

The occurrences must be uniformly distributed over the interval being used.

The mean is mu.

The standard deviation is

sigma = the square root of mu.
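A short Python sketch of the Poisson formula and its parameters. The mean of 2 occurrences per interval is a hypothetical value.

```python
import math

def poisson_pmf(x, mu):
    """P(x) = mu^x * e^(-mu) / x! for x = 0, 1, 2, ..."""
    return mu ** x * math.exp(-mu) / math.factorial(x)

mu = 2.0               # hypothetical mean number of occurrences per interval
sigma = math.sqrt(mu)  # standard deviation of a Poisson distribution
for x in range(5):
    print(x, round(poisson_pmf(x, mu), 4))
print("sigma =", round(sigma, 4))
```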

The binomial distribution is affected by the sample size n & the probability p, whereas the Poisson distribution is affected only by the mean mu.

In a binomial distribution, the possible values of the random variable x are 0, 1...n, but a Poisson distribution has possible x values of 0, 1, 2,..., with no upper limit.

n > or = 100

np < or = 10

Formula: mu = np
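To see how close the Poisson approximation is when n > or = 100 and np < or = 10, this Python sketch compares it with the exact binomial probability. The binomial formula is the standard one (not stated in these notes), and n = 100, p = 0.02 are hypothetical values.

```python
import math

def binomial_pmf(x, n, p):
    return math.comb(n, x) * p ** x * (1 - p) ** (n - x)

def poisson_pmf(x, mu):
    return mu ** x * math.exp(-mu) / math.factorial(x)

def poisson_approx_ok(n, p):
    """Requirements for using Poisson to approximate binomial."""
    return n >= 100 and n * p <= 10

n, p = 100, 0.02  # hypothetical binomial setting
mu = n * p        # Poisson mean used for the approximation
if poisson_approx_ok(n, p):
    for x in range(4):
        print(x, round(binomial_pmf(x, n, p), 4), round(poisson_pmf(x, mu), 4))
```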

If a continuous random variable has a distribution with a graph that is symmetric & bell-shaped, & it can be described by the equation below, we say it has a normal distribution.

$y = \dfrac{e^{-\frac{1}{2}\left(\frac{x - \mu}{\sigma}\right)^{2}}}{\sigma\sqrt{2\pi}}$

Properties:

Its graph is bell-shaped.

Its mean is = to 0 (that is, mu = 0).

Its standard deviation is = to 1 (that is, sigma = 1).

The area under the graph of a probability distribution is = to 1.

There is a correspondence between area & probability (or relative frequency), so some probabilities can be found by identifying the corresponding areas.

Requirements for a density curve

The total area under the curve must = 1. Therefore there is a correspondence between area & probability.

Every point on the curve must have a vertical height that is 0 or greater. (That is, the curve cannot fall below the x-axis.)

Table A-2

Designed only for the standard normal distribution, which has a mu of 0 & a sigma of 1.

Left page is negative; right is positive z scores.

Each value in the body of the table is a cumulative area from the left up to a vertical boundary above a specific z score.

Z scores: distance along the horizontal scale of the standard normal distribution; refer to the leftmost column & top row of Table A-2.

Area: region under the curve; values in body of A-2.

Draw a bell-shaped curve & identify the region under the curve that corresponds to the given probability. If that region is not a cumulative region from the left, work instead with a known region that is a cumulative region from the left.

Table A-2: Using the cumulative area from the left, locate the closest probability in the body of Table A-2 & identify the corresponding z score.

If we convert values to standard z-scores using the formula below, then procedures for working with all normal distributions are the same as those for the standard normal distribution.

z = (x - mu)/sigma (round z scores to 2 decimal places).

To find areas with a nonstandard normal distribution:

Sketch a normal curve, label the mean and the specific x values, then shade the region representing the desired probability.

For each relevant value x that is a boundary for the shaded region, use the formula to convert that value to the equivalent z score: z = (x - mu)/sigma

Refer to Table A-2 to find the area of the shaded region. This area is the desired probability.
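The procedure above can be mimicked in Python, with `statistics.NormalDist` standing in for Table A-2. This is only a sketch: the population values (mu = 100, sigma = 15) are hypothetical, and z scores are rounded to 2 decimal places to mirror the table lookup.

```python
from statistics import NormalDist

STANDARD_NORMAL = NormalDist()  # mu = 0, sigma = 1, like Table A-2

def area_left(x, mu, sigma):
    """Cumulative area to the left of x for a normal(mu, sigma) population."""
    z = round((x - mu) / sigma, 2)
    return STANDARD_NORMAL.cdf(z)

def area_between(low, high, mu, sigma):
    """Area between two x values: difference of two cumulative left areas."""
    return area_left(high, mu, sigma) - area_left(low, mu, sigma)

print(round(area_left(115, 100, 15), 4))
print(round(area_between(85, 115, 100, 15), 4))
```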

1. Sketch a normal distribution curve, enter the given probability or percentage in the appropriate region of the graph, & identify the x value(s) sought.

2. Use Table A-2 to find the z score corresponding to the cumulative left area bounded by x: refer to the body of Table A-2 to find the closest area, then identify the corresponding z score.

3. Enter the values for mu, sigma, & the z score from Step 2, then solve for x as follows: x = mu + (z * sigma)

4. Refer to the curve sketch to verify that the solution makes sense in the context of the graph & problem.
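Going from a probability back to an x value can be sketched the same way. Here `NormalDist.inv_cdf` plays the role of the reverse Table A-2 lookup, so the z score is slightly more precise than the closest table entry; mu = 100, sigma = 15, and the 0.95 cumulative area are hypothetical.

```python
from statistics import NormalDist

def x_from_left_area(area, mu, sigma):
    """Find x such that the cumulative area to the left of x equals area."""
    z = NormalDist().inv_cdf(area)  # z score for the cumulative left area
    return mu + z * sigma           # x = mu + (z * sigma)

x = x_from_left_area(0.95, 100, 15)
print(round(x, 2))
```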

The sample means target the value of the population mean. (that is, the mean of the sample means is the population mean. The expected value of the sample mean is = to the population mean.)

The distribution of sample means tends to be a normal distribution.

p = population proportion

p"hat" = sample proportion

These statistics are unbiased estimators. They target the value of the population parameter:

Mean: xbar

Variance: s^{2}

Proportion: p"hat"

These statistics are biased estimators. They do not target the population parameter:

Median

Range

Standard deviation: s (Note: the sample standard deviations do not target the population standard deviation sigma, but the bias is relatively small in large samples, so s is often used to estimate sigma even though s is a biased estimator of sigma.)

If all possible random samples of size n are selected from a population with mean mu and standard deviation sigma, the mean of the sample means is denoted by mu_{xbar}, so

mu_{xbar} = mu

Also, the standard deviation of the sample means is denoted by sigma_{xbar}, so

sigma_{xbar} = sigma/square root of n

sigma_{xbar} is called the standard error of the mean.

When working with an individual value from a normally distributed population, use:

z = (x - mu)/sigma

When working with a mean for some sample (or group), be sure to use the value of sigma/square root of n for the standard deviation of the sample means. Use:

z = (xbar - mu)/(sigma/square root of n)

When sampling without replacement & the sample size n is greater than 5% of the finite population size N (that is, n > .05N), adjust the standard deviation of sample means sigma_{xbar} by multiplying it by the finite population correction factor:

square root of [(N - n)/(N - 1)]
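A Python sketch of the standard error of the mean, the z score for a sample mean, and the finite population correction. All numbers (sigma = 29, n = 20, mu = 172, xbar = 174, N = 200) and function names are hypothetical.

```python
import math

def standard_error(sigma, n, population_size=None):
    """sigma / sqrt(n), with the finite population correction when n > 0.05N."""
    se = sigma / math.sqrt(n)
    if population_size is not None and n > 0.05 * population_size:
        se *= math.sqrt((population_size - n) / (population_size - 1))
    return se

def z_for_sample_mean(xbar, mu, sigma, n, population_size=None):
    """z = (xbar - mu) / (standard deviation of the sample means)."""
    return (xbar - mu) / standard_error(sigma, n, population_size)

print(round(z_for_sample_mean(174, 172, 29, 20), 2))       # large population
print(round(z_for_sample_mean(174, 172, 29, 20, 200), 2))  # N = 200, corrected
```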

The sample is a simple random sample of size n from a population in which the proportion of successes is p, or the sample is the result of conducting n independent trials of a binomial experiment in which the probability of success is p.

np > or = 5 and nq > or = to 5.

If the above requirements are satisfied, then the probability distribution of the random variable x can be approximated by a normal distribution with these parameters:

mu = np

sigma = square root of npq

1. Check to see whether the normal approximation can be used

2. Find the mean and standard deviation

3. Write the problem in probability notation

4. Rewrite using the continuity correction factor

5. Show the corresponding area under the normal distribution curve

6. Find the corresponding z-values (using the continuity correction factor)

7. Find the solution

at least 8 (includes 8 & above) - area to the right of 7.5

More than 8 (doesn't include 8) - area to the right of 8.5

At most 8 (includes 8 & below) - area to the left of 8.5

Fewer than 8 (doesn't include 8) - area to the left of 7.5

Exactly 8 - area between 7.5 and 8.5
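The continuity corrections listed above translate directly into code. This Python sketch uses `statistics.NormalDist` in place of the table; the function name, the phrase strings, and the example (n = 100, p = 0.5) are hypothetical.

```python
import math
from statistics import NormalDist

def normal_approx_binomial(n, p, phrase, x):
    """Approximate a binomial probability using a normal distribution."""
    q = 1 - p
    if n * p < 5 or n * q < 5:
        raise ValueError("requires np >= 5 and nq >= 5")
    dist = NormalDist(mu=n * p, sigma=math.sqrt(n * p * q))
    if phrase == "at least":    # area to the right of x - 0.5
        return 1 - dist.cdf(x - 0.5)
    if phrase == "more than":   # area to the right of x + 0.5
        return 1 - dist.cdf(x + 0.5)
    if phrase == "at most":     # area to the left of x + 0.5
        return dist.cdf(x + 0.5)
    if phrase == "fewer than":  # area to the left of x - 0.5
        return dist.cdf(x - 0.5)
    if phrase == "exactly":     # area between x - 0.5 and x + 0.5
        return dist.cdf(x + 0.5) - dist.cdf(x - 0.5)
    raise ValueError(f"unknown phrase: {phrase}")

print(round(normal_approx_binomial(100, 0.5, "at least", 60), 4))
print(round(normal_approx_binomial(100, 0.5, "exactly", 50), 4))
```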

x successes among n trials is an unusually high number of successes if P(x or more) is very small (such as 0.05 or less).

x successes among n trials is an unusually low number of successes if P(x or fewer) is very small (such as 0.05 or less).

Visual inspection of a histogram to see if it is roughly bell-shaped;

identifying any outliers; constructing a graph called a normal quantile plot.

A single value (or point) used to approximate a population parameter.

The sample proportion p"hat" is the best point estimate of the population proportion (p).

A range (or an interval) of values used to estimate the true value of a population parameter. A confidence interval is sometimes abbreviated CI.

Where alpha is the complement of the confidence level. For a 0.95 (or 95%) confidence level, alpha = 0.05. For a 0.99 (or 99%) confidence level, alpha = 0.01.

z_{alpha/2} is a critical value that is a z score with the property that it separates an area of alpha/2 in the right tail of the standard normal distribution.

Under certain conditions, the sampling distribution of sample proportions can be approximated by a normal distribution.

A z score associated with a sample proportion has a probability of alpha/2 of falling in the right tail.

The z score separating the right-tail region is commonly denoted by z_{alpha/2}, and is referred to as a critical value because it is on the borderline separating z scores from sample proportions that are likely to occur from those that are unlikely to occur.

When using data from a simple random sample to estimate a population proportion p, the margin of error, (E), is the maximum likely difference (w/ probability 1 - alpha, such as 0.95) between observed sample proportion p"hat" & true value of the population proportion p. To find E (or the maximum error of the estimate) multiply the critical value & standard deviation of sample proportions,

p = population proportion

p"hat" = sample proportion

n = number of sample values

E = margin of error

z_{alpha/2} = z score separating an area of alpha/2 in the right tail of the standard normal distribution.

q"hat" = 1 - p"hat"

p"hat" - E<p<p"hat" + E or p"hat" + or - E

E = z_{alpha/2} * (square root of p"hat"*q"hat"/n)

Verify that the requirements are satisfied.

Refer to Table A-2 to find the critical value z_{alpha/2} that corresponds to the desired confidence level.

Evaluate the margin of error E = z_{alpha/2} * (square root of p"hat" * q"hat"/n).

Using that value for E & the value of p"hat", find the values of the confidence interval limits p"hat" - E & p"hat" + E. Substitute: p"hat" - E < p< p"hat" + E

Round the resulting confidence interval limits to three significant digits.
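The confidence interval procedure for a proportion can be sketched in Python. The critical value comes from `statistics.NormalDist` rather than Table A-2 (so it is about 1.95996 instead of the tabled 1.96), and the survey counts (520 successes among 1000) are hypothetical.

```python
import math
from statistics import NormalDist

def proportion_ci(successes, n, confidence=0.95):
    """Return (p_hat - E, p_hat + E) for a population proportion."""
    p_hat = successes / n
    q_hat = 1 - p_hat
    alpha = 1 - confidence
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)  # z_{alpha/2}
    margin = z_crit * math.sqrt(p_hat * q_hat / n)
    return p_hat - margin, p_hat + margin

def round_sig(value, digits=3):
    """Round to a given number of significant digits."""
    if value == 0:
        return 0.0
    return round(value, digits - 1 - math.floor(math.log10(abs(value))))

lower, upper = proportion_ci(520, 1000)  # hypothetical sample
print(f"{round_sig(lower)} < p < {round_sig(upper)}")
```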

p = population proportion

p"hat" = sample proportion

n = number of sample values

E = desired margin of error

z_{alpha/2} = z score separating an area of alpha/2 in the right tail of the standard normal distribution

q"hat" = 1 - p"hat"

The sample must be a simple random sample of independent subjects:

When an estimate p"hat" is known:

n = [z_{alpha/2}]^{2}*p"hat"*q"hat"/E^{2}

When no estimate p"hat" is known:

n = [z_{alpha/2}]^{2}* 0.25/E^{2}
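Both sample size formulas fit in one small Python function. Rounding the result up to the next whole number is the usual convention but is an assumption here, since the notes do not state it; the margin of error of 0.03 and the estimate of 0.52 are hypothetical.

```python
import math
from statistics import NormalDist

def sample_size_proportion(margin, confidence=0.95, p_hat=None):
    """Sample size needed to estimate p within the given margin of error."""
    z_crit = NormalDist().inv_cdf(1 - (1 - confidence) / 2)  # z_{alpha/2}
    # Use 0.25 in place of p_hat * q_hat when no estimate is known.
    pq = 0.25 if p_hat is None else p_hat * (1 - p_hat)
    return math.ceil(z_crit ** 2 * pq / margin ** 2)  # round up (assumption)

print(sample_size_proportion(0.03))              # no estimate of p_hat
print(sample_size_proportion(0.03, p_hat=0.52))  # estimate known
```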

p"hat" = [(upper confidence interval limit) + (lower confidence interval limit)]/2

Sample is a simple random sample. Value of pop. standard deviation sigma is known. Either or both of these is satisfied: the pop. is normally distributed or n > 30. Confidence interval: xbar - E < mu < xbar + E where E = z_{alpha/2}*(sigma/square root of n)

mu = population mean; sigma = population standard deviation; xbar = sample mean; n= # of sample values; E = margin of error; z_{alpha/2} = z score separating an area of alpha/2 in the right tail of the standard normal distribution.
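As a sketch of the interval for mu when sigma is known, the following Python function computes E and the limits. The sample values (xbar = 172.5, sigma = 15, n = 40) are hypothetical, and the critical value is taken from `statistics.NormalDist` instead of Table A-2.

```python
import math
from statistics import NormalDist

def mean_ci_known_sigma(xbar, sigma, n, confidence=0.95):
    """Return (xbar - E, xbar + E) with E = z_{alpha/2} * sigma / sqrt(n)."""
    z_crit = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    margin = z_crit * sigma / math.sqrt(n)
    return xbar - margin, xbar + margin

lower, upper = mean_ci_known_sigma(172.5, 15, 40)  # hypothetical sample
print(f"{lower:.2f} < mu < {upper:.2f}")
```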

Sample size required to estimate the population mean mu

mu = population mean; sigma = population standard deviation; xbar = sample mean; E = desired margin of error; z_{alpha/2} = z score separating an area of alpha/2 in the right tail of the standard normal distribution. Requirement: The sample must be a simple random sample.

n = [(z_{alpha/2}*sigma)/E]^{2}
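The sample size formula for a mean, sketched in Python. Rounding up is the usual convention and is an assumption here; sigma = 15 and E = 3 are hypothetical.

```python
import math
from statistics import NormalDist

def sample_size_mean(sigma, margin, confidence=0.95):
    """n = [(z_{alpha/2} * sigma) / E]^2, rounded up (assumption)."""
    z_crit = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return math.ceil((z_crit * sigma / margin) ** 2)

print(sample_size_mean(15, 3))
```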

The sample mean xbar is the best point estimate of the population mean mu.

If sigma is not known, but the relevant requirements are satisfied, we use the Student t distribution (instead of a normal distribution).

If a population has a normal distribution, then the distribution of

t = (xbar - mu)/(s/square root of n)

is a Student t distribution for all samples of size n.

The number of degrees of freedom (df) for a collection of sample data is the number of sample values that can vary after certain restrictions have been imposed on all data values.

df = n-1

mu = population mean; xbar = sample mean; s = sample standard deviation; n = # of sample values; E = margin of error; t_{alpha/2} = critical t value separating an area of alpha/2 in the right tail of the distribution. Requirements: the sample is a simple random sample; either the sample is from a normally distributed population or n > 30.

xbar - E < mu < xbar + E where E = t_{alpha/2}(s/square root of n) (df = n-1)
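Python's standard library has no t distribution, so this sketch takes the critical value t_{alpha/2} as an input looked up from a t table. The sample summary (xbar = 30.9, s = 2.9, n = 15) is hypothetical; 2.145 is the tabled two-tailed 0.05 critical value for df = 14.

```python
import math

def t_interval(xbar, s, n, t_crit):
    """Return (df, E, lower, upper) using E = t_{alpha/2} * s / sqrt(n)."""
    df = n - 1  # degrees of freedom used for the table lookup
    margin = t_crit * s / math.sqrt(n)
    return df, margin, xbar - margin, xbar + margin

df, margin, lower, upper = t_interval(30.9, 2.9, 15, t_crit=2.145)
print(f"df = {df}, E = {margin:.3f}, {lower:.1f} < mu < {upper:.1f}")
```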

Use normal (z) distribution when sigma is known & normally distributed population or sigma is known and n > 30.

Use t distribution when sigma is not known & normally distributed population or sigma is not known & n > 30.

Use a nonparametric method or bootstrapping when the population is not normally distributed & n < or = 30.

Requirements: the sample is a simple random sample; the population must have normally distributed values (even if the sample is large).

sigma = population standard deviation; s = sample standard deviation; n = # of sample values; X^{2}_{L} = left-tailed critical value of X^{2}; sigma^{2} = population variance; s^{2} = sample variance; E = margin of error; X^{2}_{R} = right-tailed critical value of X^{2}

Regression equation

Regression line

Given a collection of paired sample data, the regression equation:

y(hat) = b_{0} + b_{1}*x

algebraically describes the relationship between the two variables x & y.

The graph of the regression equation is called the regression line (or line of best fit, or least-squares line).

The slope b_{1} in the regression equation represents the marginal change in y that occurs when x changes by one unit.

For a pair of sample x & y values, the residual is the difference between the observed sample value of y and the y value that is predicted by using the regression equation. That is,

residual = observed y - predicted y = y - y"hat"

The amount of the variation in y that is explained by the regression line. It is computed as

r^{2} = explained variation/total variation

s_{e}

A measure of the differences (or distances) between the observed sample y values & the predicted values y"hat" that are obtained using the regression equation. It is given as

$s_e = \sqrt{\dfrac{\sum (y - \hat{y})^{2}}{n - 2}}$

or

$s_e = \sqrt{\dfrac{\sum y^{2} - b_0 \sum y - b_1 \sum xy}{n - 2}}$
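The regression quantities above can be computed together in a few lines of Python. The least-squares formulas for b_{1} and b_{0} used here are the standard ones and are not written out in these notes, and the paired data are hypothetical.

```python
import math

def simple_regression(xs, ys):
    """Return b0, b1, residuals, r^2, and s_e for paired sample data."""
    n = len(xs)
    x_mean, y_mean = sum(xs) / n, sum(ys) / n
    # Standard least-squares slope and intercept (assumed, not from the notes)
    b1 = (sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
          / sum((x - x_mean) ** 2 for x in xs))
    b0 = y_mean - b1 * x_mean
    predicted = [b0 + b1 * x for x in xs]
    residuals = [y - y_hat for y, y_hat in zip(ys, predicted)]
    unexplained = sum(e ** 2 for e in residuals)
    total = sum((y - y_mean) ** 2 for y in ys)
    r_squared = (total - unexplained) / total  # explained / total variation
    s_e = math.sqrt(unexplained / (n - 2))
    return b0, b1, residuals, r_squared, s_e

xs = [1, 2, 3, 4, 5]  # hypothetical paired data
ys = [2, 4, 5, 4, 5]
b0, b1, residuals, r_squared, s_e = simple_regression(xs, ys)
# Shortcut form of s_e from the second formula
shortcut = math.sqrt((sum(y * y for y in ys) - b0 * sum(ys)
                      - b1 * sum(x * y for x, y in zip(xs, ys))) / (len(xs) - 2))
print(b0, b1, r_squared, s_e, shortcut)
```

The two forms of s_{e} agree up to floating-point rounding.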

(x_{1}, x_{2}, ..., x_{k})

R^{2} modified to account for the number of variables and the sample size.

H_{0}

A statement that the value of a population parameter (such as proportion, mean, or standard deviation) is equal to some claimed value. (The term null is used to indicate no change or no effect or no difference.)

Example: H_{0}: p = 0.5

We test the null hypothesis directly in the sense that we assume (or pretend) it is true & reach a conclusion to either reject it or fail to reject it.

H_{1} or H_{a}

A statement that the parameter has a value that somehow differs from the null hypothesis.
