Lec02: Descriptive Statistics
IOE 265 W10

Descriptive Statistics

Topics
I. Concept of Location and Dispersion
II. Measures of Location
III. Measures of Dispersion
IV. Box Plots

I. Location and Dispersion
Most common descriptive statistics measure either location or dispersion (variation).
Location ~ central tendency
Dispersion ~ spread of distribution
Classic example to demonstrate these concepts: outcomes of throwing darts.
On or off location (target)
Low or high dispersion

Lecture Exercise: Identify On/Off Target & High/Low Dispersion for each
[Figure: four dart patterns, A to D]
A. __________  B. __________
C. __________  D. __________

Target / dispersion analysis and general problem solving
First, address problems in order of importance.
Highest priority: address features that have the strongest cause-effect relationship with end-customer satisfaction.
Next, we typically try to reduce dispersion, then shift the mean to target as necessary to meet end-customer needs.
Stabilize the process, then center the process as necessary.

II. Measures of Location
Mean
Median
Trimmed Mean

Mean
Mean (also known as the average) is a measure of the center of a distribution.
Typical notation for the mean of a sample of data is $\bar{X}$; the Greek letter $\mu$ is used to represent the mean of a population.
$\bar{X} = \frac{X_1 + X_2 + \cdots + X_N}{N}$
Example: suppose five students take a test and their scores are 70, 68, 71, 69 and 98.
Mean = (70 + 68 + 71 + 69 + 98) / 5 = 75.2

Median
Median (also known as the 50th percentile) is the middle observation in a data set.
Rank the data set and select the middle value.
If the number of observations N is odd, the middle value is observation number (N + 1) / 2.
If N is even, the middle value is interpolated as midway between observation numbers N / 2 and (N / 2) + 1.
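The mean formula and the median ranking rule above can be sketched in a few lines of Python. This is only an illustration, not part of the lecture materials: the helper name `median_by_rank` and the variable names are assumptions made here, and the data are the five test scores from the mean example.

```python
def median_by_rank(values):
    """Median via the ranking rule (hypothetical helper, not from the lecture)."""
    ranked = sorted(values)          # rank the data set
    n = len(ranked)
    if n % 2 == 1:
        # Odd n: observation number (n + 1) / 2, counting from 1
        return ranked[(n + 1) // 2 - 1]
    # Even n: midway between observation numbers n / 2 and (n / 2) + 1
    return (ranked[n // 2 - 1] + ranked[n // 2]) / 2


scores = [70, 68, 71, 69, 98]

sample_mean = sum(scores) / len(scores)   # (X_1 + ... + X_N) / N
sample_median = median_by_rank(scores)

print("mean:", sample_mean)
print("median:", sample_median)
```

The standard library's `statistics.median` applies the same rule, so in practice the hand-written helper is only useful for seeing the odd/even cases spelled out.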
Prior data values, ranked: 68, 69, 70, 71, and 98. Median is 70.
If another student with a score of 60 were included, the new median would be 69.5, i.e. (69 + 70) / 2.

Mean Vs. Median
Which is a better measure of location for the following set of test scores?
68, 70, 69, 71, and 98
Mean = 75.2
Median = 70.0

Trimmed Mean
Trimmed mean is a compromise between the mean and the median.
10% Trimmed Mean
First, eliminate the smallest 10% of values and the largest 10% of values.
Then, re-compute the mean.
Trimmed means are gaining popularity: they are less sensitive to outliers than the mean, but not as robust as the median.

Trimmed Mean (Example from Devore Textbook)
Variable: life (hours) of incandescent lamps. Sample size = 20.
How many values will be trimmed in a 10% trimmed mean?
Mean = 965.0
Median = 1009.5
Trimmed Mean = 971.4
How are these values impacted by sample size, by distribution?
What might be some useful applications?

III. Measures of Dispersion
Range
Standard Deviation
Variance

Range
Range is the maximum value in a data set minus the minimum value.
Example: Test scores: 70, 68, 71, 69 and 98. Range = 98 - 68 = 30.
Note: the range is often preferred over the standard deviation for small data sets (e.g., if the sample has fewer than 10 observations).

Standard Deviation
Sample standard deviation (Std Dev), S, measures the dispersion of the individual observations from the mean.
For a sample data set, standard deviation is also referred to as the sample standard deviation or the root-mean-square, $S_{rms}$.
$S = \sqrt{\frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n - 1}}$
Units for S are the same as for the variable being analyzed. E.g., if we measure mpg, then S will be in mpg.

Why divide by n-1?
To correct an estimation error; we'll cover this in detail in Chapter 6 (point estimation theory).
What you should know now:
n - 1 is referred to as the "degrees of freedom".
Degrees of freedom (dof) are a measure of the amount of information from the sample data that has been used in estimating a sample statistic.
Every time a statistic is calculated from a sample, one degree of freedom is used up.
So, when we calculate the sample standard deviation, we divide by n - 1 because the sample mean ($\bar{X}$) has to be calculated first, and this calculation uses 1 dof.
$S = \sqrt{\frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n - 1}}$

Effects of Extreme Values
Test scores: 70, 68, 71, 69 and 98; sample standard deviation is 12.79.
If you exclude the score of 98, the sample standard deviation is reduced to 1.3!
Standard deviation may be severely influenced by extreme values in the sample data set. (Note: these values may not necessarily be mistakes.)
We may reduce the effect of any individual observation by increasing the sample size.

Variance
Variance is the square of the standard deviation.
It represents the average squared deviation of each observation from the sample mean (with n - 1 in the denominator).
$S^2 = \frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n - 1}$
Prior example, where standard deviation = 12.79:
Variance = $(12.79)^2 \approx 163.7$

Skewness
Some software packages provide skewness*.
Skewness is a measure of relative symmetry.
Zero indicates symmetry.
Positive skewness shows a long right tail.
Negative skewness shows a long left tail.
*Actual calculation outside scope of class.

Kurtosis
Some software packages provide kurtosis*.
Kurtosis (K) is a measure of the peakedness of a distribution (relative to normal).
K = 3: normal, bell-shaped distribution (mesokurtic). (Note: some software reports normal = 0.)
K < 3 (or negative relative to 0): flatter peak, fatter shoulders, shorter tails.
K > 3 (or positive relative to 0): more peaked than normal, with longer tails.
*Actual calculation outside scope of class.

Using Software to Calculate Descriptive Statistics
In practice, we rarely calculate statistics by hand. So, let us explore some useful Excel functions.
Mean      =average(array)
Median    =median(array)
Std Dev   =stdev(array)
Variance  =var(array)
Range     =max(array)-min(array)

Minitab Results
Of course, all advanced statistical software will automatically compute descriptive statistics.

Descriptive Statistics: Score
Variable   N   Mean   Median  TrMean  StDev  SE Mean
Score     16  82.78    83.50   83.32   9.17     2.29

Variable  Minimum  Maximum
Score       63.00    95.00

IV. Box Plots
[Figure: annotated box plot with mild and extreme outliers marked]
Median: 50th percentile.
First quartile (Q1) or lower fourth: Q1 ≈ 25th percentile.
Third quartile (Q3) or upper fourth: Q3 ≈ 75th percentile.
Fourth spread: $f_s$ = Q3 - Q1.
Upper limit: Q3 + 1.5 $f_s$. Lower limit: Q1 - 1.5 $f_s$.
Upper whisker: highest value within the upper limit.
Lower whisker: lowest value within the lower limit.
Mild outlier: more than 1.5 $f_s$ but no more than 3.0 $f_s$ beyond the nearest quartile.
Extreme outlier: more than 3.0 $f_s$ beyond the nearest quartile.

Box Plots: differences in notation/calculation
Minitab calculates quartiles (Q1, Q3).
Some textbooks (including Devore) refer to the lower fourth and upper fourth.
Roughly the same, but with some differences:
Lower fourth = median of the smallest n/2 observations (n even) or of the smallest (n+1)/2 observations (n odd).
Q1 = observation at position (n+1)/4 (if not an integer, then interpolate).
Upper fourth = median of the largest n/2 observations (n even) or of the largest (n+1)/2 observations (n odd).
Q3 = observation at position 3(n+1)/4 (if not an integer, then interpolate).

Box Plot Information
Box plot shows:
Location: line for the median.
Note: some software will also include a dot for the mean.
Dispersion: the box shows the 25th to 75th percentile range.
Departures from symmetry: one side of the box or one whisker can be longer than the other, suggesting a lack of symmetry.
Identification of mild and extreme outliers.

Box Plot - MPG Example
[Figure: box plot of MPG; horizontal axis MPG, 17 to 23]

Box Plots Vs. Histogram
Note: the wider box to the left of the median in the box plot suggests more spread to the left than to the right. A similar pattern is shown in the histogram.
[Figure: histogram of MPG (frequency vs. MPG) next to box plot of MPG; median = 20.1 in both]

Multiple Box Plot Example
For the MPG data, suppose you also collected data for tire pressure (grouped as normal or low).
Does this stratification variable help explain the bi-modal distribution?

Summary of concepts...
Most common descriptive statistics measure either location or dispersion (variation).
Location ~ central tendency (mean, median, trimmed mean)
Dispersion ~ spread of distribution (range, standard deviation and variance)
Extreme observations (or outliers) can have an important effect on some of these statistics.
Box plots are another graphical tool that can help us identify extreme observations and distribution shapes.
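As a rough companion to the summary, the sketch below recomputes the lecture's test-score example in Python and applies the box-plot limits using lower and upper fourths. The helper names (`trimmed_mean`, `fourths`, `classify_outliers`) are hypothetical, introduced only for illustration. Two assumptions are worth flagging: the trimmed mean here simply truncates the number of trimmed values to a whole number, and a 20% trim is used because a 10% trim removes nothing from a five-value sample.

```python
from statistics import mean, median, stdev, variance


def trimmed_mean(values, proportion):
    """Drop the smallest and largest `proportion` of values, then average the rest."""
    ranked = sorted(values)
    k = int(len(ranked) * proportion)   # count dropped from EACH end (truncated; an assumption)
    return mean(ranked[k:len(ranked) - k])


def fourths(values):
    """Lower and upper fourths: medians of the smallest and largest halves.

    For odd n, the overall median is included in both halves.
    """
    ranked = sorted(values)
    half = (len(ranked) + 1) // 2       # n/2 if n is even, (n+1)/2 if n is odd
    return median(ranked[:half]), median(ranked[-half:])


def classify_outliers(values):
    """Split observations into mild and extreme outliers using the fourth spread."""
    lower, upper = fourths(values)
    fs = upper - lower                  # fourth spread
    mild, extreme = [], []
    for x in values:
        distance = max(lower - x, x - upper)   # how far x lies outside the box
        if distance > 3.0 * fs:
            extreme.append(x)
        elif distance > 1.5 * fs:
            mild.append(x)
    return mild, extreme


scores = [70, 68, 71, 69, 98]

score_range = max(scores) - min(scores)
s = stdev(scores)        # sample standard deviation (divides by n - 1)
s2 = variance(scores)    # sample variance (divides by n - 1)

print("range:", score_range)
print("std dev:", round(s, 2))
print("variance:", round(s2, 1))
print("20% trimmed mean:", trimmed_mean(scores, 0.20))
print("fourths:", fourths(scores))
print("mild, extreme outliers:", classify_outliers(scores))
```

With these five scores, the score of 98 lies far more than 3.0 fourth spreads above the upper fourth, which is why it dominates the mean and the standard deviation while leaving the median and trimmed mean near 70.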