Sampling and Statistics

Statistics

We start the discussion in the natural way. We all have a general feeling about what statistics is. In the course of these lecture notes, we will lay out the details of what statistics is and how it is used. For now we give a quick definition.

Suppose we have information on the test scores of students enrolled in a statistics class. In statistical terminology, the whole set of numbers that represents the scores of the students is called a “data set”, each student is called an “element”, and the score of each student is called an “observation”.

Data: Information from observations, outcomes, responses, or measurements.

Example: A list of the prices of 25 recently sold homes, the scores of 15 students, and the ages of all employees of a company.

Statistics is the study of how to collect, organize, analyze, and interpret numerical information from data.

Broadly speaking, applied statistics can be divided into two areas: descriptive statistics and inferential statistics.

Descriptive statistics consists of methods for organizing, displaying, and describing data by using tables, graphs, and summary measures.

Suppose we have information about the percentage of adults who carry different numbers of plastic cards.

Number of cards     Percentage of adults
1 to 3              50
4 to 6              30
7 to 9              7
10 or more          13
Sum                 100

A data set in its original form is usually very large. Consequently, such a data set is not very helpful in drawing conclusions or making decisions. So we reduce the data to a manageable size by constructing tables, drawing graphs, or calculating summary measures such as the average. The portion of statistics that helps us do this type of statistical analysis is called descriptive statistics.

Inferential statistics consists of methods that use sample results to help make decisions or predictions about a population.

In statistics, the collection of all elements of interest is called a population.
The selection of a few elements from this population is called a sample. A major portion of statistics deals with making decisions, inferences, and predictions about populations based on results obtained from samples. For example, we may make decisions about the political views of all college and university students based on the political views of 1000 students selected from a few colleges and universities. The area of statistics that deals with such decision-making procedures is referred to as inferential statistics.

The collection of information from the elements of a population or sample is called a survey. A survey that includes every element of the target population is called a census. The technique of collecting information from a portion of the population is called a sample survey.

Sampling and Types of Data

Population vs. Sample

Typically, population data is very hard or even impossible to gather. Statisticians and researchers will instead extract data from a sample.

There are several types of data that are of interest. We can classify data into two types:

Numerical or quantitative data is data where the observations are numbers. For example: age, height, on a scale from one to ten..., distance, number of ...

Categorical or qualitative data is data where the observations are non-numerical. For example: favorite color, choice of politician, ...

Parameter vs. Statistic

A parameter is a numerical summary of the population, such as the mean, median, mode, range, variance, or standard deviation. A statistic is a numerical summary of a sample taken from the population.
More details in chapter 2.

A sample that represents the characteristics of the population as closely as possible is called a representative sample. For example, to find the average income of families living in New York City by conducting a sample survey, the sample must contain families that belong to different income groups in almost the same proportions as they exist in the population.

Random Samples

When we conduct a survey we always attempt to achieve a random sample. A simple random sample of size n is one in which every possible subset of size n has an equal chance of being selected. For example, to choose a random sample of 20 people with phone numbers, we can use a random number generator to randomly select 20 phone numbers.

Caution: A simple random sample is almost always impossible to achieve in the real world. For example, using the phone number generator, we will only be able to collect data from those who have a phone, pick up the phone, and are willing to participate in the phone survey. Because of this, most surveys have inherent flaws. However, a survey with a small flaw is better than no information.

Many surveys are done using convenience sampling. For example, a researcher stands outside a supermarket and interviews anyone eager to respond.

One way to overcome the problem of obtaining a random sample is to use stratified sampling (http://ltcconline.net/greenl/courses/201/projects/StratifiedSampling.htm). Stratified sampling ensures that members of each stratum (or type) are included in the survey. For example, we may randomly select 50 Caucasians, 25 Hispanics, and 10 Filipinos from the Lake Tahoe community to ensure that the three main ethnic groups are represented.

One problem with sampling is that often the researcher only gets respondents who are eager to be interviewed. One way to combat this is to use cluster sampling (http://ltcconline.net/greenl/courses/201/projects/cluster_sampling.htm).
This process involves breaking the population into several groups or clusters. Some of the clusters are randomly selected, and the researcher makes sure that every individual in the selected clusters is surveyed. This usually involves paying the respondents to take the survey.

A sample may be random or nonrandom. In a random sample, each element of the population has the same chance of being included in the sample. One way to select a random sample is by lottery or draw. A simple example is when a teacher puts each student's name on a slip of paper, places the slips in a hat, and then draws names from the hat without looking.

Variable

A variable is a characteristic under study that assumes different values for different elements. In contrast to a variable, the value of a constant is fixed. Examples of variables are the incomes of households and the makes of cars owned by people. A variable is often denoted by x, y, or z.

Some variables can be measured numerically, whereas others cannot. A variable that can assume numerical values is called a quantitative variable. The values that a certain quantitative variable can assume may be countable or non-countable.

For a quantitative variable, the key features to describe are the center and the spread (variability) of the data. For example, what is a typical amount of precipitation? Is there much variation from year to year?

For example, we can count the number of cars owned by a family (discrete variable). However, we cannot count the heights of family members (continuous variable).

Discrete Variable: A variable whose values are countable.

Continuous Variable: A variable that can assume any numerical value over a certain interval or intervals.

Continuous: (Length, Age, Height, Weight, Time)
Discrete: (Number of: Houses, Cars, Accidents)

Variables that cannot be measured numerically but can be divided into different categories are called qualitative or categorical variables.
Gender: (Male and Female)
Religious affiliation: (Catholic, Jewish, Muslim, Other, None)

For a categorical variable, the key feature to describe is the relative number of observations in each category, for example as percentages.

Bar Charts, Frequency Distributions, and Histograms

All of us have heard the saying "a picture is worth a thousand words." A graphical display can reveal at a glance the main characteristics of a data set. The bar graph and the pie chart are the two types of graphs used to display qualitative data.

Frequency Distributions, Bar Graphs, and Circle Graphs (Pie Charts)

The frequency of a particular event is the number of times that the event occurs. The relative frequency is the proportion of observed responses in the category.

Example: We asked the students what country their car is from (or no car) and made a tally of the answers. Then we computed the frequency and relative frequency of each category. The relative frequency is computed by dividing the frequency by the total number of respondents. The following table summarizes the results.

Country     Frequency     Relative Frequency
US          6             0.30
Japan       7             0.35
Europe      2             0.10
Korea       1             0.05
None        4             0.20
Total       20            1.00

\[ \text{Relative frequency} = \frac{\text{Frequency}}{\text{Total number of respondents}} \]

For example: 6/20 = 0.3, 7/20 = 0.35, 2/20 = 0.1, and so on.

Below is a bar graph for the car data. The height of each bar represents the frequency. Notice that the widths of the bars are always the same.

[Figure: bar graph of the car data, http://ltcconline.net/greenl/courses/201/descstat/histogram.gif]

NOTE: A Pareto chart is a special type of bar graph. It is a bar graph with categories ordered by their frequency, from the tallest bar to the shortest bar.

We make a circle graph, often called a pie chart, of this data by placing wedges in the circle with sizes proportional to the frequencies. Below is a circle graph that shows this data.
[Figure: circle graph (pie chart) of the car data, http://ltcconline.net/greenl/courses/201/descstat/hist.h2.gif]

To find the angle of each slice we use the formula

\[ \text{Angle} = \frac{\text{Frequency}}{\text{Total}} \times 360 \]

For example, to find the angle for US cars we have

\[ \text{Angle} = \frac{6}{20} \times 360 = 108 \text{ degrees} \]

Graph for Quantitative Variables

Data can be displayed in a histogram, a dot plot, or a stem-and-leaf plot.

Dot Plots

A dot plot shows a dot for each observation, placed just above the value on the number line for that observation.

To construct a dot plot:
Draw a horizontal line.
Label it with the name of the variable, and mark regular values of the variable on it.
For each observation, place a dot above its value on the number line.

The dot plot portrays the individual observations. The number of dots above a value on the number line represents the frequency of occurrence of that value. From a dot plot, we would be able to reconstruct (at least approximately) all the data in the sample.

Example 1: 2, 3, 3, 6, 7, 7, 7, 7, 8, 8, 8, 8, 8, 9, and 9

                        *
                     *  *
                     *  *
         *           *  *  *
      *  *        *  *  *  *
----------------------------
0  1  2  3  4  5  6  7  8  9

Histograms

Histograms are bar graphs whose vertical coordinate is the frequency count and whose horizontal coordinate corresponds to a numerical interval.

Example: The depth of clarity of Lake Tahoe was measured at several different places, with the results in inches as follows:

15.4, 16.7, 16.9, 17.0, 20.2, 25.3, 28.8, 29.1, 30.4, 34.5, 35.2, 36.7, 39.1, 39.4, 39.6, 39.8, 40.1, 42.3, 43.5, 45.6, 45.9, 48.3, 48.5, 48.7, 49.0, 49.1, 49.3, 49.5, 50.1, 50.2, 52.3

We use a frequency distribution table with class intervals of length 5.
Class Interval     Frequency     Relative Frequency     Cumulative Relative Frequency
15 -< 20           4             0.129                  0.129
20 -< 25           1             0.032                  0.161
25 -< 30           3             0.097                  0.258
30 -< 35           2             0.065                  0.323
35 -< 40           6             0.194                  0.516
40 -< 45           3             0.097                  0.613
45 -< 50           9             0.290                  0.903
50 -< 55           3             0.097                  1.000
Total              31            1.000

Below is the graph of the histogram.

The Shape of a Histogram

A histogram is unimodal if there is one hump, bimodal if there are two humps, and multimodal if there are many humps. A histogram is called skewed if it is not symmetric.

[Figure: Unimodal, Symmetric, Nonskewed, http://ltcconline.net/greenl/courses/201/descstat/symHist.gif]

[Figure: Non-symmetric, Skewed Right, http://ltcconline.net/greenl/courses/201/descstat/SkewHist.gif]

[Figure: Bimodal, http://ltcconline.net/greenl/courses/201/descstat/BimodalHist.gif]

Descriptive Statistics and Stem and Leaf Diagrams

Stem and Leaf Diagrams

When we want to see the shape of a data set without losing the individual data points, we use a stem and leaf diagram. To construct a stem and leaf diagram, we put the first digit or digits (the stem) on the left and that stem's corresponding list of final digits (the leaves) on the right. We can also split each stem into a low half (leaves 0 to 4) and a high half (leaves 5 to 9).

To compare two data sets, a comparative stem and leaf diagram is often used. The stems are drawn in the middle, the leaves of the first data set on the right, and the leaves of the second data set on the left.

Example

A computer retailer collected data on the number of computers sold during 29 consecutive Saturdays during the year. The results are as follows:

12, 14, 14, 17, 21, 24, 24, 25, 25, 26, 26, 27, 29, 31, 34, 35, 36, 39, 40, 42, 42, 45, 46, 47, 49, 49, 56, 59, 62

We can put this data into a stem and leaf diagram as shown below.
The first digit represents the stem and the second digit represents the leaf. The stem is written on the left hand side (once per value) and the leaf is written on the right hand side next to the corresponding stem.

1| 2 4 4 7
2| 1 4 4 5 5 6 6 7 9
3| 1 4 5 6 9
4| 0 2 2 5 6 7 9 9
5| 6 9
6| 2

It is easy to see the shape of the distribution without losing any of the individual data. To read the stem and leaf diagram, note for example that the first row corresponds to the values 12, 14, 14, and 17.

Cross-Section vs. Time-Series Data

Cross-section data contain information on different elements of a population or sample for the same period of time.

Example: The following table shows the 1998 earnings of six celebrities.

Celebrity            1998 Earnings (millions of dollars)
Jerry Seinfeld       267
Steven Spielberg     175
Oprah Winfrey        125
Michael Jordan       69
Master P.            56.5
Eddie Murphy         47.5

Time-series data contain information on the same element of a population or sample for different periods of time.

Example: The following table shows the average salaries of all major league baseball players for the years 1995 through 1999.

Year     Average Salary
1995     $1,094,440
1996     $1,101,455
1997     $1,314,420
1998     $1,384,530
1999     $1,567,873

Mean, Mode, Median, and Standard Deviation

The Mean and Mode

The sample mean is the average and is computed as the sum of all the observed outcomes from the sample divided by the total number of observations. We use \( \bar{x} \) as the symbol for the sample mean. In math terms,

\[ \bar{x} = \frac{\sum x}{n} \]

where n is the sample size and the x correspond to the observed values.

Example

Suppose you randomly sampled six acres in the Desolation Wilderness for a non-indigenous weed and came up with the following counts of this weed in this region:

34, 43, 81, 106, 106 and 115

We compute the sample mean by adding the counts and dividing by the number of observations, 6.
\[ \bar{x} = \frac{34 + 43 + 81 + 106 + 106 + 115}{6} \approx 80.83 \]

We can say that the sample mean count of the non-indigenous weed is 80.83.

The mode of a set of data is the number with the highest frequency. In the above example 106 is the mode, since it occurs twice and the rest of the outcomes occur only once.

The population mean is the average of the entire population and is usually impossible to compute. We use the Greek letter \( \mu \) for the population mean.

Median

One problem with using the mean is that it often does not depict the typical outcome. If there is one outcome that is very far from the rest of the data, then the mean will be strongly affected by this outcome. Such an outcome is called an outlier.

An alternative measure is the median. The median is the middle score. If we have an even number of observations, we take the average of the two middle values. When the data contain outliers, the median is better for describing the typical value. It is often used for income and home prices.

Example

Suppose you randomly selected 10 house prices in the South Lake Tahoe area. You are interested in the typical house price. In units of $100,000 the prices were

2.7, 2.9, 3.1, 3.4, 3.7, 4.1, 4.3, 4.7, 4.7, 40.8

If we computed the mean, we would say that the average house price is $744,000. Although this number is correct, it does not reflect the price of available housing in South Lake Tahoe. A closer look at the data shows that the house valued at 40.8 x $100,000 = $4.08 million skews the data. Instead, we use the median. Since there is an even number of outcomes, we take the average of the middle two:

\[ \frac{3.7 + 4.1}{2} = 3.9 \]

The median house price is $390,000. This better reflects what house shoppers should expect to spend.

Example: At a ski rental shop, data was collected on the number of rentals on each of ten consecutive Saturdays:

44, 50, 38, 96, 42, 47, 40, 39, 46, 50.

To find the sample mean, add them and divide by 10:

\[ \frac{44 + 50 + 38 + 96 + 42 + 47 + 40 + 39 + 46 + 50}{10} = 49.2 \]

Notice that the mean value is not a value of the sample.
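The mean, mode, and median calculations in the examples above can be checked with a few lines of Python. This is only a minimal sketch using the standard library's statistics module; the variable names are illustrative and not part of the notes.

```python
from statistics import mean, median, mode

# Data copied from the three examples above.
weed_counts = [34, 43, 81, 106, 106, 115]
house_prices = [2.7, 2.9, 3.1, 3.4, 3.7, 4.1, 4.3, 4.7, 4.7, 40.8]  # units of $100,000
ski_rentals = [44, 50, 38, 96, 42, 47, 40, 39, 46, 50]

# Sample mean: sum of the observations divided by the sample size.
weed_mean = mean(weed_counts)
# Mode: the value that occurs most often.
weed_mode = mode(weed_counts)

# One very large price pulls the mean up, while the median barely moves.
house_mean = mean(house_prices)
house_median = median(house_prices)

rental_mean = mean(ski_rentals)

print("weed mean:", round(weed_mean, 2), "mode:", weed_mode)
print("house mean:", round(house_mean, 2), "median:", round(house_median, 2))
print("rental mean:", rental_mean)
```

Comparing the printed house mean and house median shows how a single extreme value separates the two measures.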
To find the median, first sort the data:

38, 39, 40, 42, 44, 46, 47, 50, 50, 96

Notice that there are two middle numbers, 44 and 46. To find the median we take the average of the two.

\[ \text{Median} = \frac{44 + 46}{2} = 45 \]

Notice also that the mean is larger than all but three of the data points. The mean is influenced by outliers while the median is robust.

Outlier: An outlier is an observation that falls well above or well below the overall bulk of the data.

Example 1: 22, 34, 68, 75, 79, 79, 81, 83, 84, 87, 90, 92, 96, and 156
156 is an outlier.

Example 2: 5, 34, 68, 75, 79, 79, 81, 83, 84, 87, 90, 92, 96, and 99
5 is an outlier.

Range

The range is the difference between the largest and smallest values in a set of data.

For example: 1, 3, 4, 5, 5, 6, 7, 11
Range = 11 - 1 = 10

Variance, Standard Deviation and Coefficient of Variation

The mean, mode, median, and trimmed mean do a nice job of telling where the center of the data set is, but often we are interested in more. For example, a pharmaceutical engineer develops a new drug that regulates iron in the blood. Suppose she finds out that the average iron content after taking the medication is the optimal level. This does not mean that the drug is effective. There is a possibility that half of the patients have dangerously low iron content while the other half have dangerously high content. Instead of the drug being an effective regulator, it is a deadly poison. What the engineer needs is a measure of how far the data is spread apart. This is what the variance and standard deviation do. First we show the formulas for these measurements. Then we will go through the steps on how to use the formulas.

We define the variance to be

\[ s^2 = \frac{\sum (x - \bar{x})^2}{n - 1} \]

and the standard deviation to be

\[ s = \sqrt{\frac{\sum (x - \bar{x})^2}{n - 1}} \]

Variance and Standard Deviation: Step by Step

Calculate the mean, \( \bar{x} \).
Write a table that subtracts the mean from each observed value.
Square each of the differences.
Add this column.
Divide by n - 1, where n is the number of items in the sample. This is the variance.
To get the standard deviation, take the square root of the variance.

Example

The owner of the Ches Tahoe restaurant is interested in how much people spend at the restaurant. He examines 10 randomly selected receipts for parties of four and writes down the following data.

44, 50, 38, 96, 42, 47, 40, 39, 46, 50

He calculates the mean by adding and dividing by 10 to get

\[ \bar{x} = 49.2 \]

Below is the table for getting the standard deviation:

x         x - 49.2     (x - 49.2)^2
44        -5.2         27.04
50        0.8          0.64
38        -11.2        125.44
96        46.8         2190.24
42        -7.2         51.84
47        -2.2         4.84
40        -9.2         84.64
39        -10.2        104.04
46        -3.2         10.24
50        0.8          0.64
Total                  2599.60

Now

\[ \frac{2599.6}{10 - 1} \approx 288.8 \]

Hence the variance is approximately 289 and the standard deviation is approximately the square root of 289, which is 17.

What this means is that most of the patrons probably spend between $32.20 and $66.20.

The sample standard deviation will be denoted by s and the population standard deviation will be denoted by the Greek letter \( \sigma \). The sample variance will be denoted by \( s^2 \) and the population variance will be denoted by \( \sigma^2 \).

The variance and standard deviation describe how spread out the data is. If the data all lies close to the mean, then the standard deviation will be small, while if the data is spread out over a large range of values, s will be large. Having outliers will increase the standard deviation.

One of the flaws of the standard deviation is that it depends on the units that are used.
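The step-by-step variance calculation for the restaurant receipts can be sketched in Python as follows. This is a minimal illustration that mirrors the steps listed above, not a prescribed method; the variable names are illustrative, and the last lines cross-check the result against the standard library's statistics.variance and statistics.stdev, which also divide by n - 1.

```python
from math import sqrt
from statistics import stdev, variance

receipts = [44, 50, 38, 96, 42, 47, 40, 39, 46, 50]

n = len(receipts)
x_bar = sum(receipts) / n                     # calculate the mean
deviations = [x - x_bar for x in receipts]    # subtract the mean from each value
squared = [d ** 2 for d in deviations]        # square each difference
sum_sq = sum(squared)                         # add the column
sample_variance = sum_sq / (n - 1)            # divide by n - 1
sample_sd = sqrt(sample_variance)             # square root gives the standard deviation

# Print the working table, then the summary values.
for x, d, sq in zip(receipts, deviations, squared):
    print(f"{x:>4} {d:>8.1f} {sq:>10.2f}")
print("sum of squares:", round(sum_sq, 2))
print("variance:", round(sample_variance, 2))
print("standard deviation:", round(sample_sd, 2))

# Cross-check against the library implementations.
assert abs(sample_variance - variance(receipts)) < 1e-9
assert abs(sample_sd - stdev(receipts)) < 1e-9
```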
One way of handling this difficulty is the coefficient of variation, which is the standard deviation divided by the mean, times 100%:

\[ CV = \frac{s}{\bar{x}} \times 100\% \]

In the above example, it is

\[ \frac{17}{49.2} \times 100\% \approx 34.6\% \]

This tells us that the standard deviation of the restaurant bills is 34.6% of the mean.

Chebyshev's Theorem

A mathematician named Chebyshev came up with bounds on how much of the data must lie close to the mean. In particular, for any k > 1, the proportion of the data that lies within k standard deviations of the mean is at least

\[ 1 - \frac{1}{k^2} \]

For example, if k = 2 this number is

\[ 1 - \frac{1}{2^2} = 0.75 \]

This tells us that at least 75% of the data lies within 2 standard deviations of the mean. In the above example, we can say that at least 75% of the diners spent between

49.2 - 2(17) = 15.2 and 49.2 + 2(17) = 83.2 dollars.

Mean and Standard Deviation for Grouped Data

Calculating the Mean from a Frequency Distribution

Since calculating the mean and standard deviation is tedious, we can save some of this work when we have a frequency distribution. Suppose we were interested in how many children are in statistics students' families. We come up with the frequency distribution table below.

Number of Children     1     2     3     4     5     6     7
Frequency              5     12    8     3     0     0     1

Notice that since there are 29 respondents, calculating the mean directly would be very tedious. Instead, we see that there are five ones, 12 twos, 8 threes, 3 fours, and 1 seven. Hence the total count of children is

1(5) + 2(12) + 3(8) + 4(3) + 7(1) = 72

Now divide by the number of respondents to get the mean.

\[ \bar{x} = \frac{72}{29} \approx 2.5 \]

Extending the Frequency Distribution Table

Just as with the mean formula, there is an easier way to compute the standard deviation given a frequency distribution table.
We extend the table as follows:

Number of Children (x)     Frequency (f)     xf     x^2 f
1                          5                 5      5
2                          12                24     48
3                          8                 24     72
4                          3                 12     48
5                          0                 0      0
6                          0                 0      0
7                          1                 7      49
Totals                     29                72     222

That is, \( \sum f = 29 \), \( \sum xf = 72 \), and \( \sum x^2 f = 222 \).

Next we calculate

\[ \sum x^2 f - \frac{(\sum xf)^2}{n} = 222 - \frac{72^2}{29} \approx 43.24 \]

Now finally apply the formula

\[ s = \sqrt{\frac{\sum x^2 f - (\sum xf)^2 / n}{n - 1}} \]

to get

\[ s = \sqrt{\frac{43.24}{28}} \approx 1.24 \]

Weighted Averages

Sometimes instead of the simple mean, we want to weight certain outcomes higher than others. For example, for your statistics class, the following point values are given:

Homework = 150
Midterm = 450
Project = 100
Final = 300

Suppose that you received an 84% on your homework, a 96% on your midterms, a 98% on your project, and a 78% on your final. What is your average for your class? To compute the weighted average, we use the formula

\[ \bar{x}_w = \frac{\sum x w}{\sum w} \]

We have

\[ \sum w = 150 + 450 + 100 + 300 = 1000 \]

and

\[ \sum x w = 0.84(150) + 0.96(450) + 0.98(100) + 0.78(300) = 126 + 432 + 98 + 234 = 890 \]

Now divide to get your weighted average:

\[ \frac{890}{1000} = 0.89 \]

Your weighted average is 89%, just short of 90%.

Percentiles and Box Plots

Percentiles

We saw that the median splits the data so that half lies below the median. Often we are interested in the percent of the data that lies below an observed value.

We call the rth percentile the value such that r percent of the data fall at or below that value.

Example

If you score in the 75th percentile, then 75% of the population scored at or below your score.

Example

Suppose the test scores were

22, 34, 68, 75, 79, 79, 81, 83, 84, 87, 90, 92, 96, and 99

If your score was 75, in what percentile did you score?

Solution

There were 14 scores reported and there were 4 scores at or below yours. We divide

\[ \frac{4}{14} \times 100\% \approx 29\% \]

So you scored in the 29th percentile.

There are special percentiles that deserve recognition.
The second quartile (Q2) is the median, or the 50th percentile.
The first quartile (Q1) is the median of the data that falls below the median. This is the 25th percentile.
The third quartile (Q3) is the median of the data falling above the median. This is the 75th percentile.

We define the interquartile range as the difference between the third and the first quartile:

IQR = Q3 - Q1

Range: The range is the difference between the largest and the smallest observations.

Example: Range = 99 - 22 = 77

Box Plots

Another way of representing data is with a box plot, which displays the five-number summary (minimum value, first quartile Q1, median, third quartile Q3, and maximum value).

To construct a box plot we do the following:
Draw a rectangular box whose bottom is the lower quartile (25th percentile) and whose top is the upper quartile (75th percentile).
Draw a horizontal line segment inside the box to represent the median.
Extend vertical line segments ("whiskers") from each end of the box out to the most extreme observations.

Example: Suppose the test scores were

22, 34, 68, 75, 79, 79, 81, 83, 84, 87, 90, 92, 96, and 99

Minimum value = 22
First quartile Q1 = 75 (the median of the 7 smallest observations)
Median Q2 = 82 (the median of 14 values is the average of the 7th and 8th observations)
Third quartile Q3 = 90 (the median of the 7 largest observations)
Maximum value = 99

[Figure: box plot of the test scores with Minimum Value = 22, Q1 = 75, Median = 82, Q3 = 90, Maximum Value = 99]
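The five-number summary, interquartile range, range, and percentile calculation for the test scores can be sketched in Python. This is a minimal illustration of the rule used in these notes (quartiles as medians of the lower and upper halves). The function names five_number_summary and percent_at_or_below are illustrative, and leaving the middle value out of both halves when the number of observations is odd is an assumption, since the notes only show an even-sized example.

```python
from statistics import median

scores = [22, 34, 68, 75, 79, 79, 81, 83, 84, 87, 90, 92, 96, 99]

def five_number_summary(data):
    """Return (minimum, Q1, median, Q3, maximum) using medians of the halves."""
    ordered = sorted(data)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]
    # Assumption: for an odd count, the middle value is left out of both halves.
    upper = ordered[half:] if n % 2 == 0 else ordered[half + 1:]
    return ordered[0], median(lower), median(ordered), median(upper), ordered[-1]

def percent_at_or_below(data, value):
    """Percent of observations that fall at or below the given value."""
    return 100 * sum(1 for x in data if x <= value) / len(data)

minimum, q1, q2, q3, maximum = five_number_summary(scores)
iqr = q3 - q1
data_range = maximum - minimum
rank_of_75 = percent_at_or_below(scores, 75)

print("five-number summary:", minimum, q1, q2, q3, maximum)
print("IQR:", iqr, "range:", data_range)
print("percent at or below 75:", round(rank_of_75))
```

Library routines such as statistics.quantiles use interpolation rules, so they may return slightly different quartiles than the median-of-halves rule shown here.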