High School: Statistics and Probability
Interpreting Categorical and Quantitative Data HSS-ID.A.3
3. Interpret differences in shape, center, and spread in the context of the data sets, accounting for possible effects of extreme data points (outliers).
Students should already know that in order to describe and compare data sets, we need to define the center and the spread of the data. These summaries are good first steps, but there is one more feature of the data that can give another clue and help us compare data sets: the shape of the data.
Students should know that mound-shaped distributions can be symmetric or skewed left or right, and that a normal distribution is a mound-shaped, symmetric curve. Students should also know how the mean and median relate to symmetric and skewed distributions. If the mean is greater than the median, the data is typically skewed right, because a long right tail pulls the mean up. If the mean is less than the median, the data is typically skewed left. If the mean and median are approximately equal, the data is roughly symmetric.
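The mean-versus-median comparison above can be sketched in a few lines of Python. This is only an illustration of the rule of thumb: the function name `skew_direction` and the three small data sets are made up for this example and are not part of the standard.

```python
from statistics import mean, median

def skew_direction(values):
    """Classify shape with the mean-vs-median rule of thumb (hypothetical helper)."""
    m, med = mean(values), median(values)
    if m > med:
        return "skewed right"
    if m < med:
        return "skewed left"
    return "symmetric"

# Illustrative data sets, invented for this sketch.
right_tail = [1, 2, 2, 3, 3, 3, 4, 20]        # one large value pulls the mean up
left_tail = [1, 17, 18, 18, 19, 19, 19, 20]   # one small value pulls the mean down
balanced = [1, 2, 3, 4, 5]                    # mean and median are both 3

for name, values in [("right_tail", right_tail),
                     ("left_tail", left_tail),
                     ("balanced", balanced)]:
    print(name, mean(values), median(values), skew_direction(values))
```

With real data the mean and median are rarely exactly equal, so in practice students would judge whether the two are close rather than test for strict equality.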
Students should realize that the shape of the data helps us find and identify outliers. An outlier is something that sticks out from the rest of the data, like an egg with two yolks. It's a data point that makes you furrow your brow and wonder if you measured wrong. Formally, an outlier is a data point that has an "extreme value" when compared with the rest of the data set.
Mathematically speaking, an outlier is defined as any point that falls more than 1.5 times the interquartile range (IQR) below the lower quartile or more than 1.5 times the IQR above the upper quartile. To visualize what this means, we can use a box plot with the data below. First, we sort the data from smallest to largest to find the lower quartile (Q1), median, and upper quartile (Q3).
Data: 37, 37, 38, 38, 40, 40, 42, 42, 42, 62
The median is 40.
Q1 = 38
Q3 = 42
Therefore, IQR = Q3 – Q1 = 42 – 38 = 4.
The box plot then looks like this:
Since IQR = 4, the lower limit on outliers is Q1 – 1.5 × IQR = 38 – 1.5 × 4 = 32 and the upper limit on outliers is Q3 + 1.5 × IQR = 42 + 1.5 × 4 = 48. We can add these as vertical lines in the box plot.
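For teachers who want to check the arithmetic, here is a minimal Python sketch of the same calculation. It assumes the quartiles are found as the medians of the lower and upper halves of the sorted data, which matches the values used here; other quartile conventions (including those in some calculators and software) can give slightly different results. The function names `quartiles` and `outlier_limits` are made up for this example.

```python
from statistics import median

def quartiles(values):
    """Return (Q1, median, Q3) using the median-of-halves method."""
    data = sorted(values)
    n = len(data)
    half = n // 2
    lower = data[:half]
    # For an odd count, leave the middle value out of both halves.
    upper = data[half:] if n % 2 == 0 else data[half + 1:]
    return median(lower), median(data), median(upper)

def outlier_limits(values):
    """Return the (lower, upper) limits: Q1 - 1.5*IQR and Q3 + 1.5*IQR."""
    q1, _, q3 = quartiles(values)
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr

data = [37, 37, 38, 38, 40, 40, 42, 42, 42, 62]
q1, med, q3 = quartiles(data)
low, high = outlier_limits(data)
outliers = [x for x in data if x < low or x > high]

print("Q1, median, Q3:", q1, med, q3)
print("Outlier limits:", low, high)
print("Outliers:", outliers)
```

Run on the data above, this reproduces Q1 = 38, Q3 = 42, and limits of 32 and 48, and it flags 62 as the only point outside those limits.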
We can see that 62 is an outlier because it lies above the upper limit of 48. When there is an outlier on one side of the data set, we can end the whisker at the most extreme data point that is still within the limit and then plot each outlier as an individual point. So, the final box plot for this data set would look like this:
Students should understand that removing this outlier changes the mean noticeably (from 41.8 to about 39.6) but leaves the median at 40. The presence or absence of outliers may make either the mean or the median more representative of the center of the data, and students should be able to choose which is preferable for a given data set. They should also be able to identify outliers by calculating the limits based on the IQR, and give reasonable explanations for why outliers might exist within a particular context.
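The effect of the outlier on the two measures of center can be checked with a short Python sketch using the data set from above. The variable name `trimmed` is just a label for the data with 62 removed.

```python
from statistics import mean, median

data = [37, 37, 38, 38, 40, 40, 42, 42, 42, 62]
trimmed = [x for x in data if x != 62]  # same data with the outlier removed

print("With outlier:    mean =", mean(data), " median =", median(data))
print("Without outlier: mean =", round(mean(trimmed), 2), " median =", median(trimmed))
```

The mean drops by more than two points when 62 is removed, while the median does not move, which is why the median is often described as resistant to outliers.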
Here's a video resource that teachers can use to help explain the normal distribution curve.