 # The Mean sometimes can be very mean!

Author: Akhilesh Krishnan, SAS Programmer, Genpro Research

The arithmetic mean, commonly called the average, is a very popular measure of central value. It summarizes a dataset in a single number, which makes it easy to compare one dataset with another. The statistical nature of the arithmetic mean provides a means for drawing conclusions about the population or process from which the data originated. As a statistical concept, the arithmetic mean uses a single quantity to represent, locate, qualify, describe and interpret a dataset.

But there are situations in which the arithmetic mean produces results that make little sense. Suppose a survey records the number of children of women in a particular age category, and the average number of children comes out as 2.6. How can this be? Nobody can have 2.6 children! This is an example where the mean can be misleading. The natural question is how to remedy this, and the answer lies in other measures of central tendency, such as the median and the mode. In this example the median comes into the picture: it is the middle value of the ordered observations, so half of the observations lie at or below it and half lie at or above it. The median number of children per woman in this case was 2; that is, 50% of the women have 0, 1 or 2 children and 50% have 2, 3 or 4 children. Hence the median is more meaningful in this example.
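To make the survey example concrete, the short Python sketch below uses a made-up sample of 20 women. The counts are hypothetical, chosen only so that the mean is 2.6 and the median is 2; they are not from a real survey.

```python
from statistics import mean, median

# Hypothetical survey data: number of children for 20 women.
# These values are invented for illustration only.
children = [0, 1, 2, 2, 2, 2, 2, 2, 2, 2,
            2, 3, 3, 3, 4, 4, 4, 4, 4, 4]

avg = mean(children)    # 52 / 20 = 2.6, a value no woman can actually have
mid = median(children)  # middle of the ordered data

print(f"mean   = {avg}")
print(f"median = {mid}")
```

The mean lands between two possible counts, while the median is a value that women in the sample actually report.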

Then why not always stick to the median? The quick answer is that the mean, together with the standard deviation, can be used to answer a lot of questions, especially when dealing with real data. Suppose, for instance, there is a dataset consisting of heights from a sample of 15-year-old boys. By the empirical rule, if the data are approximately normally distributed, about 95% of the observations lie within two standard deviations of the mean. With the mean and standard deviation, many questions can be answered, such as: What height is exceeded by only 5% of boys? What proportion of boys are more than 5 feet tall? Between which heights do the middle 50% of the boys fall?
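As a rough sketch of how such questions are answered, the Python snippet below assumes that heights follow a normal distribution with a mean of 67 inches and a standard deviation of 3 inches. Both numbers, and the normal model itself, are assumptions made for illustration; in practice they would be estimated from the sample.

```python
from statistics import NormalDist

# Assumed (hypothetical) parameters, in inches, for heights of 15-year-old boys.
heights = NormalDist(mu=67.0, sigma=3.0)

# Height exceeded by only 5% of boys: the 95th percentile.
top_5_cutoff = heights.inv_cdf(0.95)

# Proportion of boys taller than 5 feet (60 inches).
taller_than_5ft = 1.0 - heights.cdf(60.0)

# Heights between which the middle 50% of boys fall: the two quartiles.
q1 = heights.inv_cdf(0.25)
q3 = heights.inv_cdf(0.75)

print(f"Only 5% of boys are taller than {top_5_cutoff:.1f} in")
print(f"Proportion taller than 5 ft: {taller_than_5ft:.3f}")
print(f"Middle 50% of heights: {q1:.1f} in to {q3:.1f} in")
```

None of these answers can be read off the median alone, which is why the mean and standard deviation remain so useful when a distributional model is reasonable.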

Along with the above-mentioned advantages and disadvantages, another obstacle that the arithmetic mean faces is the presence of outliers in the dataset. Outliers are observations that differ significantly from the other observations. They mainly arise from variability in measurement or from experimental error. Their effect is easy to see in the analysis of variance (ANOVA), the method used for comparing several means. If outliers are present in the dataset, they tend to inflate the estimate of the sample variance, which decreases the calculated F statistic for the ANOVA and lowers the chance of rejecting the null hypothesis. For this reason, outliers are often removed before the analysis.
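The effect of an outlier on the F statistic can be checked numerically. The sketch below computes the one-way ANOVA F statistic by hand for three small groups, first with clean data and then with a single value replaced by an outlier. The data and the helper name `one_way_f` are hypothetical and used only for illustration.

```python
from statistics import mean

def one_way_f(groups):
    """Return the one-way ANOVA F statistic for a list of groups."""
    all_values = [x for g in groups for x in g]
    grand_mean = mean(all_values)
    k = len(groups)       # number of groups
    n = len(all_values)   # total number of observations
    # Variation of the group means around the grand mean.
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Variation of the observations around their own group mean.
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)
    return (ss_between / (k - 1)) / (ss_within / (n - k))

# Hypothetical measurements for three groups.
clean = [
    [10, 11, 12, 11, 10],
    [13, 14, 15, 14, 13],
    [16, 17, 18, 17, 16],
]
# Same data, but the last value of the first group is replaced by an outlier.
with_outlier = [
    [10, 11, 12, 11, 30],
    [13, 14, 15, 14, 13],
    [16, 17, 18, 17, 16],
]

f_clean = one_way_f(clean)
f_outlier = one_way_f(with_outlier)

print(f"F without outlier: {f_clean:.2f}")
print(f"F with outlier:    {f_outlier:.2f}")
```

In this made-up example a single extreme value inflates the within-group variance so much that the F statistic collapses, even though the other fourteen observations are unchanged.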

The two common approaches to dealing with outliers are truncation (or trimming) and winsorizing. The truncated (trimmed) mean is calculated after discarding a fixed number or percentage of the highest and lowest values. The winsorized mean is calculated after replacing those highest and lowest values with the most extreme values that remain. Both use more information from the sample than the median does. However, unless the underlying distribution is symmetric, neither the truncated mean nor the winsorized mean of a sample is likely to be an unbiased estimator of the population mean.
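The two procedures can be sketched in a few lines of Python. The functions `trimmed_mean` and `winsorized_mean` and the sample values below are hypothetical, written only to illustrate the definitions; here `k` is the number of values handled at each end of the sorted data.

```python
from statistics import mean, median

def trimmed_mean(values, k):
    """Mean after discarding the k smallest and k largest values."""
    ordered = sorted(values)
    if 2 * k >= len(ordered):
        raise ValueError("k is too large for this sample")
    return mean(ordered[k:len(ordered) - k])

def winsorized_mean(values, k):
    """Mean after replacing the k smallest and k largest values
    with the nearest remaining values."""
    ordered = sorted(values)
    if 2 * k >= len(ordered):
        raise ValueError("k is too large for this sample")
    core = ordered[k:len(ordered) - k]
    return mean([core[0]] * k + core + [core[-1]] * k)

# Hypothetical sample with one obvious outlier (48).
sample = [2, 4, 5, 5, 6, 6, 7, 8, 9, 48]

print(f"mean            = {mean(sample)}")
print(f"median          = {median(sample)}")
print(f"trimmed mean    = {trimmed_mean(sample, 1)}")
print(f"winsorized mean = {winsorized_mean(sample, 1)}")
```

For this sample the plain mean is pulled up to 10 by the single outlier, while the trimmed mean (6.25) and the winsorized mean (6.3) stay close to the median of 6.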

The arithmetic mean can also be used to model concepts outside of statistics. In a physical sense, if each data point is treated as an equal mass placed on a number line, the arithmetic mean is the center of gravity. The standard deviation can then be thought of as the typical distance of the data points from the mean (more precisely, the root-mean-square distance). The square of the standard deviation (i.e. the variance) is analogous to the moment of inertia in the physical model.
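The physical analogy can be checked numerically. The sketch below treats each value in a small hypothetical dataset as an equal mass on a number line. It verifies that the deviations balance around the mean, and that the mean squared distance about any other point `c` equals the variance plus the squared shift, just as the parallel axis theorem states for moments of inertia. The population variance (dividing by n) is used here, which is an assumption of this illustration.

```python
from statistics import mean, pvariance

# Hypothetical data points, each treated as an equal mass on a number line.
points = [2, 4, 4, 6, 9]

center = mean(points)        # center of gravity
variance = pvariance(points) # mean squared distance from the center

# The deviations on either side of the mean cancel out: the balance point.
net_deviation = sum(x - center for x in points)

def second_moment(values, c):
    """Mean squared distance of the values from an arbitrary point c."""
    return sum((x - c) ** 2 for x in values) / len(values)

c = 7  # any other pivot point
about_c = second_moment(points, c)

print(f"center of gravity (mean) = {center}")
print(f"net deviation about mean = {net_deviation}")
print(f"variance                 = {variance}")
print(f"second moment about {c}    = {about_c}")
print(f"variance + shift squared = {variance + (c - center) ** 2}")
```

Because the shift term is never negative, the second moment is smallest when the pivot is the mean itself, which is one way to see why the mean is the natural center of the data.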

This seemingly simple calculation is in fact a relatively complex concept, yet it is most often taught as an “add-them-up-and-then-divide” mathematical procedure rather than as a statistically representative concept. The conceptual knowledge related to the arithmetic mean is not purely mathematical; rather, it is a combination of conceptual knowledge of mathematics and conceptual knowledge of statistics.

Statistics provides mathematics with a basis of contextually rich real-world problems that can be used to contextualize the mathematics. Conversely, mathematics is a tool utilized by statistics to quantify statistical concepts. An understanding of their inter-disciplinary knowledge connections may advance the symbiosis between mathematics and statistics.

As said above, the arithmetic mean can be used to answer a lot of questions, but if used without caution it can also be meaningless. Whether a mean makes sense can only be judged by comparing it with the real-life scenario it is supposed to describe. In general, statistics and its applications must be used with caution.