An insight into Descriptive Statistics (Part I : Measures of Central Tendency)
By ARKAJIT BANERJEE
Dear reader, if you are reading this article, I am sure you want to learn what exactly Descriptive Statistics is and what Measures of Central Tendency are. Fine!! Then this detailed article is just for you. I request you to go through the full article with patience, and when you reach the last line, I hope you will have your answers.
Before getting to know what Statistics means or deals with, let us understand what data is. Data is information that has been collected or generated with regard to a particular topic or field of enquiry. This data may contain redundant material, or may lack information relevant to the field of enquiry. If the data is unorganized or in some way ambiguous, we may never bring out the true story the data wants to convey, however perfect the techniques or tools we use for analysis. Thus, the first and foremost concern before we step into the world of analytics is to prepare the data in an organized manner, have confidence in the reliability of the data, and have clear and unambiguous definitions of terms. Having understood what data means, let us now introduce Statistics into our discussion.
The definition by Croxton and Cowden is one of the most comprehensive, scientific and logical ways to understand what Statistics actually is: “Statistics may be defined as the science of collection, presentation, analysis and interpretation of numerical data”.
Statistics, in my words, is a collection of analytical and interpretive methods that has its own techniques and takes the help of Mathematics and Probability Theory to derive special features of the data (commonly known as Descriptive Statistics) and to draw conclusions or inferences from the data (commonly known as Inferential Statistics).
In this article, I explain in detail what Descriptive Statistics actually is, along with its first group of measures, the Measures of Central Tendency. The other measures will be discussed in detail, one by one, in my upcoming articles.
Descriptive Statistics: As the name suggests, this branch of Statistics is focused on describing the data, that is, summarizing the information contained in the data and highlighting certain special features it has. Quantitative data possess characteristic features that are described by Measures of Central Tendency, Measures of Dispersion, Measures of Skewness and Measures of Kurtosis.
Central Tendency, as I think of it, is the behavior of the data to tend towards a particular value of the distribution, in most cases a central value. In other words, the tendency of the data to concentrate or accumulate around a central value is known as Central Tendency. A measure of it is typically called an average.
The three most predominantly used averages are the following measures of Central Tendency:
1) Mean or Simple Arithmetic Mean or Arithmetic Mean (AM) or Average.
2) Median or Positional Average.
3) Mode.
The other two measures, less frequently used, are Geometric Mean (GM) and Harmonic Mean (HM).
Mean: The Arithmetic Mean (AM), or simply the mean, of a sample data set is the sum of all the individual observations of the data set divided by the total number of observations.
The mathematical form of the sample mean, for observations x1, x2, …, xn, is given by:
AM = (x1 + x2 + … + xn) / n
Example: Let us take a simple data set: 1,3,4,5
If we calculate the mean, it comes out to be (1+3+4+5)/4 = 13/4 = 3.25
What does this mean? It means that the data tend to concentrate around a value between 3 and 4.
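The same calculation can be sketched in Python. This is only an illustration: the variable names are my own choice, and the standard-library function `statistics.mean` is used purely as a cross-check.

```python
import statistics

# The small dataset from the example above
data = [1, 3, 4, 5]

# Arithmetic mean: sum of the observations divided by their count
am = sum(data) / len(data)

# Cross-check against the standard library
assert am == statistics.mean(data)

print(am)  # 3.25
```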
Median or Positional Average: Frankly speaking, the median is nothing but the middlemost value of the data set or the distribution (provided the observations in the data set are arranged in ascending or descending order). In other words, the median is a quantile of order 1/2, such that half of the values are less than or equal to it and half of the values are greater than or equal to it. (A quantile of order p is a value of the variable such that a proportion p of the total number of values are less than or equal to it and the remaining proportion (1-p) are greater than or equal to it.) So if there are n observations and n is odd, median = ((n+1)/2)th observation, and if n is even, median = arithmetic mean (AM) of the (n/2)th value and the ((n/2)+1)th value.
Example: Let us take the previous dataset, as it is already arranged in ascending order. If your dataset is not arranged, I strongly urge you to order it before computing the median. So, our dataset was: 1,3,4,5. There are four observations here, so n (no. of observations) = 4. The median is therefore the AM of the (4/2)th and ((4/2)+1)th observations, that is, the AM of the 2nd and the 3rd observations: (3+4)/2 = 7/2 = 3.5
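The odd/even rule above can be written as a small Python function. This is a minimal sketch: the function name `median_by_position` is hypothetical (my own choice), and it sorts the data first, as urged above.

```python
def median_by_position(values):
    """Median via the ((n+1)/2)th rule for odd n and the two-middle-values rule for even n."""
    ordered = sorted(values)  # always order the data first
    n = len(ordered)
    if n == 0:
        raise ValueError("median of an empty dataset is undefined")
    if n % 2 == 1:
        # odd n: the ((n+1)/2)th observation (1-based) sits at index (n-1)//2 (0-based)
        return ordered[(n - 1) // 2]
    # even n: AM of the (n/2)th and ((n/2)+1)th observations (1-based)
    return (ordered[n // 2 - 1] + ordered[n // 2]) / 2

data = [1, 3, 4, 5]
print(median_by_position(data))  # 3.5
```

The standard-library function `statistics.median` should give the same result if you prefer not to write the rule by hand.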
Suppose we add another observation to the data set, say 100, so that n becomes 5. Our new dataset is 1,3,4,5,100, and the middlemost value is 4, so the median becomes 4.
Now here is a very important thing which I want you to feel. Four out of the five observations are below 10, and suddenly the last observation is 100. Pause your reading here and calculate the mean. How much is it? It comes out to be (1+3+4+5+100)/5 = 113/5 = 22.6. However, if you look closely at the dataset, you will see that there is no value anywhere in the range of 20 to 30. Due to the presence of 100, the mean has shifted to a value far from every observation in the dataset. The mean is telling you that the data is concentrated around 22.6, while the majority of the numbers in the dataset are single digits!!! So the presence of an outlier value like 100 gives us an anomaly with the mean.
Now what actually is an outlier? An outlier is a value that lies at an abnormally large distance from the other values in the dataset. In our case, four of the values are single-digit numbers and then I suddenly threw a large three-digit number into the dataset, so 100 is an outlier. When we deal with outliers, the simple mean becomes unreliable: since the mean takes into consideration all the values in the dataset, one abnormal value makes a huge impact on the result. Thus we need to counter outliers effectively, and below are three simple ways to do so.
METHOD 1: Arrange the dataset, calculate the MEDIAN, and present the middlemost value as the Central Tendency of the distribution.
METHOD 2: Eliminate the extreme values. To be specific, discard all values lower than the first quartile (Q1) and all values higher than the third quartile (Q3), then calculate the Arithmetic Mean on the modified dataset, from which the outliers have now been removed. This method is called the TRIMMED MEAN (with the cut-offs at the quartiles it is also known as the interquartile mean).
METHOD 3: Replace each value lower than the first quartile with the first quartile itself, replace each value higher than the third quartile with the third quartile itself, and then calculate the AM on this modified dataset. This method is known as the WINSORIZED MEAN. Here too you have created a modified, outlier-free dataset.
Hurray!! You have got your Central Tendency value despite having an outlier.
But I personally recommend Method 2 or 3 when you deal with large datasets, because the mean is generally the average least affected by sampling fluctuations, although in some cases the median can be superior.
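To see the three methods side by side on the dataset 1,3,4,5,100, here is a rough Python sketch. One loud assumption: quartiles can be computed by several conventions, and I have picked the "inclusive" rule of the standard-library function `statistics.quantiles`; a different rule can give different quartiles and therefore different trimmed and winsorized means. The variable names are my own.

```python
import statistics

data = [1, 3, 4, 5, 100]

plain_mean = statistics.mean(data)   # pulled upwards by the outlier 100
median = statistics.median(data)     # Method 1

# Quartiles Q1 and Q3; the "inclusive" interpolation rule is an assumption of this sketch
q1, _, q3 = statistics.quantiles(data, n=4, method="inclusive")

# Method 2: discard everything below Q1 or above Q3, then take the AM
trimmed = [x for x in data if q1 <= x <= q3]
trimmed_mean = statistics.mean(trimmed)

# Method 3: raise values below Q1 to Q1 and lower values above Q3 to Q3, then take the AM
winsorized = [min(max(x, q1), q3) for x in data]
winsorized_mean = statistics.mean(winsorized)

print(plain_mean, median, trimmed_mean, winsorized_mean)
```

Under this quartile rule, Q1 = 3 and Q3 = 5 for our dataset, so all three methods give 4, while the plain mean stays at 22.6.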
Mode: The simplest of all the measures of Central Tendency, the mode is defined as the value of the variable having the highest frequency. In layman's terms, it is the value that is repeated the maximum number of times, or, if you like, the value that appears to be the most predominant one in the dataset.
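A quick Python sketch of finding the mode by counting frequencies follows. The dataset here is hypothetical, made up only for this illustration, and the variable names are my own. Note that a dataset can have more than one value with the highest frequency, so the sketch collects all of them.

```python
from collections import Counter
import statistics

# Hypothetical dataset, made up for this illustration
data = [2, 3, 3, 5, 7, 3, 5]

# Count how often each value occurs
counts = Counter(data)
highest = max(counts.values())

# All values that reach the highest frequency (there can be more than one)
modes = [value for value, freq in counts.items() if freq == highest]

# Cross-check against the standard library
assert modes == statistics.multimode(data)

print(modes)  # [3]
```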
A simple but extremely useful empirical relationship, which holds approximately for moderately skewed distributions with a single mode, is
Mode = 3 Median - 2 Mean
Geometric mean or GM: Speaking in mathematical terms, the GM of a dataset of positive observations is simply the nth root of the product of all the observations in the dataset, or the (1/n)th power of that product. We generally use GM when we are interested in rates of growth, such as the growth of a population, be it the human population or the coronavirus population!!!
NOTE: x1, x2, …, xn mean x suffix 1, x suffix 2, …, x suffix n respectively.
The mathematical representation of GM, if the dataset consists of observations x1, x2, …, xn, is
GM = (x1 × x2 × … × xn)^(1/n)
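As a rough illustration, the GM formula can be applied in Python to the small dataset 1,3,4,5 used earlier in this article. The variable names are my own, and `statistics.geometric_mean` from the standard library serves only as a cross-check.

```python
import math
import statistics

data = [1, 3, 4, 5]  # all observations must be positive
n = len(data)

# GM as the (1/n)th power of the product of all observations
gm = math.prod(data) ** (1 / n)

# The standard library computes the same quantity
assert math.isclose(gm, statistics.geometric_mean(data))

print(round(gm, 3))  # 2.783
```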
Harmonic mean or HM: One of the less commonly used averages, the HM of a number of observations x1, x2, …, xn (provided none of these individual observations is zero) is equal to the reciprocal of the arithmetic mean of the reciprocals of the given values. I know reading this HM definition once will not make sense to many. Don't worry guys, I will not let you go from my article without making you understand what I wanted to say. Now listen carefully to what I say.
Take the dataset with observations x1, x2, …, xn. Compute the AM of this dataset and name it, say, AM1. You should get this expression:
AM1 = (x1 + x2 + … + xn) / n
Now take the reciprocal of each observation, so xi becomes 1/xi. Make a separate dataset with these reciprocal observations, that is, 1/x1, 1/x2, …, 1/xn. Calculate the AM of this new dataset. What does it come out to be? Just replace each xi with 1/xi in the expression for AM1, and name the result AM2:
AM2 = (1/x1 + 1/x2 + … + 1/xn) / n
Almost done, patience! Just take the reciprocal of AM2:
HM = 1 / AM2 = n / (1/x1 + 1/x2 + … + 1/xn)
You have successfully got the complicated HM with the help of this easy step-by-step explanation. Now go back to the definition, read it slowly at least 3 times, and you will know what HM means. Got it??
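The steps above can be followed literally in Python. This is a minimal sketch on the dataset 1,3,4,5 from earlier in the article; the variable names `am2` and `hm` are my own, mirroring AM2 and HM.

```python
import statistics

data = [1, 3, 4, 5]  # no observation may be zero
n = len(data)

# Step 1: take the reciprocal of each observation
reciprocals = [1 / x for x in data]

# Step 2: AM2 is the arithmetic mean of the reciprocals
am2 = sum(reciprocals) / n

# Step 3: HM is the reciprocal of AM2
hm = 1 / am2

print(round(hm, 3))  # 2.243
```

The standard-library function `statistics.harmonic_mean` should agree with this value up to floating-point rounding.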
And this is how we come to the end of our discussion on the Measures of Central Tendency in Descriptive Statistics. This is my first article on Medium, guys. If you have come up to this line, then I am sure you have understood Descriptive Statistics with a focus on Measures of Central Tendency. If you have, just a small request: comment what you feel after reading this article, and applaud if it helped you. It will motivate me too. I have tried to explain things clearly; however, comments and constructive criticism are always welcome. So this is all for today. See you in the next article very soon!! Till then, bye and have a good day!!