Statistics for Data Science using Python – Part I

Statistics is the study of collecting, organizing, analyzing, interpreting, and presenting data.

Data is a collection of facts, such as numbers, words, measurements, and observations. Data can be organised in the following ways:

  • Classification
  • Tabulation
  • Graphical Presentation (Histogram, Frequency Polygon, Frequency Curve etc)
  • Diagrammatical Presentation (Bar diagram, Pie diagram etc)

Types of Data

  • Categorical (Qualitative) – Quality
    • Nominal: Nominal values represent discrete units and are used to label variables that have no quantitative value.
      • Eg: Color, Gender, Occupation, Computer Brands
    • Ordinal: Ordinal values represent discrete and ordered units.
      • Eg: Education with Order, like Undergraduate, Graduate, Post Graduate
  • Numerical (Quantitative) – Quantity
    • Discrete: It is a count that can’t be made more precise.
      • Eg: Number of children in a family (3 children)
    • Continuous: It can be divided and reduced to finer and finer levels.
      • Eg: Height of the children in centimeters or inches
      • Eg: Weight of the children
      • Eg: Temperature in a room
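
As a rough illustration of the four data types, the sketch below uses only the standard library. The `Education` enum and the sample values are hypothetical, chosen to mirror the examples in the list above.

```python
from enum import IntEnum

# Nominal: labels with no inherent order; only equality is meaningful.
colors = ["red", "blue", "green", "blue"]
distinct_colors = sorted(set(colors))  # sorted only to get a stable display

# Ordinal: discrete labels with a meaningful order (hypothetical encoding).
class Education(IntEnum):
    UNDERGRADUATE = 1
    GRADUATE = 2
    POST_GRADUATE = 3

levels = [Education.GRADUATE, Education.UNDERGRADUATE, Education.POST_GRADUATE]
highest = max(levels)  # comparison is valid because the levels are ordered

# Discrete: a count. Continuous: a measurement that can always be refined.
children_in_family = 3   # int
height_cm = 142.75       # float

print(distinct_colors)   # ['blue', 'green', 'red']
print(highest.name)      # POST_GRADUATE
print(type(children_in_family).__name__, type(height_cm).__name__)  # int float
```

Asking for the "maximum" colour would make no sense, which is the practical difference between nominal and ordinal data.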

Population is the set of sources from which data has to be collected.
Sample is a subset of the population.
Variable is any characteristic, number, or quantity that can be measured or counted.
Parameter is the measure of some characteristic of the population.
Distribution is a function that shows the possible values for a variable and how often they occur.
Outliers: An outlier is an observation that lies an abnormal distance from other values in a random sample from a population. Outliers can have many causes, such as input errors, data corruption, or genuinely extreme observations.
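
To make the terms population, sample, parameter, and distribution concrete, here is a minimal sketch using the standard library. The population values are made up for illustration, and the random draw is seeded only so that it is repeatable.

```python
import random
from collections import Counter

# Hypothetical population: number of children in each of 12 families.
population = [0, 1, 1, 2, 2, 2, 3, 3, 4, 2, 1, 5]

# Parameter: a measure computed on the whole population.
population_mean = sum(population) / len(population)

# Sample: a subset of the population (seeded so the draw is repeatable).
rng = random.Random(0)
sample = rng.sample(population, 5)
sample_mean = sum(sample) / len(sample)  # an estimate of the parameter

# Distribution: each possible value of the variable and how often it occurs.
distribution = Counter(population)

print(population_mean)
print(sample, sample_mean)
print(sorted(distribution.items()))
```

The sample mean will usually differ a little from the population mean; that gap is what inferential statistics tries to quantify.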

Categories In Statistics

  • Descriptive Statistics
    • Measures Of The Central Tendency or Averages
      • Mean
      • Median
      • Mode
    • Measures Of The Dispersion
      • Range
      • Percentiles, Quartiles, Inter Quartile Range (IQR)
      • Variance
      • Standard Deviation
    • Measures to Describe Shape of Distribution
      • Skewness
      • Kurtosis
  • Inferential Statistics
    • Estimation
    • Hypothesis Testing

Calculation of Mean, Median and Mode
Several Python libraries, such as pandas, numpy, and statistics, support mean and median calculation. The mode can be calculated using scipy, pandas, or statistics (numpy itself has no mode function).

Measures of Central Tendency give us a rough idea of where the data points are centred.

Mean: The mean (or average) of a number of observations is the sum of the values of all the observations divided by the total number of observations.

Median: The median is the value that divides the given observations into exactly two equal parts. When the data is arranged in ascending (or descending) order, the median of ungrouped data is calculated as follows.

  • When the number of observations (n) is odd, the median is the value of the ((n+1)/2) th observation
  • When the number of observations (n) is even, the median is the mean of the (n/2) th and (n/2 + 1) th observations.
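
The two rules above can be sketched as a small function. `median_by_position` is a hypothetical helper written for illustration; in practice `statistics.median` does the same job.

```python
import statistics

def median_by_position(values):
    """Median of ungrouped data using the odd/even position rules."""
    ordered = sorted(values)
    n = len(ordered)
    if n == 0:
        raise ValueError("median of empty data is undefined")
    if n % 2 == 1:
        # ((n + 1) / 2)-th observation (1-based) is index (n - 1) // 2
        return ordered[(n - 1) // 2]
    # mean of the (n/2)-th and (n/2 + 1)-th observations (1-based)
    return (ordered[n // 2 - 1] + ordered[n // 2]) / 2

print(median_by_position([5, 3, 1, 4, 2]))  # 3
print(median_by_position([4, 1, 3, 2]))     # 2.5
```

Note that the data is sorted first; the position rules only apply to ordered observations.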

Mode: The mode is that value of the observation which occurs most frequently, i.e., an observation with the maximum frequency is called the mode.
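
As a rough sketch of what "maximum frequency" means, the mode can be found by counting occurrences. This uses `collections.Counter` from the standard library; `statistics.multimode` (Python 3.8+) is shown for the case where several values tie.

```python
from collections import Counter
import statistics

data = [1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 3, 4, 5, 6, 7, 8]

# Count how often each value occurs and pick the most frequent one.
counts = Counter(data)
value, frequency = counts.most_common(1)[0]
print(value, frequency)  # 2 8

# When several values share the maximum frequency, multimode returns all of them.
tied_modes = statistics.multimode([1, 1, 2, 2, 3])
print(tied_modes)  # [1, 2]
```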

Using Statistics

# Mean, Median and Mode

>>> import statistics as st
>>> st.mean([1,2,3,4,5,6,7,8,9,10])
5.5
>>> st.median([1,2,3,4,5,6,7,8,9,10])
5.5
>>> st.median([1,2,3,4,5])
3
>>> st.median([1,2,3,4])
2.5
>>> st.mode([1,1,1,1,2,2,2,2,2,2,2,2,3,4,5,6,7,8])
2

Using Numpy & Scipy

# Mean, Median and Mode
>>> import numpy as np
>>> lst = [2,4,6,8,10]
>>> mean = np.mean(lst)
>>> print(mean)
6.0
>>> median = np.median(lst)
>>> print(median)
6.0
>>> from scipy import stats
>>> mode = stats.mode([10,20,30,30,30,30,30,10,23,14])
>>> print(mode)
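ModeResult(mode=array([30]), count=array([5]))
>>> # The exact output format depends on the SciPy version; newer releases return scalars (mode 30, count 5) instead of arrays.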

Calculation of Range, Percentiles, Quartiles, IQR, Variance and Standard Deviation

Measures of Dispersion: Dispersion (or scatter) describes how widely the observations are spread, usually around a measure of central tendency such as the mean or median.

Range: It is the difference between the maximum and minimum values in the sample.

Percentile: A percentile is a measure used in statistics indicating the value below which a given percentage of observations in a group of observations falls. For example, the 20th percentile is the value below which 20% of the observations may be found. The 25th percentile is known as the first quartile (Q1), the 50th percentile as the median or second quartile (Q2), and the 75th percentile as the third quartile (Q3).

Quartiles: One of the four divisions of observations which have been grouped into four equal-sized sets based on their statistical rank. The first quartile (Q1) is defined as the middle number between the smallest number and the median of the data set. The second quartile (Q2) is the median of the data. The third quartile (Q3) is the middle value between the median and the highest value of the data set.

IQR (Q3 – Q1): It is the difference between the third quartile and the first quartile.
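
There are several conventions for computing percentiles from a small sample. The sketch below uses one simple rule: take the observation at position floor(p/100 × (n − 1)) in the sorted data, without interpolating. `percentile_lower` is a hypothetical helper, intended to mirror the 'lower' option offered by NumPy. The outlier check at the end uses the common 1.5 × IQR convention, which is an assumption added here rather than something defined above.

```python
import math

def percentile_lower(values, p):
    """p-th percentile via the 'lower' rule: no interpolation, round the position down."""
    if not values:
        raise ValueError("percentile of empty data is undefined")
    if not 0 <= p <= 100:
        raise ValueError("p must be between 0 and 100")
    ordered = sorted(values)
    index = math.floor(p * (len(ordered) - 1) / 100)
    return ordered[index]

data = [8, 9, 12, 14.5, 15.5, 17, 18]
q1 = percentile_lower(data, 25)
q2 = percentile_lower(data, 50)   # the median
q3 = percentile_lower(data, 75)
iqr = q3 - q1
print(q1, q2, q3, iqr)  # 9 14.5 15.5 6.5

# Assumed convention: values beyond 1.5 * IQR from the quartiles are possible outliers.
low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
suspects = [x for x in data + [40] if x < low or x > high]
print(suspects)  # [40]
```

Other percentile conventions (for example, linear interpolation) can give slightly different quartiles for the same data, so it helps to state which rule was used.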

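Variance: It is the average of the squared differences from the mean. Dividing the sum of squared differences by n gives the population variance; dividing by n - 1 gives the sample variance.

Standard Deviation: It is the square root of the variance.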
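
The definitions above can be worked through by hand. This is a minimal sketch that computes both the population and the sample variance and compares them with the `statistics` module, where `pvariance` divides by n and `variance` divides by n - 1.

```python
import math
import statistics

data = [1.5, 2.5, 3.5, 4.5, 3.25, 5.25]
n = len(data)
mean = sum(data) / n
squared_deviations = sum((x - mean) ** 2 for x in data)

population_variance = squared_deviations / n       # divide by n
sample_variance = squared_deviations / (n - 1)     # divide by n - 1
sample_std = math.sqrt(sample_variance)

print(round(population_variance, 6), round(sample_variance, 6))  # 1.513889 1.816667
print(round(sample_std, 6))                                      # 1.347838

# The standard library agrees with the hand computation.
print(math.isclose(population_variance, statistics.pvariance(data)))  # True
print(math.isclose(sample_variance, statistics.variance(data)))       # True
```

This distinction matters when comparing libraries: `statistics.variance` and `statistics.stdev` use the sample formula, while `np.var` and `np.std` use the population formula unless `ddof=1` is passed.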

Using Statistics

>>> import statistics as st
>>> st.stdev([1.5, 2.5, 3.5, 4.5, 3.25, 5.25])
1.347837774610382
>>> st.variance([1.5, 2.5, 3.5, 4.5, 3.25, 5.25])
1.8166666666666667

Using Numpy & Scipy

# Range
>>> import numpy as np
>>> lst = [10,14,11,7,9.5,15,19]
>>> a = np.amin(lst)
>>> print(a)
7.0
>>> b = np.amax(lst)
>>> print(b)
19.0
>>> data_range = b - a   # named data_range to avoid shadowing the built-in range()
>>> print(data_range)
12.0

>>> # np.ptp ("peak to peak") computes max - min directly
>>> data_range = np.ptp(lst)
>>> print(data_range)
12.0

>>> A = np.array([[1,2,3,4],[5,6,7,8],[9,10,11,12]])
>>> print(A)
[[ 1  2  3  4]
 [ 5  6  7  8]
 [ 9 10 11 12]]
>>> range1 = np.ptp(A, axis = 0)   # range of each column
>>> print(range1)
[8 8 8 8]

>>> range2 = np.ptp(A, axis = 1)   # range of each row
>>> print(range2)
[3 3 3]

# Percentile & Quartiles

>>> data = [8,9,12,14.5,15.5,17,18]
>>> A = np.array([data])
>>> print(A)
[[ 8.   9.  12.  14.5 15.5 17.  18. ]]
>>> # NumPy 1.22+ names this argument method=; older releases used interpolation=
>>> a = np.percentile(A,27, axis = 0, method ='lower')
>>> print(a)
[ 8.   9.  12.  14.5 15.5 17.  18. ]
>>> a = np.percentile(A,27, axis = 1, method ='lower')
>>> print(a)
[9.]
>>> a = np.percentile(A,50, axis = 1, method ='lower')
>>> print(a)
[14.5]
>>> a = np.percentile(A,75, axis = 1, method ='lower')
>>> print(a)
[15.5]
>>> a = np.percentile(A,100, axis = 1, method ='lower')
>>> print(a)
[18.]

# Inter Quartile Range (IQR)

>>> from scipy.stats import iqr
>>> aIQR = iqr(A, axis =1, rng=(25,75), interpolation='lower')
>>> print(aIQR)
[6.5]

# Variance (np.var divides by n by default, i.e. population variance; pass ddof=1 for the sample variance)

>>> aVar = np.var(A, axis = 1)
>>> print(aVar)
[12.8877551]

# Standard Deviation (np.std also uses ddof=0 by default)

>>> aStd = np.std(A, axis = 1)
>>> print(aStd)
[3.58995196]

Calculation of Skewness

Skewness is the degree of asymmetry of a distribution compared with a symmetrical bell curve (normal distribution) in a sample set of data. If the curve is shifted to the left or to the right, it is said to be skewed. Skewness can be positive (right skewed, with a longer right tail), negative (left skewed, with a longer left tail), or zero (unskewed).

>>> import numpy as np
>>> import scipy.stats
>>> data = np.array([[10,12,13,14,18,9.4,19],[8, 9, 16, 12, 13,14.5,15],[10.4,11, 12, 17,15, 7,3.4],[14,11,5,10,10.5,8,11]])
>>> print(data)
[[10.  12.  13.  14.  18.   9.4 19. ]
 [ 8.   9.  16.  12.  13.  14.5 15. ]
 [10.4 11.  12.  17.  15.   7.   3.4]
 [14.  11.   5.  10.  10.5  8.  11. ]]
>>> dataT = data.transpose()
>>> print(dataT)
[[10.   8.  10.4 14. ]
 [12.   9.  11.  11. ]
 [13.  16.  12.   5. ]
 [14.  12.  17.  10. ]
 [18.  13.  15.  10.5]
 [ 9.4 14.5  7.   8. ]
 [19.  15.   3.4 11. ]]
>>> skewData = scipy.stats.skew(dataT,axis=0)
>>> print(skewData)
[ 0.39272032 -0.43192995 -0.29031462 -0.46285376]
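
To see where such a number comes from, here is a minimal sketch of the moment-based skewness formula, m3 / m2^1.5, where m2 and m3 are the second and third central moments divided by n. This is understood to be what SciPy computes by default (without bias correction); the `skewness` helper itself is hypothetical.

```python
def skewness(values):
    """Moment-based skewness: m3 / m2**1.5, with central moments divided by n."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((x - mean) ** 2 for x in values) / n
    m3 = sum((x - mean) ** 3 for x in values) / n
    return m3 / m2 ** 1.5  # undefined (division by zero) if all values are equal

first_row = [10, 12, 13, 14, 18, 9.4, 19]   # first row of the data above
print(round(skewness(first_row), 4))        # 0.3927

print(skewness([1, 2, 3, 4, 5]))            # 0.0 (symmetric data)
print(skewness([1, 1, 1, 2, 10]) > 0)       # True (long right tail)
```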

Calculation of Kurtosis

Kurtosis refers to the peakedness or flatness of a frequency distribution curve when compared with the normal distribution curve.

  • If the distribution is more peaked than normal, then it is said to be Leptokurtic.
  • If the distribution is more flat than the normal distribution, then it is known as Platykurtic distribution.
  • A normal curve is known as Mesokurtic.

>>> import numpy as np
>>> import scipy.stats
>>> data = np.array([[10,12,13,14,18,9.4,19],[8, 9, 16, 12, 13,14.5,15],[10.4,11, 12, 17,15, 7,3.4],[14,11,5,10,10.5,8,11]])
>>> print(data)
[[10.  12.  13.  14.  18.   9.4 19. ]
 [ 8.   9.  16.  12.  13.  14.5 15. ]
 [10.4 11.  12.  17.  15.   7.   3.4]
 [14.  11.   5.  10.  10.5  8.  11. ]]
>>> dataT = data.transpose()
>>> print(dataT)
[[10.   8.  10.4 14. ]
 [12.   9.  11.  11. ]
 [13.  16.  12.   5. ]
 [14.  12.  17.  10. ]
 [18.  13.  15.  10.5]
 [ 9.4 14.5  7.   8. ]
 [19.  15.   3.4 11. ]]

# Pearson kurtosis (a normal distribution has a Pearson kurtosis of 3)
>>> kurtosisPearsonData = scipy.stats.kurtosis(dataT, axis =0, fisher = False)
>>> print(kurtosisPearsonData)
[1.75925913 1.73934746 2.17962191 2.76773719]

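# Fisher's kurtosis (excess kurtosis = Pearson kurtosis - 3; fisher=True is the default)
# Note: axis=1 computes kurtosis across each row of dataT (7 results), unlike axis=0 above.
>>> kurtosisFishersData = scipy.stats.kurtosis(dataT, axis =1)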
>>> print(kurtosisFishersData)
[-0.95156695 -0.90304709 -0.91692308 -1.25548083 -1.24277613 -0.89479212
 -1.17341967]
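
The relationship between the two definitions can be sketched by hand. Pearson kurtosis is m4 / m2^2 (central moments divided by n), and Fisher's (excess) kurtosis subtracts 3 so that a normal curve scores 0. The helper names below are hypothetical, and the classification by sign follows the three terms listed earlier.

```python
def pearson_kurtosis(values):
    """Pearson kurtosis: m4 / m2**2, with central moments divided by n."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((x - mean) ** 2 for x in values) / n
    m4 = sum((x - mean) ** 4 for x in values) / n
    return m4 / m2 ** 2

def describe_shape(excess, tolerance=1e-9):
    """Label a distribution by the sign of its excess (Fisher) kurtosis."""
    if excess > tolerance:
        return "leptokurtic"
    if excess < -tolerance:
        return "platykurtic"
    return "mesokurtic"

first_row = [10, 12, 13, 14, 18, 9.4, 19]   # first row of the data above
pearson = pearson_kurtosis(first_row)
fisher = pearson - 3
print(round(pearson, 4), round(fisher, 4))  # 1.7593 -1.2407
print(describe_shape(fisher))               # platykurtic
```

With only seven observations the estimate is rough, so the label is best read as a description of this sample rather than of the underlying population.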

Note: You can run the above code snippets in the Python interpreter or in a Jupyter Notebook. Before running them, install the required libraries (numpy and scipy; statistics is part of the Python standard library).

References

  • https://docs.python.org/3/library/statistics.html
  • https://docs.scipy.org/doc/

Learn more about Inferential Statistics in the next blog article. In future blog posts, we will also publish detailed articles on data science libraries such as Numpy, Pandas, Matplotlib, Seaborn, Scipy, and Scikit-learn.

Happy Learning!