Statistics in Data Science [Part 1]

It is said that Statistics is the semantics of Data Science. So the first question that comes to mind is “What is Statistics?”

The next question is probably “What kind of statistics should I learn specifically for data science?” That is the wrong way to frame it. Instead, one should simply learn statistics, because it is the craft of cracking the mysteries hidden inside datasets.

Statistics is divided into two categories, namely:

1. Descriptive Statistics — The data is summarized through the given observations. It is a way to describe, organize and represent a collection of data using graphs, tables and summary measures.

  • Under Descriptive Statistics we have measures of central tendency (mean, median and mode), measures of dispersion (variance, standard deviation and interquartile range), skewness (strictly a measure of shape, though it is covered alongside dispersion in this post) and correlation.

2. Inferential Statistics — This goes beyond describing the data at hand. It uses information collected from a sample to make decisions, predictions or inferences about the population the sample was drawn from.

  • Hypothesis testing, z-test, t-test and confidence intervals are all found under inferential statistics.

If you are a novice data scientist (somewhat like me 😄) you are probably thinking, “I did not see any mention of regression, is that not part of statistics?” Linear regression, multiple linear regression and forecasting fall into a bracket known as statistical analysis or statistical modeling. These methods are seen more as standard analysis tools for inferential statistics than as a separate branch of Statistics.

This is going to be a series of posts delving into each of the categories or branches of Statistics. In the next part of this post I will walk you through how to carry out Descriptive Statistics in Python. Please note that you can use any programming language you are familiar with or interested in learning. Python is simply my preference.

I will use the CO2 Emissions by Vehicles dataset, which you can download here, to demonstrate the concepts and approaches I have learned. The dataset describes how a vehicle's CO2 emissions vary with its different features.

Step 1 — Importing and Processing Data Set

Import the necessary libraries (pandas, numpy, matplotlib and seaborn) and load the dataset.

  • The dataset is comma separated.
  • The .head() method from the pandas library displays the first five data entries by default. The .tail() method displays the last five.
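A minimal sketch of this step. Since the post does not name the downloaded file, the example reads a small in-memory CSV instead. The rows are made-up toy values and are not taken from the real dataset. Only the column names Cylinders, Fuel Consumption Comb(mpg) and CO2 Emissions(g/km) come from this post.

```python
import io

import pandas as pd

# Toy stand-in for the downloaded file; the values are invented for illustration.
csv_text = """Make,Cylinders,Fuel Consumption Comb(mpg),CO2 Emissions(g/km)
FORD,4,33,196
FORD,6,25,255
ACURA,4,30,214
BMW,8,18,359
FORD,6,22,290
ACURA,6,27,244
BMW,4,29,221
"""

# read_csv splits on commas by default, which matches a comma-separated file.
# With the real download you would pass its file path instead of a StringIO.
df = pd.read_csv(io.StringIO(csv_text))

first_rows = df.head()  # first five rows by default
last_rows = df.tail()   # last five rows by default

print(first_rows)
print(last_rows)
```

Both methods accept a row count, so df.head(10) would show the first ten entries.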

Step 2 — Exploring the Data Set

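The .shape attribute returns the number of rows and columns in the dataset as a (rows, columns) tuple. It is an attribute rather than a function, so it is used without parentheses.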

  • The dataset is made up of 7385 data entries and 12 columns or features.

The .info() method shows the data type of each column and how many non-null values each column contains.

  • The dataset is made up of integer, float and object values.
  • There are no null or missing values in the columns.
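The exploration step can be sketched roughly as follows. The small DataFrame is a hypothetical stand-in with invented values, so the shape it reports is not the 7385 by 12 of the real dataset. The explicit .isnull().sum() check is an addition of mine that makes the missing-value count directly visible.

```python
import pandas as pd

# Hypothetical toy frame; the values are invented for illustration.
df = pd.DataFrame({
    "Make": ["FORD", "ACURA", "FORD", "BMW"],
    "Cylinders": [4, 6, 4, 8],
    "CO2 Emissions(g/km)": [196.0, 255.0, 212.0, 359.0],
})

# .shape is an attribute holding (number of rows, number of columns).
n_rows, n_cols = df.shape
print(n_rows, n_cols)

# .info() prints each column's dtype and non-null count.
df.info()

# Count the missing values per column explicitly.
missing_per_column = df.isnull().sum()
print(missing_per_column)
```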

Step 3 — Descriptive Statistics

The pandas method .describe() is a convenient way to obtain summary statistics for the columns or features of our dataset.

i) Measures of Central Tendency

  • The summary statistics displayed are for the features that are numerical (float and integer data types).
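  • The mean is higher than the median (50th percentile) for every feature except Cylinders (number of cylinders). A mean above the median points to a right-skewed distribution, where a few large values pull the mean upward.
  • The maximum of a feature is always at least as large as its 75th percentile, so that fact alone does not imply outliers. What hints at outliers is a maximum that sits far above the 75th percentile, which is what the summary suggests for our dataset.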
  • Mode constitutes the most frequent value of a feature in the dataset and can be used with categorical features.
  • Ford is the most frequent make, Automatic with Select Shift is the most frequent transmission, and regular gasoline is the most frequent fuel type.
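The measures above can be computed along these lines. This is a sketch on a hypothetical toy frame with invented values, so the numbers do not match the real dataset.

```python
import pandas as pd

# Hypothetical toy frame; the values are invented for illustration.
df = pd.DataFrame({
    "Make": ["FORD", "FORD", "ACURA", "BMW", "FORD"],
    "Cylinders": [4, 6, 4, 8, 6],
    "CO2 Emissions(g/km)": [200.0, 250.0, 210.0, 400.0, 240.0],
})

# By default .describe() summarizes only the numerical columns.
summary = df.describe()
print(summary)

# Mean and median of a single feature.
mean_co2 = df["CO2 Emissions(g/km)"].mean()
median_co2 = df["CO2 Emissions(g/km)"].median()
print(mean_co2, median_co2)

# .mode() returns a Series because ties are possible; take the first entry.
top_make = df["Make"].mode()[0]
print(top_make)
```

In this toy example the single large value of 400 pulls the mean (260) above the median (240), which is the pattern described in the bullets above.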

ii) Measures of Dispersion

  • The standard deviations (from the .describe() summary) could be considered small, which means that the values of each feature lie close to its mean, on average.
  • The variance is simply the square of the standard deviation, so the same interpretation applies, although it is expressed in squared units.
  • To create the box plots (which help visualize the interquartile range), I saved the features as their own variables for convenience.
  • The Interquartile Range (IQR) is a measure of statistical dispersion, and is calculated as the difference between the upper quartile (75th percentile) and the lower quartile (25th percentile).
  • It is better visualized with box-plots as one will be able to identify outliers.
  • As can be seen on the box-plots, all features contain outliers. These data points differ significantly from the rest.
  • Skewness is a measure of the asymmetry of a random variable's distribution about its mean.
  • The value can be positive, negative, zero or undefined.
  • Cylinders and Fuel Consumption in Highway Roads exhibit a highly skewed distribution, whereas the rest of the features exhibit a moderately skewed distribution.
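The dispersion measures above can be sketched numerically as follows. The series holds invented toy values, not the real data. The 1.5 × IQR fences are the conventional rule that box plots use by default to flag outliers, and I am assuming that is the rule behind the box plots in this post.

```python
import pandas as pd

# Hypothetical toy values for one feature, with one deliberately extreme point.
co2 = pd.Series(
    [200.0, 210.0, 220.0, 230.0, 240.0, 250.0, 260.0, 500.0],
    name="CO2 Emissions(g/km)",
)

# Sample standard deviation and variance (pandas uses ddof=1 by default).
std = co2.std()
var = co2.var()

# Interquartile range: 75th percentile minus 25th percentile.
q1 = co2.quantile(0.25)
q3 = co2.quantile(0.75)
iqr = q3 - q1

# Points beyond 1.5 * IQR from the quartiles are flagged as outliers.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = co2[(co2 < lower_fence) | (co2 > upper_fence)]

# Positive skewness indicates a longer right tail.
skewness = co2.skew()

print(std, var, iqr)
print(outliers.tolist())
print(skewness)
```

Here the value 500 lies above the upper fence, so it is the only point flagged, and it also makes the skewness positive.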

iii) Correlation

  • Correlation between variables indicates that as one feature changes in value, the other feature tends to change in a specific direction.
  • A positive correlation coefficient means that when the value of one feature increases, the value of the other feature also tends to increase.
  • A negative correlation coefficient means that when the value of one feature increases, the value of the other feature tends to decrease.
  • There is a high positive correlation between the carbon dioxide emissions (CO2 Emissions(g/km)) and the rest of the features, except Fuel Consumption Comb(mpg), where the correlation is highly negative.
  • Fuel Consumption Comb(mpg) displays high negative correlation with all the other features.
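A correlation matrix like the one discussed above can be obtained roughly as follows. The frame is a hypothetical toy example with invented values. The sketch relies on the pandas default, the Pearson coefficient, which is my assumption about the coefficient used in this post.

```python
import pandas as pd

# Hypothetical toy frame; the values are invented for illustration.
df = pd.DataFrame({
    "Cylinders": [4, 4, 6, 6, 8],
    "Fuel Consumption Comb(mpg)": [33, 30, 25, 22, 18],
    "CO2 Emissions(g/km)": [196.0, 214.0, 255.0, 290.0, 359.0],
})

# Pairwise Pearson correlation between all numerical columns.
corr = df.corr()
print(corr)

# Pull out single coefficients by row and column label.
co2_vs_cylinders = corr.loc["CO2 Emissions(g/km)", "Cylinders"]
co2_vs_mpg = corr.loc["CO2 Emissions(g/km)", "Fuel Consumption Comb(mpg)"]
print(co2_vs_cylinders, co2_vs_mpg)
```

In the toy data, emissions rise with the number of cylinders and fall as miles per gallon rise, so the first coefficient is positive and the second is negative.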
