Statistics in Data Science [Part 1]

It is said that Statistics is the semantics of Data Science. So the first question that comes to mind is “What is Statistics?”

Statistics is a set of mathematical methods and tools that allow us to answer crucial questions about data.

The next thing that probably went through your head was “What kind of statistics should I learn specifically for data science?” I would argue that this is the wrong question. Learn statistics for its own sake, because it is the craft of cracking the mysteries hidden inside datasets.

Statistics is divided into two categories, namely:

1. Descriptive Statistics: This summarizes the observations in the data. It is a way to describe, organize and represent a collection of data using graphs, tables and summary measures.

2. Inferential Statistics: This goes beyond describing the data at hand. It uses information collected from a sample to make decisions, predictions or inferences about the population the sample was drawn from.

If you are a novice data scientist (somewhat like me 😄) you may be thinking, “I did not see any mention of regression, is that not part of statistics?” Linear regression, multiple linear regression and forecasting fall into a bracket known as statistical analysis or statistical modeling. These methods are seen more as standard analysis tools for inferential statistics than as a separate branch of Statistics.

This is going to be a series of posts delving into each of the categories or branches of Statistics. In the rest of this post I will walk you through how to carry out Descriptive Statistics in Python. You can use any programming language you are familiar with or interested in learning; Python is simply my preference.

I will use the CO2 Emissions by Vehicles dataset, which you can download here, to demonstrate the concepts and approaches I have learned. This dataset describes how the CO2 emissions of a vehicle vary with its different features.

Step 1 — Importing and Processing Data Set

Import the necessary libraries (pandas, numpy, matplotlib and seaborn) and load the dataset.
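Here is a minimal sketch of this step. To keep it runnable without the download, it reads a tiny CSV from an in-memory string; the column names and values are made up for illustration and are not taken from the real dataset. With the real file you would pass its path to pd.read_csv instead. The matplotlib and seaborn imports are shown as comments because nothing is plotted yet.

```python
import io

import numpy as np
import pandas as pd
# Plotting libraries, only needed once charts are drawn:
# import matplotlib.pyplot as plt
# import seaborn as sns

# Hypothetical stand-in for the downloaded CSV file. The column names and
# values are invented for illustration, not copied from the real dataset.
csv_text = """make,engine_size_l,cylinders,fuel_type,co2_g_per_km
A,2.0,4,X,196
B,2.4,4,X,221
C,1.5,4,Z,136
D,3.5,6,Z,255
E,5.0,8,Z,359
"""

# With a real download, pass the file path instead of the StringIO object.
df = pd.read_csv(io.StringIO(csv_text))

# Names of the numeric columns, which the later summary statistics will use.
numeric_cols = df.select_dtypes(include=np.number).columns.tolist()

print(df.head())
print(numeric_cols)
```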

Step 2 — Exploring the Data Set

.shape is an attribute (not a function, so it takes no parentheses) that returns the number of rows and columns in the dataset as a tuple.

.info() prints the data type of each column along with its count of non-null values, which shows whether the dataset contains missing values.
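A short sketch of both calls on a small, made-up DataFrame (the columns and values are assumptions for illustration, with one missing value added on purpose):

```python
import pandas as pd

# Made-up data for illustration; one engine size is deliberately missing.
df = pd.DataFrame({
    "make": ["A", "B", "C", "D"],
    "engine_size_l": [2.0, 2.4, None, 3.5],
    "co2_g_per_km": [196, 221, 136, 255],
})

# .shape is an attribute holding (number of rows, number of columns).
n_rows, n_cols = df.shape
print(n_rows, n_cols)

# .info() prints each column's dtype and non-null count.
df.info()

# An explicit count of missing values per column.
missing = df.isnull().sum()
print(missing)
```

Any column whose non-null count in the .info() output is lower than the number of rows has missing values; isnull().sum() reports the same thing as a count per column.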

Step 3 — Descriptive Statistics

The pandas method .describe() is a convenient way to obtain summary statistics for the columns (features) of our dataset.
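As a rough sketch on made-up data (the column names and values are assumptions, not the real dataset), .describe() summarizes the numeric columns by default and can also cover the non-numeric ones:

```python
import pandas as pd

# Made-up data for illustration only.
df = pd.DataFrame({
    "fuel_type": ["X", "X", "Z", "Z", "Z"],
    "cylinders": [4, 4, 4, 6, 8],
    "co2_g_per_km": [196, 221, 136, 255, 359],
})

# By default only numeric columns are summarized:
# count, mean, std, min, 25%, 50%, 75% and max.
summary = df.describe()
print(summary)

# include="all" adds the non-numeric columns (count, unique, top, freq).
print(df.describe(include="all"))
```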

i) Measures of Central Tendency
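The usual measures of central tendency are the mean, the median and the mode. A minimal sketch on a made-up column (the name and values are assumptions for illustration):

```python
import pandas as pd

# Made-up emission values for illustration only.
co2 = pd.Series([196, 221, 136, 255, 359, 221], name="co2_g_per_km")

mean_value = co2.mean()      # arithmetic average
median_value = co2.median()  # middle value of the sorted data
# .mode() returns a Series because several values can tie for most frequent.
mode_value = co2.mode().iloc[0]

print(mean_value, median_value, mode_value)
```

Comparing the mean with the median gives a quick hint about skew: a mean well above the median suggests a few large values are pulling the average up.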

ii) Measures of Dispersion
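Common measures of dispersion are the range, the variance, the standard deviation and the interquartile range (IQR). A small sketch on made-up values (an assumption for illustration, not the real dataset):

```python
import pandas as pd

# Made-up emission values for illustration only.
co2 = pd.Series([196, 221, 136, 255, 359], name="co2_g_per_km")

value_range = co2.max() - co2.min()            # spread between the extremes
variance = co2.var()                           # sample variance (ddof=1 by default)
std_dev = co2.std()                            # sample standard deviation
iqr = co2.quantile(0.75) - co2.quantile(0.25)  # spread of the middle 50%

print(value_range, variance, std_dev, iqr)
```

Note that pandas computes the sample variance and standard deviation (dividing by n - 1) by default; pass ddof=0 for the population versions.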

iii) Correlation
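Correlation measures how strongly two numeric features move together, on a scale from -1 to 1. A minimal sketch using the pandas .corr() method, which computes Pearson correlation by default; the columns and values are made up for illustration:

```python
import pandas as pd

# Made-up data for illustration only.
df = pd.DataFrame({
    "engine_size_l": [2.0, 2.4, 1.5, 3.5, 5.0],
    "cylinders": [4, 4, 4, 6, 8],
    "co2_g_per_km": [196, 221, 136, 255, 359],
})

# Pairwise Pearson correlation between all (numeric) columns.
corr = df.corr()
print(corr)

# Correlation of a single pair of features.
engine_vs_co2 = df["engine_size_l"].corr(df["co2_g_per_km"])
print(engine_vs_co2)
```

The resulting matrix can also be passed to seaborn's heatmap function to view the correlations as a colored grid.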

